The metrics system samples CPU, memory, disk, and network usage for the workspace server process and its channel plugin subprocesses. It is disabled by default, opt-in via config or a CLI flag, and tunable at runtime without a server restart.
Design goals
- Process-scoped first. Metrics are collected for the server process and its spawned channel plugins, not the whole machine. System-wide numbers are secondary and collapsible in the UI.
- Disabled by default. Zero overhead when not needed. Enable persistently via `hirocli setup --metrics` or ephemerally via `hirocli start --metrics`.
- Runtime tunable. Collection interval and enable/disable state can change without a restart via `POST /metrics/configure`.
- Non-blocking. All `psutil` calls run in `asyncio.to_thread` so they never block the event loop.
- Cross-platform. `psutil` exposes a single API on Windows, macOS, and Linux. No platform branching in the collector.
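The non-blocking rule can be sketched as follows. `blocking_sample` is a stand-in for a real `psutil` call and is not part of the codebase; the point is the `asyncio.to_thread` wrapping:

```python
import asyncio
import time

def blocking_sample() -> dict:
    # Stand-in for a blocking psutil call such as
    # psutil.cpu_percent(interval=0.1), which sleeps while it
    # measures the CPU delta.
    time.sleep(0.05)
    return {"cpu_percent": 0.0}

async def collect_once() -> dict:
    # asyncio.to_thread runs the blocking call on a worker thread,
    # so the event loop keeps servicing other coroutines meanwhile.
    return await asyncio.to_thread(blocking_sample)

sample = asyncio.run(collect_once())
```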
Data models
All models live in `hirocli/src/hirocli/services/metrics/models.py`.
MetricsSnapshot
Top-level snapshot returned by every REST endpoint and stored in the history ring buffer.
| Field | Type | Description |
|---|---|---|
| `timestamp` | `float` | Unix epoch seconds at collection time |
| `cpu` | `CpuMetrics` | System-wide CPU usage |
| `memory` | `MemoryMetrics` | System-wide memory usage |
| `process` | `ProcessMetrics \| None` | Main server process metrics |
| `children` | `list[ChildProcessMetrics]` | Per-channel-plugin metrics |
| `disk` | `DiskMetrics \| None` | Disk usage and I/O rates |
| `network` | `NetworkMetrics \| None` | Network throughput rates |
ProcessMetrics
| Field | Description |
|---|---|
| `pid` | OS process ID |
| `cpu_percent` | CPU usage for this process (all cores, 0–100 × num_cores) |
| `rss_bytes` | Resident set size — physical RAM in use |
| `vms_bytes` | Virtual memory size — includes shared libs, memory-mapped files |
| `num_threads` | Active thread count |
RSS is the real memory cost. VMS looks large (often 3–5× RSS) because it includes the Python interpreter’s shared-library mappings. Use RSS to assess actual memory pressure.
ChildProcessMetrics
Same fields as `ProcessMetrics` plus:
| Field | Description |
|---|---|
| `name` | Channel plugin name (e.g. `devices`) |
| `alive` | Whether the subprocess is still running |
DiskMetrics
| Field | Description |
|---|---|
| `total_bytes` | Partition total capacity |
| `used_bytes` | Bytes in use |
| `free_bytes` | Bytes available |
| `percent` | Usage percentage |
| `read_bytes_per_sec` | Rolling I/O read rate |
| `write_bytes_per_sec` | Rolling I/O write rate |
NetworkMetrics
| Field | Description |
|---|---|
| `bytes_sent_per_sec` | Outbound throughput |
| `bytes_recv_per_sec` | Inbound throughput |
| `packets_sent_per_sec` | Outbound packet rate |
| `packets_recv_per_sec` | Inbound packet rate |
Collector
`MetricsCollector` is an async service that runs as a coroutine alongside the HTTP server, admin UI, and other services inside the workspace server process.
Collection loop
On each tick (default every 2 seconds):
- Collect server process CPU, RSS, VMS, and thread count using a cached `psutil.Process` handle.
- Iterate child PIDs from `channel_manager.get_child_processes()`. Create or reuse a `psutil.Process` handle per plugin; mark `alive=False` for any that have exited.
- Collect system-wide CPU percent and memory (total, available, percent).
- Read disk usage from the root partition. Compute I/O rates by diffing the current `disk_io_counters()` against the previous sample and dividing by elapsed time.
- Compute network rates the same way from `net_io_counters()`.
- Build a `MetricsSnapshot` and append it to an in-memory deque ring buffer.
- Store the snapshot as `latest` for O(1) reads.
All `psutil` calls run inside `asyncio.to_thread` to avoid blocking the event loop.
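The counter-diffing step can be sketched like this. `RateTracker` is a hypothetical helper, not the collector's actual class; it turns monotonically increasing counters (as returned by `disk_io_counters()` or `net_io_counters()`) into per-second rates:

```python
import time

class RateTracker:
    """Diffs successive counter samples into per-second rates."""

    def __init__(self) -> None:
        self._last_time: float | None = None
        self._last: dict[str, int] = {}

    def update(self, counters: dict[str, int]) -> dict[str, float]:
        now = time.monotonic()
        rates: dict[str, float] = {}
        if self._last_time is not None:
            elapsed = now - self._last_time
            for key, value in counters.items():
                # Counters only grow, so the delta over elapsed time
                # is the average rate since the previous sample.
                delta = value - self._last.get(key, value)
                rates[key] = delta / elapsed if elapsed > 0 else 0.0
        self._last_time = now
        self._last = dict(counters)
        return rates

tracker = RateTracker()
tracker.update({"read_bytes": 0})   # first sample: no rates yet
time.sleep(0.1)
rates = tracker.update({"read_bytes": 1000})
```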
Child process injection
The collector receives child PIDs through a callback to avoid circular imports between `ChannelManager` and `MetricsCollector`:
```python
# in server_process.py
metrics_collector.set_child_pid_provider(channel_manager.get_child_processes)
```
`ChannelManager` stores its subprocesses as `dict[str, Popen]` (plugin name → process). `get_child_processes()` returns `list[tuple[str, Popen]]`.
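The injection pattern can be sketched as a late-bound provider callback. `FakePopen` is a stub standing in for `subprocess.Popen` so the sketch stays self-contained; method names follow the text, everything else is illustrative:

```python
from typing import Callable

class FakePopen:
    """Minimal stand-in for subprocess.Popen."""
    def __init__(self, pid: int) -> None:
        self.pid = pid

ChildProvider = Callable[[], list[tuple[str, FakePopen]]]

class MetricsCollector:
    def __init__(self) -> None:
        # Default to an empty provider until one is injected.
        self._child_provider: ChildProvider = lambda: []

    def set_child_pid_provider(self, provider: ChildProvider) -> None:
        # The collector never imports ChannelManager, only receives
        # a callable from it, so there is no circular import.
        self._child_provider = provider

    def child_pids(self) -> list[int]:
        return [proc.pid for _, proc in self._child_provider()]

collector = MetricsCollector()
collector.set_child_pid_provider(lambda: [("devices", FakePopen(4321))])
```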
Ring buffer
History is stored in a `collections.deque` with a configurable `maxlen`. The default `history_size` is 1800 entries; at 2 seconds per sample that is 1 hour of history. The buffer is in-memory only — it does not survive restarts (see Phase 8 for future SQLite persistence).
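The ring-buffer behaviour comes directly from `deque(maxlen=...)` — appending beyond capacity silently evicts the oldest entry in O(1). A quick illustration:

```python
from collections import deque

# 1800 slots, mirroring the default history_size.
history = deque(maxlen=1800)

# Append more snapshots than the buffer holds; the integers here
# stand in for MetricsSnapshot objects.
for snapshot_id in range(2000):
    history.append(snapshot_id)

# The first 200 entries have been evicted; the newest 1800 remain.
```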
Runtime configuration
POST /metrics/configure accepts:
| Field | Type | Description |
|---|---|---|
| `enabled` | `bool` | Start or stop the collection loop |
| `interval` | `float` | New collection interval in seconds (minimum 1.0) |
| `history_size` | `int` | Resize the ring buffer |
Changes take effect immediately without a server restart.
REST API
All endpoints are on the workspace HTTP server (default port 18080).
GET /metrics
Returns the latest `MetricsSnapshot`, or `503 Service Unavailable` when the collector is not enabled.
GET /metrics/history
| Query param | Default | Description |
|---|---|---|
| `minutes` | `5` | How many minutes of history to return |

Returns an array of `MetricsSnapshot` objects, oldest first.
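As a sketch, a client might build the history request like this (`history_url` is an illustrative helper, not part of the codebase; fetch the resulting URL with any HTTP client):

```python
from urllib.parse import urlencode

BASE = "http://localhost:18080"  # default workspace HTTP server port

def history_url(minutes: int = 5) -> str:
    # GET /metrics/history returns MetricsSnapshot objects,
    # oldest first, covering the requested window.
    return f"{BASE}/metrics/history?" + urlencode({"minutes": minutes})

url = history_url(minutes=10)
```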
GET /metrics/status
Returns collector configuration and runtime state:
```json
{
  "enabled": true,
  "interval": 2.0,
  "history_size": 1800,
  "history_count": 423,
  "uptime_seconds": 846.2
}
```
POST /metrics/configure
Runtime reconfiguration. All fields are optional.
{ "enabled": true, "interval": 5.0 }
Config fields
Three fields in `config.json` control the collector's startup defaults:
| Field | Default | Description |
|---|---|---|
| `metrics_enabled` | `false` | Whether to start the collector at server startup |
| `metrics_interval` | `2.0` | Collection interval in seconds |
| `metrics_history_size` | `1800` | Ring buffer capacity (snapshots) |
These are set persistently via `hirocli setup --metrics --metrics-interval 3.0`.
The environment variable `HIRO_METRICS=1` overrides `metrics_enabled` for a single server run (set by `hirocli start --metrics`). This is the same ephemeral override pattern used by `HIRO_ADMIN_UI` for the admin UI.
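The precedence can be sketched in a few lines (`effective_metrics_enabled` is a hypothetical helper mirroring the described behaviour, not the actual config code):

```python
import os

def effective_metrics_enabled(config_value: bool) -> bool:
    # HIRO_METRICS=1 wins over the persisted config for this run;
    # otherwise the config.json value stands.
    return os.environ.get("HIRO_METRICS") == "1" or config_value

# Simulate `hirocli start --metrics` setting the override.
os.environ["HIRO_METRICS"] = "1"
override = effective_metrics_enabled(False)
```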
Admin UI dashboard
The Metrics page at /metrics in the admin UI shows five sections:
Controls
- Enable/disable toggle — calls `POST /metrics/configure` in-process.
- Interval slider (1–10 seconds).
Server process
Three ECharts sparkline cards: CPU percent, RSS memory, and thread count. These are the primary metrics — the server’s own footprint.
Channel plugins
A table with a row per spawned channel plugin. Columns: name, alive status, CPU percent, RSS, threads. A totals row sums resource usage across all plugins.
Disk & network
- Disk card: usage percentage bar and a dual-line chart showing read/write byte rates.
- Network card: a dual-line chart showing bytes sent and received per second.
System-wide
Collapsible section. System-wide CPU percent and total memory usage. Useful context but not the primary view.
File layout
```text
hirocli/src/hirocli/
├── services/
│   └── metrics/
│       ├── __init__.py          # public exports
│       ├── models.py            # Pydantic models
│       └── collector.py         # MetricsCollector async service
├── runtime/
│   ├── server_process.py        # wires MetricsCollector into asyncio.gather
│   ├── http_server.py           # /metrics REST endpoints
│   └── channel_manager.py       # get_child_processes() method
├── ui/
│   └── pages/
│       └── metrics.py           # NiceGUI admin UI page
├── commands/
│   └── metrics.py               # hirocli metrics CLI command group
└── domain/
    └── config.py                # metrics_enabled, metrics_interval, metrics_history_size
```
Safety and overhead
Performance impact: `psutil` calls are lightweight. `cpu_percent()` blocks for a short interval to measure the delta, so all collection runs in `asyncio.to_thread`. At the default 2-second interval the overhead is under 0.1% CPU.
Memory footprint: The ring buffer holds 1800 small Pydantic models (~500 bytes each), totalling under 1 MB.
Threats: none significant. `psutil` is read-only — it cannot mutate system state. The only risk is a very tight polling interval starving the machine, mitigated by the 1-second floor enforced in `POST /metrics/configure`.
Disabled by default: The collector runs only when explicitly enabled, so it has zero impact on the default server startup.
What is not monitored yet
- App-level metrics — message throughput, agent latency, queue depths (planned in Phase 6).
- History persistence — the ring buffer is in-memory only; metrics do not survive server restarts (planned in Phase 8 with SQLite and downsampling).
- GPU or temperature sensors — require platform-specific code and are out of scope.