The metrics system samples CPU, memory, disk, and network usage for the workspace server process and its channel plugin subprocesses. It is disabled by default, opt-in via config or a CLI flag, and tunable at runtime without a server restart.

Design goals

  • Process-scoped first. Metrics are collected for the server process and its spawned channel plugins, not the whole machine. System-wide numbers are secondary and collapsible in the UI.
  • Disabled by default. Zero overhead when not needed. Enable persistently via hirocli setup --metrics or ephemerally via hirocli start --metrics.
  • Runtime tunable. Collection interval and enable/disable state can change without a restart via POST /metrics/configure.
  • Non-blocking. All psutil calls run in asyncio.to_thread so they never block the event loop.
  • Cross-platform. psutil exposes a single API on Windows, macOS, and Linux. No platform branching in the collector.

Data models

All models live in hirocli/src/hirocli/services/metrics/models.py.

MetricsSnapshot

Top-level snapshot returned by every REST endpoint and stored in the history ring buffer.
| Field | Type | Description |
| --- | --- | --- |
| `timestamp` | `float` | Unix epoch seconds at collection time |
| `cpu` | `CpuMetrics` | System-wide CPU usage |
| `memory` | `MemoryMetrics` | System-wide memory usage |
| `process` | `ProcessMetrics \| None` | Main server process metrics |
| `children` | `list[ChildProcessMetrics]` | Per-channel-plugin metrics |
| `disk` | `DiskMetrics \| None` | Disk usage and I/O rates |
| `network` | `NetworkMetrics \| None` | Network throughput rates |

ProcessMetrics

| Field | Description |
| --- | --- |
| `pid` | OS process ID |
| `cpu_percent` | CPU usage for this process (summed across cores, so 0 to 100 × num_cores) |
| `rss_bytes` | Resident set size: physical RAM in use |
| `vms_bytes` | Virtual memory size: includes shared libs and memory-mapped files |
| `num_threads` | Active thread count |
RSS is the real memory cost. VMS looks large (often 3–5× RSS) because it includes the Python interpreter’s shared-library mappings. Use RSS to assess actual memory pressure.

ChildProcessMetrics

Same fields as ProcessMetrics plus:
| Field | Description |
| --- | --- |
| `name` | Channel plugin name (e.g. `devices`) |
| `alive` | Whether the subprocess is still running |

DiskMetrics

| Field | Description |
| --- | --- |
| `total_bytes` | Partition total capacity |
| `used_bytes` | Bytes in use |
| `free_bytes` | Bytes available |
| `percent` | Usage percentage |
| `read_bytes_per_sec` | Rolling I/O read rate |
| `write_bytes_per_sec` | Rolling I/O write rate |

NetworkMetrics

| Field | Description |
| --- | --- |
| `bytes_sent_per_sec` | Outbound throughput |
| `bytes_recv_per_sec` | Inbound throughput |
| `packets_sent_per_sec` | Outbound packet rate |
| `packets_recv_per_sec` | Inbound packet rate |

Collector

MetricsCollector is an async service that runs as a coroutine alongside the HTTP server, admin UI, and other services inside the workspace server process.
*Diagram: MetricsCollector wired inside the server process.*

Collection loop

On each tick (default every 2 seconds):
  1. Collect server process CPU, RSS, VMS, and thread count using a cached psutil.Process handle.
  2. Iterate child PIDs from channel_manager.get_child_processes(). Create or reuse a psutil.Process handle per plugin; mark alive=False for any that have exited.
  3. Collect system-wide CPU percent and memory (total, available, percent).
  4. Read disk usage from the root partition. Compute I/O rates by diffing the current disk_io_counters() against the previous sample divided by elapsed time.
  5. Compute network rates the same way from net_io_counters().
  6. Build a MetricsSnapshot and append it to an in-memory deque ring buffer.
  7. Store the snapshot as latest for O(1) reads.
All psutil calls run inside asyncio.to_thread to avoid blocking the event loop.
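The rate computations in steps 4 and 5 reduce to a diff-over-elapsed helper. A minimal sketch (the function name and dict shape are illustrative, not the collector's actual internals):

```python
def io_rates(prev: dict[str, int], curr: dict[str, int], elapsed: float) -> dict[str, float]:
    """Diff two counter samples and divide by elapsed seconds.

    Counters from disk_io_counters()/net_io_counters() are cumulative,
    so the per-second rate is (current - previous) / elapsed.
    """
    if elapsed <= 0:
        return {k: 0.0 for k in curr}
    # max(..., 0) guards against counter resets (e.g. after a reboot)
    return {k: max(curr[k] - prev.get(k, curr[k]), 0) / elapsed for k in curr}

prev = {"read_bytes": 1_000, "write_bytes": 500}
curr = {"read_bytes": 5_000, "write_bytes": 1_500}
rates = io_rates(prev, curr, elapsed=2.0)
# read: (5000 - 1000) / 2 = 2000.0 bytes/s; write: (1500 - 500) / 2 = 500.0 bytes/s
```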

Child process injection

The collector receives child PIDs through a callback to avoid circular imports between ChannelManager and MetricsCollector:
```python
# in server_process.py
metrics_collector.set_child_pid_provider(channel_manager.get_child_processes)
```

`ChannelManager` stores its subprocesses as `dict[str, Popen]` (plugin name → process). `get_child_processes()` returns `list[tuple[str, Popen]]`.
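The provider pattern itself is a late-bound callback. An illustrative fragment (not the real class body; the element type is loosened to `object` in place of `Popen` for the sketch):

```python
from typing import Callable

ChildProvider = Callable[[], list[tuple[str, object]]]

class MetricsCollector:
    """Fragment showing only the child-provider wiring."""

    def __init__(self) -> None:
        # Injected after construction so MetricsCollector never has to
        # import ChannelManager, breaking the circular-import cycle.
        self._child_provider: ChildProvider | None = None

    def set_child_pid_provider(self, provider: ChildProvider) -> None:
        self._child_provider = provider

    def child_processes(self) -> list[tuple[str, object]]:
        # No provider wired yet means no children to sample
        return self._child_provider() if self._child_provider else []
```

The collector degrades gracefully: before the provider is wired it simply reports no children rather than failing.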

Ring buffer

History is stored in a collections.deque with a configurable maxlen. The default history_size is 1800 entries. At 2 seconds per sample that is 1 hour of history. The buffer is in-memory only — it does not survive restarts (see Phase 8 for future SQLite persistence).
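The eviction and resize behavior follows directly from `collections.deque`. Note that `maxlen` is read-only, so resizing (as `POST /metrics/configure` does) means rebuilding the deque:

```python
from collections import deque

history: deque = deque(maxlen=1800)  # ~1 hour at a 2 s interval
for i in range(2000):
    history.append(i)                # oldest entries evicted automatically

assert len(history) == 1800
assert history[0] == 200             # entries 0-199 were dropped

# Resize by rebuilding; the newest entries survive, the oldest are dropped
history = deque(history, maxlen=900)
assert len(history) == 900 and history[0] == 1100
```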

Runtime configuration

POST /metrics/configure accepts:
| Field | Type | Description |
| --- | --- | --- |
| `enabled` | `bool` | Start or stop the collection loop |
| `interval` | `float` | New collection interval in seconds (minimum 1.0) |
| `history_size` | `int` | Resize the ring buffer |
Changes take effect immediately without a server restart.
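The documented constraints (all fields optional, 1-second interval floor) could be validated roughly like this; the function name and return shape are hypothetical, not the endpoint's actual code:

```python
def validate_configure(payload: dict) -> dict:
    """Keep only recognized fields and clamp to documented limits."""
    cleaned: dict = {}
    if "enabled" in payload:
        cleaned["enabled"] = bool(payload["enabled"])
    if "interval" in payload:
        # 1-second floor: a tighter polling loop could starve the host
        cleaned["interval"] = max(float(payload["interval"]), 1.0)
    if "history_size" in payload:
        cleaned["history_size"] = max(int(payload["history_size"]), 1)
    return cleaned

validate_configure({"interval": 0.25})  # -> {"interval": 1.0}, clamped to the floor
```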

REST API

All endpoints are on the workspace HTTP server (default port 18080).

GET /metrics

Returns the latest MetricsSnapshot, or 503 when the collector is not enabled.

GET /metrics/history

| Query param | Default | Description |
| --- | --- | --- |
| `minutes` | `5` | How many minutes of history to return |
Returns an array of MetricsSnapshot objects, oldest first.

GET /metrics/status

Returns collector configuration and runtime state:
```json
{
  "enabled": true,
  "interval": 2.0,
  "history_size": 1800,
  "history_count": 423,
  "uptime_seconds": 846.2
}
```

POST /metrics/configure

Runtime reconfiguration. All fields are optional.
```json
{ "enabled": true, "interval": 5.0 }
```

Config fields

Three fields in config.json control the collector’s startup defaults:
| Field | Default | Description |
| --- | --- | --- |
| `metrics_enabled` | `false` | Whether to start the collector at server startup |
| `metrics_interval` | `2.0` | Collection interval in seconds |
| `metrics_history_size` | `1800` | Ring buffer capacity (snapshots) |
These are set persistently via hirocli setup --metrics --metrics-interval 3.0. An environment variable HIRO_METRICS=1 overrides metrics_enabled for a single server run (set by hirocli start --metrics). This is the same ephemeral override pattern used by HIRO_ADMIN_UI for the admin UI.
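The precedence rule (env var beats persisted config for one run) can be sketched as a small resolver; the helper name is illustrative, only the `HIRO_METRICS` variable and `metrics_enabled` key come from the text above:

```python
import os

def resolve_metrics_enabled(config: dict) -> bool:
    """HIRO_METRICS=1 (set by `hirocli start --metrics`) force-enables the
    collector for this run; otherwise the persisted config value applies."""
    if os.environ.get("HIRO_METRICS") == "1":
        return True
    return bool(config.get("metrics_enabled", False))
```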

Admin UI dashboard

The Metrics page at /metrics in the admin UI shows five sections:

Controls

  • Enable/disable toggle — calls POST /metrics/configure in-process.
  • Interval slider (1–10 seconds).

Server process

Three ECharts sparkline cards: CPU percent, RSS memory, and thread count. These are the primary metrics — the server’s own footprint.

Channel plugins

A table with a row per spawned channel plugin. Columns: name, alive status, CPU percent, RSS, threads. A totals row sums resource usage across all plugins.

Disk & network

  • Disk card: usage percentage bar and a dual-line chart showing read/write byte rates.
  • Network card: a dual-line chart showing bytes sent and received per second.

System-wide

Collapsible section. System-wide CPU percent and total memory usage. Useful context but not the primary view.

File layout

```text
hirocli/src/hirocli/
├── services/
│   └── metrics/
│       ├── __init__.py          # public exports
│       ├── models.py            # Pydantic models
│       └── collector.py         # MetricsCollector async service
├── runtime/
│   ├── server_process.py        # wires MetricsCollector into asyncio.gather
│   ├── http_server.py           # /metrics REST endpoints
│   └── channel_manager.py       # get_child_processes() method
├── ui/
│   └── pages/
│       └── metrics.py           # NiceGUI admin UI page
├── commands/
│   └── metrics.py               # hirocli metrics CLI command group
└── domain/
    └── config.py                # metrics_enabled, metrics_interval, metrics_history_size
```

Safety and overhead

  • Performance impact. psutil calls are lightweight. cpu_percent() blocks for a short interval to measure its delta, which is why all collection runs in asyncio.to_thread. At the default 2-second interval the overhead is under 0.1% CPU.
  • Memory footprint. The ring buffer holds 1800 small Pydantic models (~500 bytes each), totalling under 1 MB.
  • Threats. None significant. psutil is read-only; it cannot mutate system state. The only risk is a very tight polling interval starving the machine, mitigated by the 1-second floor enforced in POST /metrics/configure.
  • Disabled by default. The collector runs only when explicitly enabled, so it has zero impact on the default server startup.

What is not monitored yet

  • App-level metrics — message throughput, agent latency, queue depths (planned in Phase 6).
  • History persistence — the ring buffer is in-memory only; metrics do not survive server restarts (planned in Phase 8 with SQLite and downsampling).
  • GPU or temperature sensors — require platform-specific code and are out of scope.