The metrics system samples CPU, memory, disk, and network usage for the workspace server process and its channel plugin subprocesses. It is disabled by default, opt-in via config or a CLI flag, and tunable at runtime without a server restart.

Design goals

  • Process-scoped first. Metrics are collected for the server process and its spawned channel plugins, not the whole machine. System-wide numbers are secondary and collapsible in the UI.
  • Disabled by default. Zero overhead when not needed. Enable persistently via hirocli setup --metrics or ephemerally via hirocli start --metrics.
  • Runtime tunable. Collection interval and enable/disable state can change without a restart via POST /metrics/configure.
  • Non-blocking. All psutil calls run in asyncio.to_thread so they never block the event loop.
  • Cross-platform. psutil exposes a single API on Windows, macOS, and Linux. No platform branching in the collector.

Data models

All models live in hirocli/src/hirocli/services/metrics/models.py.

MetricsSnapshot

Top-level snapshot returned by every REST endpoint and stored in the history ring buffer.
| Field | Type | Description |
| --- | --- | --- |
| `timestamp` | `float` | Unix epoch seconds at collection time |
| `cpu` | `CpuMetrics` | System-wide CPU usage |
| `memory` | `MemoryMetrics` | System-wide memory usage |
| `process` | `ProcessMetrics \| None` | Main server process metrics |
| `children` | `list[ChildProcessMetrics]` | Per-channel-plugin metrics |
| `disk` | `DiskMetrics \| None` | Disk usage and I/O rates |
| `network` | `NetworkMetrics \| None` | Network throughput rates |

ProcessMetrics

| Field | Description |
| --- | --- |
| `pid` | OS process ID |
| `cpu_percent` | CPU usage for this process (summed across cores, so 0 to 100 × num_cores) |
| `rss_bytes` | Resident set size: physical RAM in use |
| `vms_bytes` | Virtual memory size: includes shared libs and memory-mapped files |
| `num_threads` | Active thread count |
RSS is the real memory cost. VMS looks large (often 3–5× RSS) because it includes the Python interpreter’s shared-library mappings. Use RSS to assess actual memory pressure.

ChildProcessMetrics

Same fields as ProcessMetrics plus:
| Field | Description |
| --- | --- |
| `name` | Channel plugin name (e.g. `devices`) |
| `alive` | Whether the subprocess is still running |

DiskMetrics

| Field | Description |
| --- | --- |
| `total_bytes` | Partition total capacity |
| `used_bytes` | Bytes in use |
| `free_bytes` | Bytes available |
| `percent` | Usage percentage |
| `read_bytes_per_sec` | Rolling I/O read rate |
| `write_bytes_per_sec` | Rolling I/O write rate |

NetworkMetrics

| Field | Description |
| --- | --- |
| `bytes_sent_per_sec` | Outbound throughput |
| `bytes_recv_per_sec` | Inbound throughput |
| `packets_sent_per_sec` | Outbound packet rate |
| `packets_recv_per_sec` | Inbound packet rate |

Collector

MetricsCollector is an async service that runs as a coroutine alongside the HTTP server, admin UI, and other services inside the workspace server process.
*Diagram: MetricsCollector wired inside the server process.*

Collection loop

On each tick (default every 2 seconds):
  1. Collect server process CPU, RSS, VMS, and thread count using a cached psutil.Process handle.
  2. Iterate child PIDs from channel_manager.get_child_processes(). Create or reuse a psutil.Process handle per plugin; mark alive=False for any that have exited.
  3. Collect system-wide CPU percent and memory (total, available, percent).
  4. Read disk usage from the root partition. Compute I/O rates by diffing the current disk_io_counters() against the previous sample divided by elapsed time.
  5. Compute network rates the same way from net_io_counters().
  6. Build a MetricsSnapshot and append it to an in-memory deque ring buffer.
  7. Store the snapshot as latest for O(1) reads.
All psutil calls run inside asyncio.to_thread to avoid blocking the event loop.
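The rate computations in steps 4 and 5 reduce to a diff-over-elapsed helper. A minimal sketch (the function name and dict shape are illustrative, not the collector's actual internals):

```python
def io_rates(prev: dict[str, int], curr: dict[str, int], elapsed: float) -> dict[str, float]:
    """Diff two counter samples and divide by elapsed seconds.

    Counters from disk_io_counters()/net_io_counters() are cumulative,
    so the per-second rate is (current - previous) / elapsed.
    """
    if elapsed <= 0:
        return {k: 0.0 for k in curr}
    # max(..., 0) guards against counter resets (e.g. after a reboot)
    return {k: max(curr[k] - prev.get(k, curr[k]), 0) / elapsed for k in curr}

prev = {"read_bytes": 1_000, "write_bytes": 500}
curr = {"read_bytes": 5_000, "write_bytes": 1_500}
rates = io_rates(prev, curr, elapsed=2.0)
# read: (5000 - 1000) / 2 = 2000.0 bytes/s; write: (1500 - 500) / 2 = 500.0 bytes/s
```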

Child process injection

The collector receives child PIDs through a callback to avoid circular imports between ChannelManager and MetricsCollector:
```python
# in server_process.py
metrics_collector.set_child_pid_provider(channel_manager.get_child_processes)
```

`ChannelManager` stores its subprocesses as `dict[str, Popen]` (plugin name → process). `get_child_processes()` returns `list[tuple[str, Popen]]`.
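The provider pattern itself is a late-bound callback. An illustrative fragment (not the real class body; the element type is loosened to `object` in place of `Popen` for the sketch):

```python
from typing import Callable

ChildProvider = Callable[[], list[tuple[str, object]]]

class MetricsCollector:
    """Fragment showing only the child-provider wiring."""

    def __init__(self) -> None:
        # Injected after construction so MetricsCollector never has to
        # import ChannelManager, breaking the circular-import cycle.
        self._child_provider: ChildProvider | None = None

    def set_child_pid_provider(self, provider: ChildProvider) -> None:
        self._child_provider = provider

    def child_processes(self) -> list[tuple[str, object]]:
        # No provider wired yet means no children to sample
        return self._child_provider() if self._child_provider else []
```

The collector degrades gracefully: before the provider is wired it simply reports no children rather than failing.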

Ring buffer

History is stored in a collections.deque with a configurable maxlen. The default history_size is 1800 entries. At 2 seconds per sample that is 1 hour of history. The buffer is in-memory only — it does not survive restarts (see Phase 8 for future SQLite persistence).
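The eviction and resize behavior follows directly from `collections.deque`. Note that `maxlen` is read-only, so resizing (as `POST /metrics/configure` does) means rebuilding the deque:

```python
from collections import deque

history: deque = deque(maxlen=1800)  # ~1 hour at a 2 s interval
for i in range(2000):
    history.append(i)                # oldest entries evicted automatically

assert len(history) == 1800
assert history[0] == 200             # entries 0-199 were dropped

# Resize by rebuilding; the newest entries survive, the oldest are dropped
history = deque(history, maxlen=900)
assert len(history) == 900 and history[0] == 1100
```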

Runtime configuration

POST /metrics/configure accepts:
| Field | Type | Description |
| --- | --- | --- |
| `enabled` | `bool` | Start or stop the collection loop |
| `interval` | `float` | New collection interval in seconds (minimum 1.0) |
| `history_size` | `int` | Resize the ring buffer |
Changes take effect immediately without a server restart.
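The documented constraints (all fields optional, 1-second interval floor) could be validated roughly like this; the function name and return shape are hypothetical, not the endpoint's actual code:

```python
def validate_configure(payload: dict) -> dict:
    """Keep only recognized fields and clamp to documented limits."""
    cleaned: dict = {}
    if "enabled" in payload:
        cleaned["enabled"] = bool(payload["enabled"])
    if "interval" in payload:
        # 1-second floor: a tighter polling loop could starve the host
        cleaned["interval"] = max(float(payload["interval"]), 1.0)
    if "history_size" in payload:
        cleaned["history_size"] = max(int(payload["history_size"]), 1)
    return cleaned

validate_configure({"interval": 0.25})  # -> {"interval": 1.0}, clamped to the floor
```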

REST API

All endpoints are on the workspace HTTP server (default port 18080).

GET /metrics

Returns the latest MetricsSnapshot, or 503 when the collector is not enabled.

GET /metrics/history

| Query param | Default | Description |
| --- | --- | --- |
| `minutes` | `5` | How many minutes of history to return |
Returns an array of MetricsSnapshot objects, oldest first.

GET /metrics/status

Returns collector configuration and runtime state:
```json
{
  "enabled": true,
  "interval": 2.0,
  "history_size": 1800,
  "history_count": 423,
  "uptime_seconds": 846.2
}
```

POST /metrics/configure

Runtime reconfiguration. All fields are optional.
```json
{ "enabled": true, "interval": 5.0 }
```

Config fields

Three fields in config.json control the collector’s startup defaults:
| Field | Default | Description |
| --- | --- | --- |
| `metrics_enabled` | `false` | Whether to start the collector at server startup |
| `metrics_interval` | `2.0` | Collection interval in seconds |
| `metrics_history_size` | `1800` | Ring buffer capacity (snapshots) |
These are set persistently via hirocli setup --metrics --metrics-interval 3.0. An environment variable HIRO_METRICS=1 overrides metrics_enabled for a single server run (set by hirocli start --metrics). This is the same ephemeral override pattern used by HIRO_ADMIN_UI for the admin UI.
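The precedence rule (env var beats persisted config for one run) can be sketched as a small resolver; the helper name is illustrative, only the `HIRO_METRICS` variable and `metrics_enabled` key come from the text above:

```python
import os

def resolve_metrics_enabled(config: dict) -> bool:
    """HIRO_METRICS=1 (set by `hirocli start --metrics`) force-enables the
    collector for this run; otherwise the persisted config value applies."""
    if os.environ.get("HIRO_METRICS") == "1":
        return True
    return bool(config.get("metrics_enabled", False))
```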

Admin UI dashboard

The Metrics page at /metrics in the admin UI shows five sections:

Controls

  • Enable/disable toggle — calls POST /metrics/configure in-process.
  • Interval slider (1–10 seconds).

Server process

Three ECharts sparkline cards: CPU percent, RSS memory, and thread count. These are the primary metrics — the server’s own footprint.

Channel plugins

A table with a row per spawned channel plugin. Columns: name, alive status, CPU percent, RSS, threads. A totals row sums resource usage across all plugins.

Disk & network

  • Disk card: usage percentage bar and a dual-line chart showing read/write byte rates.
  • Network card: a dual-line chart showing bytes sent and received per second.

System-wide

Collapsible section. System-wide CPU percent and total memory usage. Useful context but not the primary view.

File layout

```text
hirocli/src/hirocli/
├── services/
│   └── metrics/
│       ├── __init__.py          # public exports
│       ├── models.py            # Pydantic models
│       └── collector.py         # MetricsCollector async service
├── runtime/
│   ├── server_process.py        # wires MetricsCollector into asyncio.gather
│   ├── http_server.py           # /metrics REST endpoints
│   └── channel_manager.py       # get_child_processes() method
├── ui/
│   └── pages/
│       └── metrics.py           # NiceGUI admin UI page
├── commands/
│   └── metrics.py               # hirocli metrics CLI command group
└── domain/
    └── config.py                # metrics_enabled, metrics_interval, metrics_history_size
```

Safety and overhead

  • Performance impact. psutil calls are lightweight. cpu_percent() blocks for a short interval to measure its delta, which is why all collection runs in asyncio.to_thread. At the default 2-second interval the overhead is under 0.1% CPU.
  • Memory footprint. The ring buffer holds 1800 small Pydantic models (~500 bytes each), totalling under 1 MB.
  • Threats. None significant. psutil is read-only; it cannot mutate system state. The only risk is a very tight polling interval starving the machine, mitigated by the 1-second floor enforced in POST /metrics/configure.
  • Disabled by default. The collector runs only when explicitly enabled, so it has zero impact on the default server startup.

What is not monitored yet

  • App-level metrics — message throughput, agent latency, queue depths (planned in Phase 6).
  • History persistence — the ring buffer is in-memory only; metrics do not survive server restarts (planned in Phase 8 with SQLite and downsampling).
  • GPU or temperature sensors — require platform-specific code and are out of scope.