"message" type UnifiedMessage arrives at the Communication Manager, it may carry content that the agent cannot directly understand — audio recordings, images, video. The adapter pipeline solves this by enriching each ContentItem in-place before the message is placed on the inbound queue.
Design principles
Enrich in-place, never add or remove items — Adapters write their output into item.metadata["description"] on the existing ContentItem. The original body (audio data, image URL, etc.) is always preserved. No new content items are added, no originals are removed.
Concurrent capabilities — Independent adapters (audio and image) run concurrently within a single message via asyncio.gather. A message containing both audio and image content is enriched in parallel.
Non-blocking — The entire pipeline runs inside an asyncio.Task spawned per message. receive() on the Communication Manager returns immediately after sending an acknowledgment event, so incoming messages from other devices are never blocked while a transcription or image analysis is in progress.
Fail gracefully — If an adapter fails (network error, API quota, etc.) the error is captured and written to item.metadata["adapter_error"]. The message is still queued to the agent, which can then decide how to respond.
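Taken together, these principles suggest a dispatch loop like the following sketch. The ContentItem shape and the function names (run_adapter, enrich_message) are illustrative assumptions, not the hirocli source; adapters are assumed to expose a content_type and an async process_item.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ContentItem:
    content_type: str
    body: str
    metadata: dict[str, Any] = field(default_factory=dict)


async def run_adapter(adapter, items: list[ContentItem]) -> None:
    # Fail gracefully: capture errors into metadata instead of raising,
    # so the message is still queued to the agent afterwards.
    for item in items:
        if item.content_type != adapter.content_type:
            continue
        try:
            item.metadata["description"] = await adapter.process_item(item)
        except Exception as exc:
            item.metadata["adapter_error"] = str(exc)


async def enrich_message(adapters, items: list[ContentItem]) -> list[ContentItem]:
    # Independent adapters run concurrently over the same message.
    await asyncio.gather(*(run_adapter(a, items) for a in adapters))
    return items  # enriched in place; no items added or removed
```

In the real pipeline this coroutine would itself run inside an asyncio.Task spawned per message, so receive() never blocks on it.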
Content enrichment example
A voice message arrives with one audio content item:

Pipeline flow
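Concretely, enrichment leaves the item's body untouched and only adds a metadata key. A before/after sketch with hypothetical values (the payload and transcript text are invented for illustration):

```python
# Before enrichment: the raw audio item as received from the device.
before = {
    "content_type": "audio",
    "body": "data:audio/ogg;base64,T2dnUw",  # original payload, never modified
    "metadata": {},
}

# After the adapter pipeline: same item, same body, one new metadata key.
after = {
    "content_type": "audio",
    "body": "data:audio/ogg;base64,T2dnUw",
    "metadata": {"description": "Hey, can you add milk to the shopping list?"},
}
```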

Adapter classes
The adapter system uses a three-level class hierarchy.

MessageAdapter (ABC)
The minimal interface every adapter must implement:
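A plausible shape for that interface, shown as a sketch rather than the verbatim hirocli code:

```python
from abc import ABC, abstractmethod


class MessageAdapter(ABC):
    """Minimal contract: decide whether to run, then enrich in place."""

    @abstractmethod
    def can_handle(self, message) -> bool:
        """Return False to skip this adapter entirely (e.g. missing API key)."""

    @abstractmethod
    async def process(self, message) -> None:
        """Enrich the message's content items in place; never raise."""
```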
ContentTypeAdapter (base class)
A Template Method base for the common case: target a specific content_type, loop over matching items, call process_item(), write the result into item.metadata["description"]. Handles matching, iteration, and error capture. Subclasses only implement two things:
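The two subclass hooks are a content_type property and a process_item() coroutine. A self-contained sketch of the template method (assumed structure, not the hirocli source; the real class extends MessageAdapter):

```python
from abc import ABC, abstractmethod


class ContentTypeAdapter(ABC):
    """Template method: match items by content_type, enrich each in place."""

    @property
    @abstractmethod
    def content_type(self) -> str:
        """The content_type this adapter targets, e.g. "audio"."""

    @abstractmethod
    async def process_item(self, item) -> str:
        """Return the description text for a single matching item."""

    async def process(self, message) -> None:
        # Matching, iteration, and error capture live here, once.
        for item in message.content:
            if item.content_type != self.content_type:
                continue
            try:
                item.metadata["description"] = await self.process_item(item)
            except Exception as exc:
                item.metadata["adapter_error"] = str(exc)
```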
Concrete adapters
Each concrete adapter extends ContentTypeAdapter with one property and one method:
Adapters
Audio transcription
Class AudioTranscriptionAdapter — hiroserver/hirocli/src/hirocli/runtime/adapters/audio_adapter.py
Transcribes audio content items using LangChain’s OpenAI Whisper integration (langchain_community.document_loaders.parsers.audio.OpenAIWhisperParser).
- Trigger: any ContentItem with content_type == "audio"
- Reads: item.body — a URL, data URI, or raw base64 audio payload
- Writes: transcript text into item.metadata["description"]
- Side effect: optionally sends a message.transcribed event back to the device once transcription is complete
- Disabled when: OPENAI_API_KEY environment variable is not set (can_handle returns False)
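A sketch of the key-gating and the Whisper call shape. This is not the audio_adapter.py source: the lazy import, the assumption that item.body is a local file path here, and the join over parsed documents are all illustrative choices (the real adapter also handles URLs and base64 payloads).

```python
import os


class AudioTranscriptionAdapter:
    """Sketch of the gating described above, not the hirocli source."""

    content_type = "audio"

    def can_handle(self, message) -> bool:
        # Whisper needs an OpenAI key; without it the adapter opts out.
        return bool(os.environ.get("OPENAI_API_KEY"))

    async def process_item(self, item) -> str:
        # Imported lazily so the class can load without langchain installed.
        from langchain_community.document_loaders.parsers.audio import (
            OpenAIWhisperParser,
        )
        from langchain_core.documents.base import Blob

        parser = OpenAIWhisperParser()
        # Assumption: item.body is a local audio file path in this sketch.
        docs = parser.parse(Blob.from_path(item.body))
        return " ".join(d.page_content for d in docs)
```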
Image understanding
Class ImageUnderstandingAdapter — hiroserver/hirocli/src/hirocli/runtime/adapters/image_adapter.py
Describes image content items using a LangChain multimodal vision model.
- Trigger: any ContentItem with content_type == "image"
- Reads: item.body — a URL, data URI, or raw base64 image payload
- Writes: description text into item.metadata["description"]
- Default model: openai:gpt-4o-mini (override with IMAGE_VISION_MODEL env var)
- Default prompt: a generic description prompt (override with IMAGE_ANALYSIS_PROMPT env var)
- Disabled when: OPENAI_API_KEY environment variable is not set (can_handle returns False)
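The environment-variable overrides can be sketched as a small config resolver. The default prompt wording below is invented; only the default model string and the two env var names come from the list above.

```python
import os

DEFAULT_VISION_MODEL = "openai:gpt-4o-mini"
# Hypothetical wording; the actual default prompt lives in image_adapter.py.
DEFAULT_PROMPT = "Describe this image concisely for a text-only assistant."


def image_adapter_config() -> tuple[str, str]:
    # Env vars override the documented defaults.
    model = os.environ.get("IMAGE_VISION_MODEL", DEFAULT_VISION_MODEL)
    prompt = os.environ.get("IMAGE_ANALYSIS_PROMPT", DEFAULT_PROMPT)
    return model, prompt
```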
How the Agent Manager uses enriched messages
The Agent Manager builds its agent input by reading from all content items — not just text.

Adding a future adapter
Adding a new adapter (for example, video or PDF) requires only extending ContentTypeAdapter:
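A hedged sketch of what a video adapter could look like. A minimal stand-in base class is included so the example runs on its own; the real base lives in the adapters package, and the adapter name and placeholder summary are hypothetical.

```python
from abc import ABC, abstractmethod


class ContentTypeAdapter(ABC):
    """Stand-in for the real base: match by content_type, write description."""

    @property
    @abstractmethod
    def content_type(self) -> str: ...

    @abstractmethod
    async def process_item(self, item) -> str: ...

    async def process(self, message) -> None:
        for item in message.content:
            if item.content_type == self.content_type:
                item.metadata["description"] = await self.process_item(item)


class VideoUnderstandingAdapter(ContentTypeAdapter):
    """Hypothetical future adapter: summarize video content."""

    @property
    def content_type(self) -> str:
        return "video"

    def can_handle(self, message) -> bool:
        return True  # a real adapter would gate on credentials here

    async def process_item(self, item) -> str:
        # Placeholder: a real implementation would sample frames and
        # send them to a vision model.
        return f"[video summary placeholder for {item.body}]"
```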
server_process.py:
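The original snippet is not reproduced here; registration presumably amounts to keeping an ordered adapter list and filtering by can_handle() per message. Everything in this sketch, including the StubAdapter class and variable names, is an assumption about the shape of server_process.py, not its contents.

```python
# Hypothetical registration: the manager keeps an ordered adapter list
# and filters by can_handle() for each incoming message.

class StubAdapter:
    def __init__(self, name: str, enabled: bool):
        self.name, self.enabled = name, enabled

    def can_handle(self, message) -> bool:
        return self.enabled


adapters = [StubAdapter("audio", True), StubAdapter("image", False)]


def active_adapters(message):
    return [a for a in adapters if a.can_handle(message)]
```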
