When a "message" type UnifiedMessage arrives at the Communication Manager, it may carry content that the agent cannot directly understand — audio recordings, images, video. The adapter pipeline solves this by enriching each ContentItem in-place before the message is placed on the inbound queue.

Design principles

  • Enrich in-place, never add or remove items — Adapters write their output into item.metadata["description"] on the existing ContentItem. The original body (audio data, image URL, etc.) is always preserved. No new content items are added, and no originals are removed.
  • Concurrent capabilities — Independent adapters (audio and image) run concurrently within a single message via asyncio.gather. A message containing both audio and image content is enriched in parallel.
  • Non-blocking — The entire pipeline runs inside an asyncio.Task spawned per message. receive() on the Communication Manager returns immediately after sending an acknowledgment event, so incoming messages from other devices are never blocked while a transcription or image analysis is in progress.
  • Fail gracefully — If an adapter fails (network error, API quota, etc.) the error is captured and written to item.metadata["adapter_error"]. The message is still queued to the agent, which can then decide how to respond.
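The concurrency model above (asyncio.gather within one message, a fire-and-forget task per message) can be sketched with toy types. The adapter classes and enrich() helper here are illustrative assumptions, not the actual implementation:

```python
# Minimal sketch of the concurrent, in-place enrichment described above.
# ToyAudioAdapter / ToyImageAdapter / enrich() are hypothetical stand-ins.
import asyncio
from dataclasses import dataclass, field

@dataclass
class ContentItem:
    content_type: str
    body: str
    metadata: dict = field(default_factory=dict)

class ToyAudioAdapter:
    def can_handle(self, content):
        return any(i.content_type == "audio" for i in content)

    async def adapt(self, content):
        for item in content:
            if item.content_type == "audio":
                # enrich in place; the original body is untouched
                item.metadata["description"] = f"transcript of {item.body}"

class ToyImageAdapter:
    def can_handle(self, content):
        return any(i.content_type == "image" for i in content)

    async def adapt(self, content):
        for item in content:
            if item.content_type == "image":
                item.metadata["description"] = f"description of {item.body}"

async def enrich(content, adapters):
    # independent adapters run concurrently within a single message;
    # in the real pipeline this coroutine runs inside a per-message Task
    await asyncio.gather(
        *(a.adapt(content) for a in adapters if a.can_handle(content))
    )
    return content
```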

Content enrichment example

A voice message arrives with one audio content item:
{
  "content_type": "audio",
  "body": "https://cdn.example.com/voice.ogg",
  "metadata": { "duration_ms": 3400 }
}
After the audio adapter:
{
  "content_type": "audio",
  "body": "https://cdn.example.com/voice.ogg",
  "metadata": {
    "duration_ms": 3400,
    "description": "Hey, can you check the server logs?"
  }
}
On transcription failure:
{
  "content_type": "audio",
  "body": "https://cdn.example.com/voice.ogg",
  "metadata": {
    "duration_ms": 3400,
    "adapter_error": "Transcription service unavailable"
  }
}

Pipeline flow

(Diagram: message adapter pipeline flow from receive() to the inbound queue.)

Adapter classes

The adapter system uses a three-level class hierarchy.

MessageAdapter (ABC)

The minimal interface every adapter must implement:
class MessageAdapter(ABC):
    def can_handle(self, msg: UnifiedMessage) -> bool: ...
    async def adapt(self, msg: UnifiedMessage) -> UnifiedMessage: ...

ContentTypeAdapter (base class)

A Template Method base for the common case: target a specific content_type, loop over matching items, call process_item(), write the result into item.metadata["description"]. Handles matching, iteration, and error capture. Subclasses only implement two things:
class ContentTypeAdapter(MessageAdapter):
    @property
    @abstractmethod
    def target_content_type(self) -> str: ...

    @abstractmethod
    async def process_item(self, item: ContentItem) -> str: ...

Concrete adapters

Each concrete adapter extends ContentTypeAdapter with one property and one method:
class AudioTranscriptionAdapter(ContentTypeAdapter):
    target_content_type = "audio"

    async def process_item(self, item: ContentItem) -> str:
        # call LangChain Whisper, return transcript
        ...

class ImageUnderstandingAdapter(ContentTypeAdapter):
    target_content_type = "image"

    async def process_item(self, item: ContentItem) -> str:
        # call LangChain vision model, return description
        ...

Adapters

Audio transcription

Class AudioTranscriptionAdapter (hiroserver/hirocli/src/hirocli/runtime/adapters/audio_adapter.py). Transcribes audio content items using LangChain's OpenAI Whisper integration (langchain_community.document_loaders.parsers.audio.OpenAIWhisperParser).
  • Trigger: any ContentItem with content_type == "audio"
  • Reads: item.body — a URL, data URI, or raw base64 audio payload
  • Writes: transcript text into item.metadata["description"]
  • Side effect: optionally sends a message.transcribed event back to the device once transcription is complete
  • Disabled when: OPENAI_API_KEY environment variable is not set (can_handle returns False)

Image understanding

Class ImageUnderstandingAdapter (hiroserver/hirocli/src/hirocli/runtime/adapters/image_adapter.py). Describes image content items using a LangChain multimodal vision model.
  • Trigger: any ContentItem with content_type == "image"
  • Reads: item.body — a URL, data URI, or raw base64 image payload
  • Writes: description text into item.metadata["description"]
  • Default model: openai:gpt-4o-mini (override with IMAGE_VISION_MODEL env var)
  • Default prompt: a generic description prompt (override with IMAGE_ANALYSIS_PROMPT env var)
  • Disabled when: OPENAI_API_KEY environment variable is not set (can_handle returns False)

How the Agent Manager uses enriched messages

The Agent Manager builds its agent input by reading from all content items — not just text:
parts = []
for item in msg.content:
    if item.content_type == "text":
        parts.append(item.body)
    elif "description" in item.metadata:
        parts.append(f"[{item.content_type}]: {item.metadata['description']}")

text_body = "\n".join(parts)
A voice message with a transcript arrives at the agent as:
[audio]: Hey, can you check the server logs?
A message with a text caption and an image arrives as:
Can you tell me what's in this photo?
[image]: A kitchen counter with a coffee maker, a stack of books, and a plant near the window.
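The assembly loop above can be exercised end to end with sample data. ContentItem and build_agent_input() here are minimal stand-ins for illustration, not the Agent Manager's actual types:

```python
# Runnable illustration of the agent-input assembly shown above,
# wrapped in a hypothetical helper for testing.
from dataclasses import dataclass, field

@dataclass
class ContentItem:
    content_type: str
    body: str
    metadata: dict = field(default_factory=dict)

def build_agent_input(content):
    parts = []
    for item in content:
        if item.content_type == "text":
            parts.append(item.body)
        elif "description" in item.metadata:
            parts.append(f"[{item.content_type}]: {item.metadata['description']}")
    return "\n".join(parts)

msg_content = [
    ContentItem("text", "Can you tell me what's in this photo?"),
    ContentItem("image", "https://cdn.example.com/p.jpg",
                {"description": "A kitchen counter with a coffee maker."}),
]
print(build_agent_input(msg_content))
# Can you tell me what's in this photo?
# [image]: A kitchen counter with a coffee maker.
```

Note that a non-text item with neither a description nor a transcript is silently skipped, which matches the fail-gracefully principle: the agent still sees whatever content could be enriched.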

Adding a future adapter

Adding a new adapter (for example, video or PDF) requires only extending ContentTypeAdapter:
class VideoAdapter(ContentTypeAdapter):
    target_content_type = "video"

    async def process_item(self, item: ContentItem) -> str:
        # call video understanding service, return description
        ...
Register it in the pipeline at server startup in server_process.py:
adapter_pipeline = MessageAdapterPipeline([
    AudioTranscriptionAdapter(),
    ImageUnderstandingAdapter(),
    VideoAdapter(),  # add here
])
No other changes are needed.

See also