No backward compatibility is required. Treat this as a clean-slate design and prefer explicit message IDs and events over legacy positional matching.

Recommendation

For chat, transcribe audio in the application/infrastructure layer first, then pass the transcript to the agent as the canonical user message. Do not make transcription a normal agent tool call unless you specifically need multimodal reasoning over the raw audio.

Why this is the better default

  • The agent always reasons over text, which keeps prompts, memory, and testing simple.
  • STT failures and retries stay outside the agent loop.
  • The UI can show a placeholder immediately, then patch in the transcript when it is ready.
  • You can reuse the transcript for moderation, search, summaries, and analytics.
  • TTS remains a separate step after the assistant has produced text.
Recommended flow

  1. The device node sends an audio message with a stable message ID.
  2. The server accepts it and emits a receipt/ack event.
  3. STT runs in regular application code, not inside agent decision-making.
  4. The server emits a transcription-complete event linked to the original audio message ID.
  5. The transcript becomes the canonical conversational input to the agent.
  6. The agent produces a text reply.
  7. If voice output is requested, TTS generates assistant audio from that text reply.
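The steps above can be sketched as a single server-side handler. This is a minimal illustration, not a real API: `runStt`, `runAgent`, the `emit` callback, and the event names are all hypothetical placeholders.

```typescript
// Hypothetical server-side orchestration of steps 2-6. All helper names
// (emit, runStt, runAgent) and event names are illustrative assumptions.
type AudioMessage = { id: string; audioUrl: string };

async function handleAudioMessage(
  msg: AudioMessage,
  emit: (event: string, payload: object) => void,
  runStt: (audioUrl: string) => Promise<string>,
  runAgent: (transcript: string) => Promise<string>,
): Promise<void> {
  // Step 2: acknowledge receipt so the UI can show its placeholder.
  emit("message.received", { ref_id: msg.id });

  // Steps 3-4: STT runs in plain application code; the result event is
  // correlated to the original audio message by its ID, not by order.
  const transcript = await runStt(msg.audioUrl);
  emit("message.transcribed", { ref_id: msg.id, transcript });

  // Steps 5-6: the transcript is the canonical agent input; the agent
  // produces a text reply (TTS, if any, happens after this point).
  const reply = await runAgent(transcript);
  emit("message.reply", { ref_id: msg.id, text: reply });
}
```

Note that the agent only ever sees the transcript string; the audio URL never crosses into `runAgent`.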

Layer responsibilities

| Layer | Responsibility |
| --- | --- |
| Device / UI | Record audio, upload it, show an optimistic placeholder state, patch the UI when the transcript arrives |
| Gateway / server orchestration | Validate payloads, store message IDs, run STT, emit status and transcript events, handle retries |
| Agent | Read transcript text, understand intent, call tools, generate the assistant response |
| Post-processing | TTS, audio effects, channel formatting, delivery updates |

UnifiedMessage shape

Use the original voice input as a normal message with content_type: "audio", then emit a correlated message.transcribed event when STT finishes. This matches the UnifiedMessage model:
  • routing.id identifies the original voice message
  • event.ref_id points back to that message
  • event.data.transcript carries the transcribed text
That design is better than relying on event order or “most recent pending message” heuristics.
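The correlation can be made concrete with a pair of illustrative shapes. Only `routing.id`, `content_type`, `event.ref_id`, and `event.data.transcript` come from the model described above; every other field name here is an assumption, not the actual UnifiedMessage schema.

```typescript
// Illustrative shapes only. Fields other than routing.id, content_type,
// event.ref_id, and event.data.transcript are assumptions.
interface VoiceMessage {
  routing: { id: string };     // identifies the original voice message
  content_type: "audio";
  audio_url: string;           // assumed field name
}

interface TranscribedEvent {
  event: {
    type: "message.transcribed";
    ref_id: string;            // points back to routing.id
    data: { transcript: string };
  };
}

// Correlate by explicit ID, never by arrival order or "most recent pending".
function matchesMessage(ev: TranscribedEvent, msg: VoiceMessage): boolean {
  return ev.event.ref_id === msg.routing.id;
}
```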

What the agent should receive

The agent should receive:
  • the transcript text
  • the source message ID
  • useful metadata such as mime_type, duration_ms, language, or stt_confidence
The transcript text should be the canonical message content used for conversation history and downstream reasoning.
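A possible agent-input payload, following the bullets above; the exact shape and field names are assumptions for illustration.

```typescript
// Assumed agent-input shape; metadata fields mirror the bullets above.
interface AgentInput {
  transcript: string;          // canonical message content
  source_message_id: string;   // ID of the original audio message
  metadata?: {
    mime_type?: string;
    duration_ms?: number;
    language?: string;
    stt_confidence?: number;
  };
}

// The history entry is built from the transcript, never from the raw audio.
function toHistoryEntry(input: AgentInput): { role: "user"; content: string } {
  return { role: "user", content: input.transcript };
}
```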

When agent-triggered transcription makes sense

Letting the agent call a transcription tool directly is useful only when you need true multimodal behavior, for example:
  • the agent must inspect raw audio beyond plain words
  • tone, hesitation, or other non-text cues matter
  • different agent branches may choose different extraction strategies
  • audio inspection is optional and should happen only in special flows
For a normal chat product, this is usually more complexity than value.

Design rules

  • Use explicit message IDs everywhere.
  • Emit transcript updates as structured events.
  • Treat the transcript as the canonical chat input.
  • Keep STT and TTS outside the core agent reasoning loop.
  • Handle STT failure as structured state, not as fake text injected into the conversation.
  • Keep the original audio message for playback, auditing, and future reprocessing.
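One way to model the "structured failure state" rule is a discriminated result type that maps to either a transcript event or a hypothetical failure event. The event names and error codes here are illustrative assumptions.

```typescript
// STT failure as structured state: the failure becomes an event the UI can
// render as a retry affordance, and no fake text enters the conversation.
// "message.transcription_failed" is a hypothetical event name.
type SttResult =
  | { status: "ok"; transcript: string }
  | { status: "failed"; error_code: string };

function sttEvent(
  refId: string,
  result: SttResult,
): { type: string; ref_id: string; data: object } {
  if (result.status === "ok") {
    return {
      type: "message.transcribed",
      ref_id: refId,
      data: { transcript: result.transcript },
    };
  }
  return {
    type: "message.transcription_failed",
    ref_id: refId,
    data: { error_code: result.error_code },
  };
}
```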

Avoid

  • having the agent decide whether a standard chat voice note should be transcribed
  • coupling transcript and assistant reply by arrival order
  • mixing audio transport logic into the agent prompt contract
  • converting transcription failures into user-visible fake messages

Summary

The recommended design is: audio message -> STT in application code -> transcript event -> agent reads text -> assistant text reply -> optional TTS. This is the simplest, most testable, and most reusable architecture for chat audio messages.