No backward compatibility is required. Treat this as a clean-slate design and prefer explicit message IDs and events over legacy positional matching.
Recommendation
For chat, transcribe audio in the application/infrastructure layer first, then pass the transcript to the agent as the canonical user message. Do not make transcription a normal agent tool call unless you specifically need multimodal reasoning over the raw audio.
Why this is the better default
- The agent always reasons over text, which keeps prompts, memory, and testing simple.
- STT failures and retries stay outside the agent loop.
- The UI can show a placeholder immediately, then patch in the transcript when it is ready.
- You can reuse the transcript for moderation, search, summaries, and analytics.
- TTS remains a separate step after the assistant has produced text.
Recommended flow
- The device node sends an audio message with a stable message ID.
- The server accepts it and emits a receipt/ack event.
- STT runs in regular application code, not inside agent decision-making.
- The server emits a transcription-complete event linked to the original audio message ID.
- The transcript becomes the canonical conversational input to the agent.
- The agent produces a text reply.
- If voice output is requested, TTS generates assistant audio from that text reply.
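The steps above can be sketched end-to-end in application code. This is a minimal illustration, not a definitive implementation: `run_stt` and `agent_reply` are hypothetical stand-ins for your STT client and agent runtime, and the event names mirror the flow described here.

```python
import uuid

# Hypothetical in-memory event log; a real system would persist events
# and push them to subscribers.
events = []

def emit(event_type, ref_id=None, data=None):
    events.append({"type": event_type, "ref_id": ref_id, "data": data or {}})

def handle_audio_message(audio_bytes):
    # 1. The device sends audio; the server assigns/accepts a stable message ID.
    message_id = str(uuid.uuid4())
    # 2. The server acks receipt immediately, before STT runs.
    emit("message.received", ref_id=message_id)
    # 3. STT runs in plain application code, outside agent decision-making.
    transcript = run_stt(audio_bytes)
    # 4. A transcription-complete event links back to the original message ID.
    emit("message.transcribed", ref_id=message_id, data={"transcript": transcript})
    # 5-6. The transcript is the canonical agent input; the agent replies in text.
    reply = agent_reply(transcript)
    emit("assistant.reply", ref_id=message_id, data={"text": reply})
    return message_id

def run_stt(audio_bytes):
    return "hello world"  # stand-in for a real STT call

def agent_reply(transcript):
    return f"You said: {transcript}"  # stand-in for the agent loop
```

If voice output is requested, a TTS step would consume the `assistant.reply` event after this function returns, keeping it outside the agent loop as recommended above.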
Layer responsibilities
| Layer | Responsibility |
|---|---|
| Device / UI | Record audio, upload it, show optimistic placeholder state, patch UI when transcript arrives |
| Gateway / server orchestration | Validate payloads, store message IDs, run STT, emit status and transcript events, handle retries |
| Agent | Read transcript text, understand intent, call tools, generate the assistant response |
| Post-processing | TTS, audio effects, channel formatting, delivery updates |
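The gateway's retry responsibility can be isolated in a small wrapper. A sketch, assuming an injectable `stt_call` callable (hypothetical interface); the point is that a failure surfaces as a structured result, never as fake chat text.

```python
import time

def transcribe_with_retry(audio_bytes, stt_call, max_attempts=3, backoff_s=0.0):
    """Run STT in the gateway with bounded retries.

    `stt_call` is any callable taking audio bytes and returning a transcript.
    Returns a structured dict; failures never masquerade as conversation text.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return {"status": "ok", "transcript": stt_call(audio_bytes)}
        except Exception as exc:  # real code would catch the STT client's errors
            last_error = exc
            time.sleep(backoff_s * attempt)  # simple linear backoff
    return {"status": "failed", "error": str(last_error), "attempts": max_attempts}
```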
UnifiedMessage shape
Use the original voice input as a normal message with content_type: "audio", then emit a correlated message.transcribed event when STT finishes.
This matches the UnifiedMessage model:
- `routing.id` identifies the original voice message
- `event.ref_id` points back to that message
- `event.data.transcript` carries the transcribed text
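Concretely, the two correlated records might look like this. Field values are illustrative; the field names follow the UnifiedMessage conventions described above.

```python
# Original voice input stored as a normal message.
audio_message = {
    "routing": {"id": "msg-123"},
    "content_type": "audio",
    "payload": {"mime_type": "audio/ogg", "duration_ms": 4200},
}

# Correlated event emitted when STT finishes.
transcribed_event = {
    "type": "message.transcribed",
    "event": {
        "ref_id": "msg-123",  # points back to the audio message
        "data": {"transcript": "book a table for two"},
    },
}

# The event is linked to the message by ID, not by arrival order.
assert transcribed_event["event"]["ref_id"] == audio_message["routing"]["id"]
```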
What the agent should receive
The agent should receive:
- the transcript text
- the source message ID
- useful metadata such as `mime_type`, `duration_ms`, `language`, or `stt_confidence`
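One way to make that contract explicit is a small dataclass. The shape is a sketch under the assumptions above; the class name and optional fields are illustrative, not a prescribed API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AgentInput:
    """Canonical payload handed to the agent after STT completes."""
    transcript: str                         # the canonical chat input
    source_message_id: str                  # links back to the audio message
    mime_type: Optional[str] = None
    duration_ms: Optional[int] = None
    language: Optional[str] = None
    stt_confidence: Optional[float] = None

inp = AgentInput(
    transcript="book a table for two",
    source_message_id="msg-123",
    mime_type="audio/ogg",
    duration_ms=4200,
    language="en",
    stt_confidence=0.93,
)
```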
When agent-triggered transcription makes sense
Letting the agent call a transcription tool directly is useful only when you need true multimodal behavior, for example:
- the agent must inspect raw audio beyond plain words
- tone, hesitation, or other non-text cues matter
- different agent branches may choose different extraction strategies
- audio inspection is optional and should happen only in special flows
Design rules
- Use explicit message IDs everywhere.
- Emit transcript updates as structured events.
- Treat the transcript as the canonical chat input.
- Keep STT and TTS outside the core agent reasoning loop.
- Handle STT failure as structured state, not as fake text injected into the conversation.
- Keep the original audio message for playback, auditing, and future reprocessing.
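The failure rule can be made concrete with a small mapping from STT outcome to structured state. A sketch, assuming a `{"status": ...}` result shape (hypothetical); the key point is that a failure becomes a retryable state the UI can render, and never enters the conversation as a message.

```python
def transcript_state(result):
    """Map an STT outcome to structured state, not fake conversation text."""
    if result["status"] == "ok":
        return {"state": "transcribed", "text": result["transcript"]}
    # Failure is surfaced as state the UI can act on (e.g. offer a retry).
    return {
        "state": "transcription_failed",
        "retryable": True,
        "error": result.get("error", "unknown"),
    }
```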
Avoid
- having the agent decide whether a standard chat voice note should be transcribed
- coupling transcript and assistant reply by arrival order
- mixing audio transport logic into the agent prompt contract
- converting transcription failures into user-visible fake messages
