No backward compatibility is required. Treat this as a clean-slate design and prefer explicit message IDs and events over legacy positional matching.

Recommendation

For chat, transcribe audio in the application/infrastructure layer first, then pass the transcript to the agent as the canonical user message. Do not make transcription a normal agent tool call unless you specifically need multimodal reasoning over the raw audio.

Why this is the better default

  • The agent always reasons over text, which keeps prompts, memory, and testing simple.
  • STT failures and retries stay outside the agent loop.
  • The UI can show a placeholder immediately, then patch in the transcript when it is ready.
  • You can reuse the transcript for moderation, search, summaries, and analytics.
  • TTS remains a separate step after the assistant has produced text.
Recommended flow

  1. The device node sends an audio message with a stable message ID.
  2. The server accepts it and emits a receipt/ack event.
  3. STT runs in regular application code, not inside agent decision-making.
  4. The server emits a transcription-complete event linked to the original audio message ID.
  5. The transcript becomes the canonical conversational input to the agent.
  6. The agent produces a text reply.
  7. If voice output is requested, TTS generates assistant audio from that text reply.
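The steps above can be sketched as a single server-side handler. This is a minimal illustration, not a real API: `runStt`, `runAgent`, the `emit` callback, and the event names are all hypothetical placeholders.

```typescript
// Hypothetical server-side orchestration of steps 2-6. All helper names
// (emit, runStt, runAgent) and event names are illustrative assumptions.
type AudioMessage = { id: string; audioUrl: string };

async function handleAudioMessage(
  msg: AudioMessage,
  emit: (event: string, payload: object) => void,
  runStt: (audioUrl: string) => Promise<string>,
  runAgent: (transcript: string) => Promise<string>,
): Promise<void> {
  // Step 2: acknowledge receipt so the UI can show its placeholder.
  emit("message.received", { ref_id: msg.id });

  // Steps 3-4: STT runs in plain application code; the result event is
  // correlated to the original audio message by its ID, not by order.
  const transcript = await runStt(msg.audioUrl);
  emit("message.transcribed", { ref_id: msg.id, transcript });

  // Steps 5-6: the transcript is the canonical agent input; the agent
  // produces a text reply (TTS, if any, happens after this point).
  const reply = await runAgent(transcript);
  emit("message.reply", { ref_id: msg.id, text: reply });
}
```

Note that the agent only ever sees the transcript string; the audio URL never crosses into `runAgent`.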

Layer responsibilities

| Layer | Responsibility |
| --- | --- |
| Device / UI | Record audio, upload it, show an optimistic placeholder state, patch the UI when the transcript arrives |
| Gateway / server orchestration | Validate payloads, store message IDs, run STT, emit status and transcript events, handle retries |
| Agent | Read transcript text, understand intent, call tools, generate the assistant response |
| Post-processing | TTS, audio effects, channel formatting, delivery updates |

UnifiedMessage shape

Use the original voice input as a normal message with content_type: "audio", then emit a correlated message.transcribed event when STT finishes. This matches the UnifiedMessage model:
  • routing.id identifies the original voice message
  • event.ref_id points back to that message
  • event.data.transcript carries the transcribed text
That design is better than relying on event order or “most recent pending message” heuristics.
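The correlation can be made concrete with a pair of illustrative shapes. Only `routing.id`, `content_type`, `event.ref_id`, and `event.data.transcript` come from the model described above; every other field name here is an assumption, not the actual UnifiedMessage schema.

```typescript
// Illustrative shapes only. Fields other than routing.id, content_type,
// event.ref_id, and event.data.transcript are assumptions.
interface VoiceMessage {
  routing: { id: string };     // identifies the original voice message
  content_type: "audio";
  audio_url: string;           // assumed field name
}

interface TranscribedEvent {
  event: {
    type: "message.transcribed";
    ref_id: string;            // points back to routing.id
    data: { transcript: string };
  };
}

// Correlate by explicit ID, never by arrival order or "most recent pending".
function matchesMessage(ev: TranscribedEvent, msg: VoiceMessage): boolean {
  return ev.event.ref_id === msg.routing.id;
}
```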

What the agent should receive

The agent should receive:
  • the transcript text
  • the source message ID
  • useful metadata such as mime_type, duration_ms, language, or stt_confidence
The transcript text should be the canonical message content used for conversation history and downstream reasoning.
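A possible agent-input payload, following the bullets above; the exact shape and field names are assumptions for illustration.

```typescript
// Assumed agent-input shape; metadata fields mirror the bullets above.
interface AgentInput {
  transcript: string;          // canonical message content
  source_message_id: string;   // ID of the original audio message
  metadata?: {
    mime_type?: string;
    duration_ms?: number;
    language?: string;
    stt_confidence?: number;
  };
}

// The history entry is built from the transcript, never from the raw audio.
function toHistoryEntry(input: AgentInput): { role: "user"; content: string } {
  return { role: "user", content: input.transcript };
}
```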

When agent-triggered transcription makes sense

Letting the agent call a transcription tool directly is useful only when you need true multimodal behavior, for example:
  • the agent must inspect raw audio beyond plain words
  • tone, hesitation, or other non-text cues matter
  • different agent branches may choose different extraction strategies
  • audio inspection is optional and should happen only in special flows
For a normal chat product, this is usually more complexity than value.

Design rules

  • Use explicit message IDs everywhere.
  • Emit transcript updates as structured events.
  • Treat the transcript as the canonical chat input.
  • Keep STT and TTS outside the core agent reasoning loop.
  • Handle STT failure as structured state, not as fake text injected into the conversation.
  • Keep the original audio message for playback, auditing, and future reprocessing.
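One way to model the "structured failure state" rule is a discriminated result type that maps to either a transcript event or a hypothetical failure event. The event names and error codes here are illustrative assumptions.

```typescript
// STT failure as structured state: the failure becomes an event the UI can
// render as a retry affordance, and no fake text enters the conversation.
// "message.transcription_failed" is a hypothetical event name.
type SttResult =
  | { status: "ok"; transcript: string }
  | { status: "failed"; error_code: string };

function sttEvent(
  refId: string,
  result: SttResult,
): { type: string; ref_id: string; data: object } {
  if (result.status === "ok") {
    return {
      type: "message.transcribed",
      ref_id: refId,
      data: { transcript: result.transcript },
    };
  }
  return {
    type: "message.transcription_failed",
    ref_id: refId,
    data: { error_code: result.error_code },
  };
}
```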

Avoid

  • having the agent decide whether a standard chat voice note should be transcribed
  • coupling transcript and assistant reply by arrival order
  • mixing audio transport logic into the agent prompt contract
  • converting transcription failures into user-visible fake messages

Summary

The recommended design is: audio message -> STT in application code -> transcript event -> agent reads text -> assistant text reply -> optional TTS. This is the simplest, most testable, and most reusable architecture for chat audio messages.