
Voice Agents at Scale: ElevenLabs + Twilio + Claude

By OrionAI Build Editorial · Published 2026-05-10

Voice agents look easy in demos and break in production. Here's the pipeline I run for a small business phone agent: latency budget, barge-in, fallback, monitoring.

The latency budget

Caller speaks. We need to start speaking back within ~800ms or it sounds robotic. That budget breaks down:

| Stage | Target |
| --- | --- |
| STT (streaming) | ~150ms to first partial |
| LLM first token | ~250–400ms |
| TTS first audio chunk | ~150–250ms |
| Network jitter buffer | ~50–100ms |

Anything sequential blows this. The whole pipeline must stream.
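Worth sanity-checking the arithmetic. Summing the time-to-first-output targets from the table (these are the illustrative numbers above, not measurements) shows the worst case already lands at 900ms, which is exactly why nothing in the pipeline can afford to run to completion before the next stage starts:

```python
# Per-stage time-to-first-output in ms, (best, worst), from the table above.
STAGES = {
    "stt_first_partial": (150, 150),
    "llm_first_token": (250, 400),
    "tts_first_chunk": (150, 250),
    "jitter_buffer": (50, 100),
}
BUDGET_MS = 800

best = sum(lo for lo, hi in STAGES.values())
worst = sum(hi for lo, hi in STAGES.values())

# Even with every stage streaming, the worst case overshoots the budget,
# so there is zero slack for any stage that waits for a full upstream result.
print(f"best {best}ms / worst {worst}ms against a {BUDGET_MS}ms budget")
```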

The pipeline

  1. Twilio media stream over WebSocket — raw audio frames.
  2. Streaming STT — partial transcripts emit while caller is still speaking.
  3. Endpoint detection — silence threshold + a smart "are they still talking?" model.
  4. LLM call (streamed) — start as soon as the user pauses, with the partial transcript already buffered.
  5. Streaming TTS — first audio chunk plays back over the call as the LLM is still emitting tokens.

Barge-in

If the caller starts talking while the agent is mid-sentence, the agent must stop speaking immediately. Hard requirement.
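One way to implement this (a sketch, not the only option) is to run playback as its own asyncio task and cancel it the instant the VAD flags caller speech; in the real pipeline you would also send Twilio's media stream a `clear` message so audio it has already buffered gets dropped. The mock below captures just the cancellation shape:

```python
import asyncio

async def play_response(chunks_played: list):
    """Mock TTS playback: streams 10 chunks, ~20ms apart."""
    for i in range(10):
        chunks_played.append(i)
        await asyncio.sleep(0.02)

async def caller_speaks_after(delay: float) -> None:
    """Mock VAD: resolves the moment the caller starts talking."""
    await asyncio.sleep(delay)

async def run_turn(chunks_played: list):
    playback = asyncio.create_task(play_response(chunks_played))
    await caller_speaks_after(0.05)  # caller barges in mid-sentence
    playback.cancel()                # stop generating/queueing audio now
    # Real pipeline: also send a "clear" message on the Twilio media
    # stream so already-buffered audio is flushed, not played out.
    try:
        await playback
    except asyncio.CancelledError:
        pass

played: list = []
asyncio.run(run_turn(played))
print(len(played))  # only the chunks emitted before the barge-in
```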

Fallback paths
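The shape I reach for (the deadline value and phrases below are illustrative, not a prescription): a hard deadline on the LLM's first token, a canned filler phrase if it is late, and a handoff line if the model errors outright. Dead air is the one thing a caller will not forgive.

```python
import asyncio

FIRST_TOKEN_DEADLINE_S = 1.0  # illustrative; tune against your own p99
FILLER = "One moment while I check that."
TRANSFER = "Let me connect you to someone who can help."

async def respond(llm_first_token):
    """llm_first_token: awaitable resolving to the model's first token."""
    try:
        return await asyncio.wait_for(llm_first_token, FIRST_TOKEN_DEADLINE_S)
    except asyncio.TimeoutError:
        return FILLER    # keep talking; retry the model in the background
    except Exception:
        return TRANSFER  # model down: hand off rather than dead-air the caller

async def slow_model():
    await asyncio.sleep(5)  # mock: first token arrives far too late
    return "Sure."

async def broken_model():
    raise RuntimeError("provider 500")  # mock: outright failure

late = asyncio.run(respond(slow_model()))
down = asyncio.run(respond(broken_model()))
print(late)  # One moment while I check that.
print(down)  # Let me connect you to someone who can help.
```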

What I learned the hard way

Per-call cost shape

| Component | Per-minute cost shape |
| --- | --- |
| Twilio voice + media stream | Cents per minute |
| STT | Sub-cent per minute on streaming providers |
| LLM | Depends on call length and model — usually the dominant cost |
| TTS | Cents per 1k characters |

For a 3-minute call, total per-call cost typically lands in the $0.10–$0.30 range with a small frontier model; a bigger model and a chattier agent push that up quickly.
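To make the shape concrete, here is the arithmetic with placeholder unit rates. Every number below is illustrative, not quoted from any provider's price sheet; the structural point is that telephony and STT scale with minutes, TTS with characters spoken, and the LLM (usually dominant) with tokens.

```python
# Illustrative unit rates only -- placeholders, not real provider pricing.
CALL_MINUTES = 3
RATES = {
    "telephony_per_min": 0.014,  # cents-per-minute shape
    "stt_per_min": 0.006,
    "tts_per_1k_chars": 0.015,
}
LLM_COST = 0.12                  # placeholder: model- and verbosity-dependent

agent_chars = 1200               # assume ~400 chars of agent speech per minute

cost = (CALL_MINUTES * (RATES["telephony_per_min"] + RATES["stt_per_min"])
        + agent_chars / 1000 * RATES["tts_per_1k_chars"]
        + LLM_COST)
print(f"${cost:.2f}")  # lands inside the $0.10-$0.30 range quoted above
```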

Monitoring
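The single number worth alerting on is time-to-first-audio per turn. A minimal sketch of a per-turn record (field names and the threshold are my choices, not a standard schema), checked against the ~800ms budget from earlier:

```python
from dataclasses import dataclass

@dataclass
class TurnMetrics:
    """Per-turn latency record; one of these per agent response."""
    stt_first_partial_ms: int
    llm_first_token_ms: int
    tts_first_chunk_ms: int
    barge_ins: int = 0  # interruptions are a quality signal too

    @property
    def time_to_first_audio_ms(self) -> int:
        # Stage times-to-first-output stack up on the critical path.
        return (self.stt_first_partial_ms
                + self.llm_first_token_ms
                + self.tts_first_chunk_ms)

    def over_budget(self, budget_ms: int = 800) -> bool:
        return self.time_to_first_audio_ms > budget_ms

turn = TurnMetrics(stt_first_partial_ms=140,
                   llm_first_token_ms=380,
                   tts_first_chunk_ms=210)
print(turn.time_to_first_audio_ms, turn.over_budget())  # 730 False
```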
