Voice Agents at Scale: ElevenLabs + Twilio + Claude

By OrionAI Build Editorial · Published 2026-05-10 · // build

Voice agents look easy in demos and break in production. Here's the pipeline I run for a small-business phone agent, covering the latency budget, barge-in, fallbacks, and monitoring.

The latency budget

The caller speaks. We need to start speaking back within ~800ms or the agent sounds robotic. That budget breaks down as follows:

Stage	Target
STT (streaming)	~150ms first partial
LLM first token	~250-400ms
TTS first audio chunk	~150-250ms
Network jitter buffer	~50-100ms

Any stage that waits for the previous one to finish blows this budget. The whole pipeline must stream.
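Adding up the stage targets shows how little slack there is. The sketch below sums the ranges from the table; the dictionary keys are my own labels, and the STT target is treated as a fixed 150ms because the table gives a single figure.

```python
# Stage targets from the table above, as (low_ms, high_ms) ranges.
# Key names are illustrative labels, not identifiers from any SDK.
BUDGET_MS = {
    "stt_first_partial": (150, 150),
    "llm_first_token": (250, 400),
    "tts_first_audio_chunk": (150, 250),
    "jitter_buffer": (50, 100),
}


def total_range(budget):
    """Sum the best-case and worst-case stage targets."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


low, high = total_range(BUDGET_MS)
print(f"time to first audio: {low}-{high} ms")  # 600-900 ms
```

The best case leaves about 200ms of headroom under ~800ms; hitting the top of every range overshoots it by about 100ms, and that is before any stage waits on another.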

The pipeline

Twilio media stream over WebSocket: raw audio frames.
Streaming STT: partial transcripts are emitted while the caller is still speaking.
Endpoint detection: a silence threshold plus a model that predicts whether the caller has finished their turn.
LLM call (streamed): starts as soon as the caller pauses, with the partial transcript already buffered.
Streaming TTS: the first audio chunk plays back over the call while the LLM is still emitting tokens.
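The key property of the steps above is that each stage consumes the previous stage's output as it arrives. Here is a minimal sketch of that shape using Python async generators. `fake_llm` and `fake_tts` are stand-ins I made up for illustration; a real pipeline would wrap the providers' streaming clients behind the same generator interface.

```python
import asyncio
from typing import AsyncIterator


async def fake_llm(transcript: str) -> AsyncIterator[str]:
    """Stand-in for a streamed LLM call: yields tokens one at a time."""
    for token in ["Sure,", " we", " open", " at", " nine."]:
        await asyncio.sleep(0)  # hand control back, as a network read would
        yield token


async def fake_tts(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Stand-in for streaming TTS: emits one chunk per token received."""
    async for token in tokens:
        yield token.encode()  # real code would yield synthesized audio frames


async def run_turn(transcript: str) -> list[str]:
    """Run one agent turn and record the order in which events happen."""
    events: list[str] = []

    async def logged_llm() -> AsyncIterator[str]:
        async for token in fake_llm(transcript):
            events.append(f"llm:{token.strip()}")
            yield token

    async for chunk in fake_tts(logged_llm()):
        events.append(f"audio:{chunk.decode().strip()}")
    return events


events = asyncio.run(run_turn("what time do you open"))
print(events)
```

The recorded events interleave (`llm:Sure,` then `audio:Sure,` then `llm:we`, and so on), which is the point: audio for the first token is produced before the LLM has finished the sentence.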

Barge-in

If the caller starts talking while the agent is mid-sentence, the agent must stop. Hard requirement. Implementation:

VAD (voice activity detection) on the caller-side stream.
When VAD detects speech and the agent is mid-utterance, send a "stop playback" frame to Twilio (Media Streams has a "clear" message that empties its buffered audio) and abort the in-flight TTS request.
Discard any LLM tokens generated past that point.
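The three steps above can be sketched as a small controller. This assumes a bidirectional Twilio Media Stream, where a `clear` message with the `streamSid` flushes audio that has been sent but not yet played. The class name, the `send_to_twilio` callable, and the generation counter used to drop stale tokens are my own illustration, not part of any SDK.

```python
import asyncio
import json


class BargeInController:
    """Stops agent playback when the caller starts talking (illustrative)."""

    def __init__(self, send_to_twilio, stream_sid):
        self.send_to_twilio = send_to_twilio  # async callable taking a JSON string
        self.stream_sid = stream_sid
        self.tts_task = None   # asyncio.Task for the in-flight TTS request
        self.generation = 0    # bumped on every new or interrupted utterance

    def start_utterance(self, tts_task):
        self.generation += 1
        self.tts_task = tts_task
        return self.generation

    def is_current(self, generation):
        """LLM tokens tagged with an old generation should be discarded."""
        return generation == self.generation

    async def on_caller_speech(self):
        """Call this when VAD reports speech on the caller-side stream."""
        if self.tts_task is None or self.tts_task.done():
            return False  # agent is not speaking; nothing to interrupt
        clear = {"event": "clear", "streamSid": self.stream_sid}
        await self.send_to_twilio(json.dumps(clear))  # flush buffered audio
        self.tts_task.cancel()                        # abort in-flight TTS
        self.tts_task = None
        self.generation += 1                          # invalidate stale tokens
        return True


async def demo():
    sent = []

    async def send(message):
        sent.append(json.loads(message))

    controller = BargeInController(send, "MZ-example")  # hypothetical stream SID
    tts_task = asyncio.create_task(asyncio.sleep(60))   # pretend TTS request
    generation = controller.start_utterance(tts_task)
    interrupted = await controller.on_caller_speech()
    await asyncio.gather(tts_task, return_exceptions=True)
    return sent, interrupted, tts_task.cancelled(), controller.is_current(generation)
```

Tagging each token with the generation it was produced under keeps the discard step cheap: the consumer checks `is_current` and drops anything from the interrupted turn.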

Fallback paths

If the LLM takes longer than 2s to respond, play a "let me check on that..." stall phrase while waiting.
If the LLM call fails 3 times, hand the call off to a human's voicemail.
If TTS provider degrades, fall back to a second TTS provider behind the same interface.
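Here is one way the three fallbacks might look in code, as a sketch rather than the production implementation. `call_llm`, `say`, `hand_off`, and the provider callables are hypothetical async functions supplied by the caller; the 2s and 3-attempt values come from the list above.

```python
import asyncio

STALL_AFTER_S = 2.0   # stall if the LLM takes longer than this
MAX_LLM_ATTEMPTS = 3  # hand off after this many failed LLM calls


async def reply_with_fallbacks(call_llm, say, hand_off, stall_after=STALL_AFTER_S):
    """Return the LLM reply, or None if the call was handed off."""
    stalled = False
    for _ in range(MAX_LLM_ATTEMPTS):
        task = asyncio.ensure_future(call_llm())
        # asyncio.wait does not cancel the task when the timeout expires.
        done, _pending = await asyncio.wait({task}, timeout=stall_after)
        if not done and not stalled:
            await say("Let me check on that...")
            stalled = True
        try:
            return await task  # real code would also want a hard upper bound
        except Exception:
            continue           # count this attempt as a failure and retry
    await hand_off()
    return None


async def synthesize(text, providers):
    """Try each TTS provider in order; all share one async call signature."""
    last_error = None
    for provider in providers:
        try:
            return await provider(text)
        except Exception as exc:
            last_error = exc
    raise RuntimeError("all TTS providers failed") from last_error


async def demo():
    said, handed_off = [], []

    async def say(text):
        said.append(text)

    async def hand_off():
        handed_off.append(True)

    async def slow_llm():
        await asyncio.sleep(0.05)
        return "We open at nine."

    async def broken_llm():
        raise RuntimeError("provider error")

    async def bad_tts(text):
        raise RuntimeError("degraded")

    async def good_tts(text):
        return b"audio"

    slow = await reply_with_fallbacks(slow_llm, say, hand_off, stall_after=0.01)
    failed = await reply_with_fallbacks(broken_llm, say, hand_off, stall_after=0.01)
    audio = await synthesize("hello", [bad_tts, good_tts])
    return slow, failed, said, handed_off, audio
```

Keeping both TTS providers behind the same `provider(text)` signature is what makes the last fallback a three-line loop.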

What I learned the hard way

Tokens are not equal. "Yes" takes <100ms to render. A list takes >1s. Prompt the model to lead with a short confirmation word ("Sure", "Got it") so audio starts fast.
Background noise breaks STT. Keep your STT accuracy expectations realistic and design for "did I hear that right?" confirmation loops.
Caller IDs and accents. Train the agent on phone numbers in spoken form ("eight-five-oh, six-eight-seven, two-zero-eight-five") not digit sequences.
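To get phone numbers into spoken form for prompt examples or for text sent to TTS, a small helper is enough. This is a sketch under my own assumptions: a 10-digit number grouped 3-3-4, and "zero" used throughout, even though callers often say "oh" as in the example above.

```python
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def spoken_phone_number(number: str, groups=(3, 3, 4)) -> str:
    """Spell out a phone number digit by digit, with a pause between groups."""
    digits = [c for c in number if c in DIGIT_WORDS]  # drop dashes, spaces, parens
    if len(digits) != sum(groups):
        raise ValueError(f"expected {sum(groups)} digits, got {len(digits)}")
    parts, start = [], 0
    for size in groups:
        group = digits[start:start + size]
        parts.append("-".join(DIGIT_WORDS[d] for d in group))
        start += size
    return ", ".join(parts)


print(spoken_phone_number("555-010-2345"))
# five-five-five, zero-one-zero, two-three-four-five
```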

Per-call cost shape

Component	Per-minute cost shape
Twilio voice + media stream	Cents per minute
STT	Sub-cent per minute on streaming providers
LLM	Depends on call length and model — usually the dominant cost
TTS	Cents per 1k characters

For a 3-minute call, total per-call cost typically lands in the $0.10–$0.30 range with a small frontier model. A bigger model or a chattier agent pushes that up quickly.
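The arithmetic behind a per-call estimate is simple enough to keep in a helper. In the sketch below every rate is a made-up placeholder, not a quote from any provider, and treating LLM spend as a flat per-minute figure is a simplification; plug in your own numbers.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """All figures in dollars. Field names are illustrative."""
    telephony_per_min: float
    stt_per_min: float
    llm_per_min: float
    tts_per_1k_chars: float


def per_call_cost(minutes: float, tts_chars: int, rates: Rates) -> float:
    """Per-minute components scale with call length; TTS scales with characters."""
    per_minute = rates.telephony_per_min + rates.stt_per_min + rates.llm_per_min
    return minutes * per_minute + (tts_chars / 1000) * rates.tts_per_1k_chars


# PLACEHOLDER rates, chosen only to make the arithmetic easy to follow.
example = Rates(
    telephony_per_min=0.02,
    stt_per_min=0.005,
    llm_per_min=0.03,
    tts_per_1k_chars=0.03,
)

cost = per_call_cost(minutes=3, tts_chars=1500, rates=example)
print(f"${cost:.2f}")  # 3 * 0.055 + 1.5 * 0.03 = 0.165 + 0.045 = $0.21
```

Because the LLM and TTS terms both grow with how much the agent says, a chattier agent raises two of the four components at once.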

Monitoring

P50/P95 first-audio-chunk latency. Anything past 1s and you'll see hangup rates spike.
Barge-in success rate.
Sampled transcript review — listen to 5-10 calls a week.
Hangup rate by call duration. Sudden hangups under 30s usually mean the agent confused the caller.
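For the first metric, here is a minimal sketch of computing P50/P95 from logged first-audio-chunk latencies and flagging the 1s threshold mentioned above. It uses the nearest-rank percentile definition; the sample latencies are invented for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_report(first_audio_ms, alert_ms=1000):
    """Summarize first-audio-chunk latency and flag a slow tail."""
    p50 = percentile(first_audio_ms, 50)
    p95 = percentile(first_audio_ms, 95)
    return {"p50_ms": p50, "p95_ms": p95, "alert": p95 > alert_ms}


# Invented sample: one latency (in ms) per agent turn.
samples = [620, 700, 710, 740, 780, 800, 820, 850, 900, 1400]
print(latency_report(samples))
# {'p50_ms': 780, 'p95_ms': 1400, 'alert': True}
```

Alerting on P95 rather than the mean matters here: the median in this sample looks healthy while the slow tail is already past the point where callers hang up.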
