Voice Agents at Scale: ElevenLabs + Twilio + Claude
Voice agents look easy in demos and break in production. Here's the pipeline I run for a small business phone agent: latency budget, barge-in, fallback, monitoring.
The latency budget
The caller finishes speaking. The agent needs to start speaking back within ~800ms or it sounds robotic. That budget breaks down by stage:
| Stage | Target |
|---|---|
| STT (streaming) | ~150ms first partial |
| LLM first token | ~250-400ms |
| TTS first audio chunk | ~150-250ms |
| Network jitter buffer | ~50-100ms |
Running the stages sequentially, each one waiting for the previous one to finish, blows this budget. The whole pipeline must stream.
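To make the arithmetic concrete, here is a minimal sketch that sums the per-stage targets from the table and flags a stage that runs over. The stage keys and the `over_budget` helper are names made up for illustration, not part of any provider's API.

```python
# Per-stage targets from the table above, as (low, high) milliseconds.
BUDGET_MS = {
    "stt_first_partial": (150, 150),
    "llm_first_token": (250, 400),
    "tts_first_chunk": (150, 250),
    "jitter_buffer": (50, 100),
}

def total_range(budget):
    """Sum the best-case and worst-case targets across all stages."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high

def over_budget(measured_ms, budget):
    """Return the stages whose measured latency exceeds the high target."""
    return [stage for stage, ms in measured_ms.items() if ms > budget[stage][1]]

low, high = total_range(BUDGET_MS)
print(f"time to first audio: {low}-{high} ms")  # prints "time to first audio: 600-900 ms"
```

The ~800ms goal sits inside that 600-900ms range, so there is no slack if every stage lands at the slow end of its target at once.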
The pipeline
- Twilio media stream over WebSocket — raw audio frames.
- Streaming STT — partial transcripts emit while caller is still speaking.
- Endpoint detection — silence threshold + a smart "are they still talking?" model.
- LLM call (streamed) — start as soon as the user pauses, with the partial transcript already buffered.
- Streaming TTS — first audio chunk plays back over the call as the LLM is still emitting tokens.
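The shape of the last two stages can be sketched with chained async generators. This is a toy under stated assumptions: `fake_llm_tokens` and `fake_tts` are stand-ins for real streaming clients, and cutting the token stream at sentence boundaries before handing it to TTS is one possible chunking choice, not something the steps above prescribe.

```python
import asyncio

async def fake_llm_tokens(prompt):
    """Stand-in for a streamed LLM response; yields tokens one at a time."""
    for token in ["Sure, ", "we ", "open ", "at ", "nine. ", "Anything ", "else?"]:
        await asyncio.sleep(0)  # a real client would await the network here
        yield token

async def sentences(tokens):
    """Group streamed tokens into sentence-sized chunks for TTS."""
    buf = ""
    async for tok in tokens:
        buf += tok
        if buf.rstrip().endswith((".", "?", "!")):
            yield buf.strip()
            buf = ""
    if buf.strip():
        yield buf.strip()  # flush a trailing fragment with no punctuation

async def fake_tts(chunks):
    """Stand-in for streaming TTS; yields one audio payload per text chunk."""
    async for text in chunks:
        yield f"<audio:{text}>"

async def respond(transcript):
    """Chain LLM -> sentence chunker -> TTS without waiting for a full reply."""
    played = []
    async for audio in fake_tts(sentences(fake_llm_tokens(transcript))):
        played.append(audio)  # a real agent would write this to the call
    return played

print(asyncio.run(respond("what time do you open")))
```

The first audio payload is produced as soon as the first sentence is complete, while the generator upstream still has tokens left to emit.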
Barge-in
If the caller starts talking while the agent is mid-sentence, the agent must stop. Hard requirement. Implementation:
- VAD (voice activity detection) on the caller-side stream.
- When VAD detects speech and the agent is mid-utterance, send a "stop playback" frame to Twilio and abort the in-flight TTS request.
- Discard any LLM tokens generated past that point.
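A minimal sketch of that sequence, assuming the agent's speech runs as a cancellable asyncio task. `FakeCall` stands in for the WebSocket, the VAD trigger is simulated with a short sleep, and the stream SID is a made-up value. The `clear` event mirrors, to my understanding, the message Twilio's bidirectional Media Streams accept for flushing buffered audio; confirm the exact message shapes against Twilio's docs before relying on them.

```python
import asyncio
import json

class FakeCall:
    """Stand-in for the call's WebSocket; records what would be sent."""
    def __init__(self):
        self.sent = []

    async def send(self, message):
        self.sent.append(message)

async def speak(call, stream_sid, chunks):
    """Send agent audio chunk by chunk until finished or cancelled."""
    for chunk in chunks:
        # In a real stream the payload would be base64-encoded audio.
        msg = {"event": "media", "streamSid": stream_sid, "media": {"payload": chunk}}
        await call.send(json.dumps(msg))
        await asyncio.sleep(1.0)  # simulated playback time per chunk

async def handle_barge_in(call, speaking_task, stream_sid):
    """On caller speech: cancel generation, then flush buffered audio."""
    if speaking_task.done():
        return False  # agent was not mid-utterance, nothing to stop
    speaking_task.cancel()  # aborts in-flight TTS and drops unread LLM tokens
    try:
        await speaking_task
    except asyncio.CancelledError:
        pass
    await call.send(json.dumps({"event": "clear", "streamSid": stream_sid}))
    return True

async def demo():
    call = FakeCall()
    sid = "MZ-hypothetical"  # made-up stream SID for the demo
    task = asyncio.create_task(speak(call, sid, ["a", "b", "c", "d"]))
    await asyncio.sleep(0.01)  # simulated VAD firing during the first chunk
    interrupted = await handle_barge_in(call, task, sid)
    return interrupted, [json.loads(m)["event"] for m in call.sent]

print(asyncio.run(demo()))  # prints (True, ['media', 'clear'])
```

Keeping LLM consumption and TTS inside one task means a single `cancel()` covers both the abort and the discard steps.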
Fallback paths
- If LLM latency > 2s, play a "let me check on that..." stall phrase while waiting.
- If LLM fails 3 times, hand off to human voicemail.
- If TTS provider degrades, fall back to a second TTS provider behind the same interface.
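These three rules can be sketched as two small wrappers. Everything here is hypothetical scaffolding: `ask_llm`, `say`, `hand_off`, and the provider callables are placeholders for whatever clients sit behind your interface, and "degrades" is simplified to "raises an exception".

```python
import asyncio

STALL_AFTER_S = 2.0  # from the list above: stall once the LLM is >2s late
MAX_ATTEMPTS = 3     # from the list above: hand off after 3 failures

async def answer_with_fallback(ask_llm, say, hand_off, stall_after=STALL_AFTER_S):
    """Try the LLM up to MAX_ATTEMPTS times, stalling out loud if it is slow."""
    for _ in range(MAX_ATTEMPTS):
        task = asyncio.create_task(ask_llm())
        try:
            done, _pending = await asyncio.wait({task}, timeout=stall_after)
            if not done:
                # Still waiting: fill the silence, but keep the request alive.
                await say("Let me check on that...")
            return await task
        except Exception:
            continue  # count this as a failed attempt and retry
    await hand_off()
    return None

async def synthesize(text, providers):
    """Try each TTS provider in order; all share the same call signature."""
    for tts in providers:
        try:
            return await tts(text)
        except Exception:
            continue  # this provider failed, fall through to the next one
    raise RuntimeError("all TTS providers failed")

async def demo():
    async def primary_tts(text):
        raise ConnectionError("primary TTS is down")  # simulated outage

    async def backup_tts(text):
        return f"<backup-audio:{text}>"

    return await synthesize("hello", [primary_tts, backup_tts])

print(asyncio.run(demo()))  # prints <backup-audio:hello>
```

`asyncio.wait` with a timeout leaves the request running, which is the point: the stall phrase buys time without throwing away an answer that may be a few hundred milliseconds out.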
What I learned the hard way
- Tokens are not equal. "Yes" takes <100ms to render. A list takes >1s. Prompt the model to lead with a short confirmatory word so audio starts fast.
- Background noise breaks STT. Cap STT accuracy expectations and design for "did I hear that right?" loops.
- Caller IDs and accents. Train the agent on phone numbers in spoken form ("eight-five-oh, six-eight-seven, two-zero-eight-five") not digit sequences.
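For the phone-number point, a small helper can turn digits into the spoken form before they reach the prompt or the TTS input. This sketch assumes 10-digit North American numbers grouped 3-3-4 and always says "zero" (the example above mixes "oh" and "zero"); `spoken_number` is a made-up name.

```python
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def spoken_number(raw):
    """Convert a phone number string into hyphenated spoken digits."""
    digits = re.sub(r"\D", "", raw)  # drop spaces, dashes, parentheses, "+"
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # assumption: a leading 1 is the country code
    if len(digits) == 10:
        groups = [digits[:3], digits[3:6], digits[6:]]
    else:
        groups = [digits]  # unknown format: read it as one run of digits
    return ", ".join("-".join(DIGIT_WORDS[int(d)] for d in g) for g in groups)

print(spoken_number("(850) 687-2085"))
# prints: eight-five-zero, six-eight-seven, two-zero-eight-five
```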
Per-call cost shape
| Component | Per-minute cost shape |
|---|---|
| Twilio voice + media stream | Cents per minute |
| STT | Sub-cent per minute on streaming providers |
| LLM | Depends on call length and model — usually the dominant cost |
| TTS | Cents per 1k characters |
For a 3-minute call, total per-call cost typically lands in the $0.10–$0.30 range with a small frontier model. A bigger model or a chattier agent moves that up quickly.
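The cost shape can be written as a one-line formula. In the sketch below every rate is a placeholder chosen only to exercise the arithmetic, not a quote from any provider, and folding LLM spend into a per-minute figure is a simplification, since LLM pricing is normally per token.

```python
from dataclasses import dataclass

@dataclass
class Rates:
    """Unit prices in dollars. Fill these in from your own provider bills."""
    telephony_per_min: float
    stt_per_min: float
    llm_per_min: float        # simplification: real LLM pricing is per token
    tts_per_1k_chars: float

def call_cost(minutes, tts_chars, rates):
    """Per-call cost: per-minute components plus per-character TTS."""
    per_minute = rates.telephony_per_min + rates.stt_per_min + rates.llm_per_min
    return minutes * per_minute + (tts_chars / 1000) * rates.tts_per_1k_chars

# PLACEHOLDER rates for illustration only; substitute real numbers.
example = Rates(telephony_per_min=0.015, stt_per_min=0.005,
                llm_per_min=0.03, tts_per_1k_chars=0.02)
print(f"${call_cost(3, 1500, example):.2f} per call")
```

Splitting the rates out this way makes it easy to see which term moves when the model gets bigger (`llm_per_min`) or the agent gets chattier (`tts_chars`, and usually `minutes` too).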
Monitoring
- P50/P95 first-audio-chunk latency. Anything past 1s and you'll see hangup rates spike.
- Barge-in success rate.
- Sampled transcript review — listen to 5-10 calls a week.
- Hangup rate by call duration. Sudden hangups under 30s usually mean the agent confused the caller.
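Two of these metrics are simple to compute from per-call logs. A minimal sketch, assuming you already record first-audio latency in milliseconds and call duration in seconds; the function names and the sample data are made up for illustration.

```python
import statistics

P95_ALERT_MS = 1000  # from the list above: past 1s, hangups spike

def latency_percentiles(samples_ms):
    """Return P50 and P95 of first-audio-chunk latency (needs >= 2 samples)."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}  # cuts[k-1] is the k-th percentile

def early_hangup_rate(durations_s, threshold_s=30):
    """Fraction of calls that ended before threshold_s seconds."""
    if not durations_s:
        return 0.0
    return sum(d < threshold_s for d in durations_s) / len(durations_s)

latencies = [620, 700, 710, 760, 790, 810, 850, 900, 950, 1400]  # made-up data
stats = latency_percentiles(latencies)
if stats["p95"] > P95_ALERT_MS:
    print(f"P95 first-audio latency is {stats['p95']:.0f} ms, over the 1s line")

print(early_hangup_rate([12, 45, 200, 25]))  # prints 0.5
```

Tracking P95 alongside P50 matters here because a healthy median can hide the slow tail that drives hangups.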