Deep Dives · 10 min read · Jose M. Cobian

TTFB in Voice AI: Measuring and Optimizing Time to First Audio Byte

Why time-to-first-byte is the most important metric in voice AI, how to measure it across Twilio + ElevenLabs, and optimization strategies that cut TTFB by 60%.

TL;DR — The short answer

1. End-to-end TTFB in voice AI is the sum of LLM inference time, TTS first-chunk latency, and network-to-Twilio overhead — and optimizing only one stage while ignoring the others produces diminishing returns.

2. The caller's experience degrades sharply above 800ms TTFB: callers start interpreting silence as system failure, repeat utterances, or hang up — creating downstream call quality and cost problems that appear disconnected from the latency root cause.

3. Sentence-streaming architecture (submitting the first complete LLM sentence to ElevenLabs before the full response completes) is the highest-impact single architectural change for most deployments and typically reduces effective TTFB by 40–60%.

4. Cold-start overhead — TLS handshake, WebSocket upgrade, and initial frame exchange — adds 80–200ms to every ElevenLabs connection; connection pre-warming eliminates this from the critical path entirely.

Defining TTFB in voice AI: what the metric actually measures

In web performance, TTFB measures the time from an HTTP request to the first byte of the HTTP response. In voice AI, the equivalent metric is more complex: it is the time from the end of the caller's utterance to the first audio byte reaching the caller's ear. This spans multiple systems and multiple provider boundaries, and it is the metric most directly correlated with caller experience quality.
The voice AI TTFB pipeline in a typical Twilio + ElevenLabs deployment:
1. Speech recognition (if applicable): The caller speaks. Twilio streams audio via media streams or WebRTC to your server. Your server processes audio through a speech-to-text system (Deepgram, Whisper, Twilio's own STT, or the STT built into Vapi/Retell). End-of-utterance detection fires. Typical contribution: 50–200ms for cloud STT; 0ms if you are using your own end-of-utterance detection on the raw audio stream.
2. LLM inference: Your server sends the transcript to the language model (GPT-4o, Claude, Llama, etc.). The LLM generates a response. Depending on your architecture, you wait for either the first token or the complete response before proceeding. Typical contribution: 150–400ms for fast models (GPT-4o-mini, Claude Haiku) to 500–1,200ms for larger models.
3. TTS synthesis: Your server sends the LLM response text to ElevenLabs. ElevenLabs generates the first audio chunk. Your server receives the first audio byte. Typical contribution: 75–200ms for eleven_flash_v2_5 with streaming; 200–400ms for turbo.
4. Network transmission to Twilio: Your server forwards the audio stream to Twilio's media infrastructure. Twilio processes and routes the audio to the caller's handset. Typical contribution: 30–80ms on a well-located server.
Total p50 TTFB for an optimized eleven_flash_v2_5 + fast LLM deployment: 350–600ms. Total p50 for a less optimized deployment (full-response LLM, non-streaming TTS, slow model): 800–1,800ms. The gap between these numbers is the opportunity that TTFB optimization targets.

The cold start problem: where most TTFB variance originates

Cold start latency in voice AI has a different meaning than in serverless computing, where it refers to function initialization time. In voice AI, cold start refers to the overhead of establishing provider connections — specifically ElevenLabs WebSocket connections and, in some architectures, LLM provider connections — at the start of each call rather than maintaining warm connections across calls.
An ElevenLabs WebSocket connection cold start includes:
- DNS resolution: 10–50ms (depends on TTL of cached records)
- TCP connection: 20–60ms (depends on geographic routing)
- TLS handshake: 40–100ms (1-RTT TLS 1.3 typical)
- HTTP/1.1 upgrade to WebSocket: 20–40ms
- ElevenLabs initial frame exchange: 10–30ms
Total cold start overhead: 100–280ms, added to every synthesis request that uses a cold connection.
In a deployment that creates a new ElevenLabs connection for every call, this overhead is paid on the first synthesis request of every call. For a p50 TTFB of 500ms in an optimized deployment, cold start overhead represents 20–56% of the total. Eliminating it via connection pre-warming (establishing the connection when the call arrives, before the LLM response is ready) directly reduces TTFB by the cold start amount.
The same cold start problem applies to LLM providers that use HTTP/1.1 rather than HTTP/2. HTTP/1.1 connections are not multiplexed — each LLM request that needs a new connection pays TCP + TLS overhead. Most modern LLM providers support HTTP/2, which amortizes connection overhead across requests. Verify that your HTTP client library uses HTTP/2 for LLM provider calls. In Node.js, the default https module uses HTTP/1.1; use undici or a client explicitly configured for HTTP/2 for LLM API calls to reduce per-request connection overhead.
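As a sketch of the undici approach, assuming the `undici` package is installed: route LLM API calls through one HTTP/2-capable agent so repeated requests multiplex over a single warm connection rather than paying TCP + TLS per request. The endpoint URL and request body shape below are illustrative.

```typescript
import { Agent, fetch } from "undici";

// One shared agent for all LLM calls. allowH2 negotiates HTTP/2 via ALPN
// where the provider supports it; keep-alive holds the connection warm
// between conversational turns.
const llmAgent = new Agent({
  allowH2: true,
  connections: 4,          // small pool; H2 multiplexing keeps this low
  keepAliveTimeout: 60_000,
});

async function chatCompletion(body: unknown): Promise<unknown> {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    dispatcher: llmAgent, // reuse the warm, multiplexed connection
    headers: {
      "content-type": "application/json",
      authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify(body),
  });
  return res.json();
}
```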
For voice AI deployments running on Kubernetes or containerized infrastructure, warm-up time after deployment (the period before connections are established) also produces elevated TTFB on the first calls to a new pod. Add a readiness probe that includes a synthetic ElevenLabs and LLM request before the pod is marked ready, ensuring pods only receive production traffic after connections are warm.
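A readiness gate for this pattern can be a small HTTP endpoint that returns 503 until the synthetic warm-up checks have succeeded once. The sketch below assumes you supply the warm-up functions (e.g. a synthetic ElevenLabs synthesis and a one-token LLM request; those callbacks are yours, not shown here):

```typescript
import http from "node:http";

// Pod stays "not ready" until every warm-up check resolves once.
class ReadinessGate {
  private ready = false;

  async warmUp(checks: Array<() => Promise<void>>): Promise<void> {
    await Promise.all(checks.map((check) => check()));
    this.ready = true;
  }

  isReady(): boolean {
    return this.ready;
  }
}

// Point the Kubernetes readinessProbe at GET /ready on this server.
function serveReadiness(gate: ReadinessGate, port = 8080): http.Server {
  return http
    .createServer((req, res) => {
      if (req.url === "/ready") {
        res.writeHead(gate.isReady() ? 200 : 503).end();
      } else {
        res.writeHead(404).end();
      }
    })
    .listen(port);
}
```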

Measuring TTFB across Twilio and ElevenLabs: instrumentation guide

Accurate TTFB measurement requires stage-level timestamps captured in your application, not estimates from provider dashboards. Provider dashboards measure their own internal latency, not the end-to-end time from the caller's perspective.
The five timestamps to capture on every call:
```typescript
interface TTFBEvent {
  call_sid: string;             // Twilio CallSid for correlation
  vapi_call_id?: string;        // if using Vapi
  utterance_end_at: number;     // ms since epoch — when STT fires end-of-utterance
  llm_first_token_at: number;   // ms since epoch — first LLM output token
  tts_request_at: number;       // ms since epoch — ElevenLabs request submitted
  tts_first_chunk_at: number;   // ms since epoch — first audio byte received
  twilio_first_byte_at: number; // ms since epoch — first audio byte sent to Twilio
  input_char_count: number;     // characters sent to ElevenLabs
  model: string;                // ElevenLabs model name
}
```
From these fields, compute:
- `stt_latency = llm_first_token_at - utterance_end_at` (0 if no STT; this is your LLM queue + processing time in sentence-streaming architectures)
- `llm_to_tts_handoff = tts_request_at - llm_first_token_at` (should be near 0 in optimal architectures)
- `tts_first_chunk_latency = tts_first_chunk_at - tts_request_at` (ElevenLabs generation time)
- `tts_to_twilio_latency = twilio_first_byte_at - tts_first_chunk_at` (your server processing and forwarding)
- `total_ttfb = twilio_first_byte_at - utterance_end_at`
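These derived metrics are pure arithmetic over the captured timestamps, so they are worth computing in one shared helper rather than ad hoc in dashboards. A sketch (the event shape mirrors the timestamp fields of the instrumentation snippet above):

```typescript
// Timestamp fields captured per utterance, all in ms since epoch.
type TimingFields = {
  utterance_end_at: number;
  llm_first_token_at: number;
  tts_request_at: number;
  tts_first_chunk_at: number;
  twilio_first_byte_at: number;
};

// Stage-level latencies; the four stage metrics sum to total_ttfb.
function deriveTtfbMetrics(e: TimingFields) {
  return {
    stt_latency: e.llm_first_token_at - e.utterance_end_at,
    llm_to_tts_handoff: e.tts_request_at - e.llm_first_token_at,
    tts_first_chunk_latency: e.tts_first_chunk_at - e.tts_request_at,
    tts_to_twilio_latency: e.twilio_first_byte_at - e.tts_first_chunk_at,
    total_ttfb: e.twilio_first_byte_at - e.utterance_end_at,
  };
}
```

Because every stage shares boundaries with its neighbors, the four stage metrics always sum exactly to `total_ttfb`, which is a useful invariant to assert in your logging pipeline.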
Store these in a time-series table indexed on call_sid and utterance_end_at. Query percentiles at p50, p90, p95, and p99 weekly. Break down by model, time of day, and input character count to identify patterns. A p95 that is more than 3x the p50 indicates high variance — investigate whether the outliers cluster by time of day (ElevenLabs load) or by input length (prompt verbosity).
For deployments using Vapi or Retell, these platforms expose partial TTFB data in their call event webhooks under latency or timing fields. However, they measure from their own internal perspective, not from the caller's. Use their data as a supplement, not a replacement for application-level timestamps.

eleven_flash_v2_5 streaming: the architecture that makes sub-400ms possible

The 2026 benchmark for an optimized voice AI deployment is under 400ms TTFB at p50. Achieving it requires combining three elements: eleven_flash_v2_5 (the lowest-latency ElevenLabs model), true chunk-streaming from ElevenLabs to Twilio (not buffering), and sentence-streaming from the LLM (not waiting for full response).
ElevenLabs' eleven_flash_v2_5 model is architected specifically for streaming first-chunk delivery. The model begins generating audio from the first few tokens of input, using a predict-ahead mechanism that allows initial phoneme synthesis to begin before the full input has been processed. For inputs under 100 characters, this produces first-chunk latency of 75–120ms under light API load.
The streaming pipeline from ElevenLabs to Twilio:
1. Submit the synthesis request to the ElevenLabs WebSocket API with streaming=true.
2. Receive the first audio chunk (typically 4–8KB of PCM audio) within 75–120ms.
3. Forward the chunk to the Twilio media stream socket in the same event loop tick — no buffering.
4. Continue receiving and forwarding chunks until stream completion.
The critical implementation requirement in Node.js: the ElevenLabs WebSocket onmessage handler must pipe directly to the Twilio WebSocket send call. Any intermediate buffering — accumulating multiple chunks before forwarding, writing to a file and reading it back, or passing through a transform stream with async operations — adds latency in the forwarding path. The forwarding latency should be under 2ms for a direct pipe.
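The forwarding step reduces to wrapping each chunk in a Twilio media-stream message and sending it synchronously. A sketch, assuming the ElevenLabs stream is configured to emit 8kHz μ-law (`ulaw_8000`), which is the encoding Twilio media streams expect:

```typescript
// Wrap a raw audio chunk in the Twilio media-stream message format:
// an "media" event carrying base64 audio, addressed by streamSid.
function toTwilioMediaMessage(chunk: Buffer, streamSid: string): string {
  return JSON.stringify({
    event: "media",
    streamSid,
    media: { payload: chunk.toString("base64") },
  });
}

// Usage inside the ElevenLabs socket handler (elevenWs/twilioWs are your
// already-open WebSocket instances; sketch, not a full integration):
//
//   elevenWs.on("message", (chunk: Buffer) => {
//     twilioWs.send(toTwilioMediaMessage(chunk, streamSid)); // same tick, no buffering
//   });
```

The handler body contains no awaits and no intermediate accumulation, which is what keeps the forwarding latency in the low single-digit milliseconds.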
For Vapi deployments: Vapi handles the ElevenLabs integration internally. The voice.chunkPlan.punctuationBoundaries setting controls when Vapi sends text segments to ElevenLabs. Setting this to include all punctuation (or disabling the chunk plan for minimum-latency mode) ensures Vapi submits text to ElevenLabs as early as possible in the LLM response generation. Review your Vapi assistant configuration's voice settings and compare first-chunk latency with the chunkPlan enabled vs. disabled using the Vapi latency metrics in the dashboard.

WebSocket vs REST for TTS: the latency impact

ElevenLabs offers two delivery mechanisms for TTS synthesis: a REST API (HTTP POST) and a WebSocket API. The choice between them has a significant impact on TTFB and is not always clearly documented in ElevenLabs' guides.
REST API (non-streaming): Submit a POST request with the text. ElevenLabs generates the complete audio file. Returns the complete audio in the response body. First-byte delivery to your server occurs when the complete audio is ready. For a 200-character input on eleven_flash_v2_5, this takes 250–400ms — the full generation time.
REST API (streaming with chunked transfer encoding): Submit a POST request with stream=true in the body. ElevenLabs begins returning audio chunks as they are generated. First-byte delivery to your server occurs when the first chunk is ready — approximately 75–150ms for flash. This is a streaming response over a standard HTTP/1.1 or HTTP/2 connection.
WebSocket API: Establish a persistent WebSocket connection. Send text messages to trigger synthesis. Receive audio chunks as WebSocket binary messages. First-chunk latency is equivalent to the REST streaming endpoint, but the connection overhead is paid once per session (not per request). This is the correct choice for conversation where you are making multiple synthesis requests in a single call.
For single-request, single-call workflows: REST streaming is equivalent in latency to WebSocket but simpler to implement. For multi-turn conversations (which is the standard voice AI case): WebSocket is substantially better because the cold-start overhead of connection establishment is paid once per call, not once per utterance. In a 5-turn conversation, REST requires 5 connections (5 × 100–280ms cold-start); WebSocket requires 1 connection (1 × 100–280ms cold-start, amortized across all 5 turns).
Implementation consideration: WebSocket connections must be managed carefully in concurrent environments. Use a per-call connection lifecycle (open on call start, close on call end) rather than a global pool, to prevent audio bleeding between calls. Ensure the WebSocket close event is handled gracefully and the connection is not reused after receiving an error frame from ElevenLabs.

Sentence-streaming architecture: cutting TTFB by 40–60%

In a full-response architecture, the LLM generates its complete response before any text is submitted to ElevenLabs. This means TTS synthesis does not begin until the LLM finishes — for a 300-character response from GPT-4o-mini, that is 400–700ms of LLM inference time during which ElevenLabs is idle. The caller waits for both stages sequentially.
In a sentence-streaming architecture, TTS synthesis begins on the first complete sentence as soon as the LLM generates the sentence-ending punctuation. For a 300-character, 2-sentence response, the first sentence (approximately 120 characters) is typically complete at 40–50% of total LLM inference time. Submitting this first sentence to ElevenLabs immediately means the caller begins hearing audio while the LLM is still generating the second sentence.
The timing improvement: if LLM inference takes 500ms total and the first sentence completes at 200ms (40% through), and TTS first-chunk for a 120-character input takes 100ms, the caller hears the first audio at 300ms. In the full-response architecture, the caller hears audio at 600ms (500ms LLM + 100ms TTS). The sentence-streaming improvement is 300ms — a 50% reduction in perceived TTFB.
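The arithmetic above generalizes to a small model you can run against your own measured numbers. A sketch:

```typescript
// Perceived TTFB under each architecture, given total LLM inference time,
// the point at which the first sentence completes, and TTS first-chunk latency.
function perceivedTtfb(opts: {
  llmTotalMs: number;        // full LLM response generation time
  firstSentenceAtMs: number; // when sentence-ending punctuation first appears
  ttsFirstChunkMs: number;   // ElevenLabs first-chunk latency for that sentence
}) {
  const fullResponse = opts.llmTotalMs + opts.ttsFirstChunkMs;
  const sentenceStreaming = opts.firstSentenceAtMs + opts.ttsFirstChunkMs;
  return {
    fullResponse,
    sentenceStreaming,
    reduction: 1 - sentenceStreaming / fullResponse,
  };
}
```

With the numbers from the worked example (500ms LLM, first sentence at 200ms, 100ms TTS first chunk), this yields 600ms vs. 300ms, a 50% reduction.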
Implementation requirements for sentence-streaming:
Sentence boundary detection in LLM token stream: Buffer tokens and detect sentence-ending punctuation (., !, ?) followed by a space or newline. Flush the buffer to the TTS submission queue when a boundary is detected. The boundary detector must handle edge cases: decimal numbers (1.5), abbreviations (Dr.), and ellipses (...) should not trigger premature flushes.
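A minimal boundary detector for the token stream might look like the following sketch. It flushes on `.`, `!`, or `?` followed by whitespace, and skips decimals, a small abbreviation list, and ellipses; a production detector needs a much fuller abbreviation set and locale handling:

```typescript
// Common abbreviations that end in "." but do not end a sentence.
const ABBREVIATIONS = new Set(["Dr", "Mr", "Mrs", "Ms", "St", "vs", "etc"]);

class SentenceSplitter {
  private buf = "";

  // Feed one LLM token; returns any complete sentences crossed so far.
  push(token: string): string[] {
    this.buf += token;
    const out: string[] = [];
    let start = 0;
    for (let i = 0; i < this.buf.length - 1; i++) {
      const ch = this.buf[i];
      if (!".!?".includes(ch)) continue;
      if (!/\s/.test(this.buf[i + 1])) continue; // decimals like "1.5" fail this check
      if (ch === ".") {
        const prevWord = this.buf.slice(start, i).trimEnd().split(/\s+/).pop() ?? "";
        if (ABBREVIATIONS.has(prevWord)) continue; // "Dr.", "etc." and friends
        if (this.buf[i - 1] === ".") continue;     // middle/end of an ellipsis "..."
      }
      out.push(this.buf.slice(start, i + 1).trim());
      start = i + 1;
    }
    this.buf = this.buf.slice(start);
    return out;
  }

  // Call at end of the LLM stream to emit the trailing fragment, if any.
  flush(): string {
    const rest = this.buf.trim();
    this.buf = "";
    return rest;
  }
}
```

Each string returned by `push` is ready to submit to ElevenLabs immediately, while the LLM continues generating the rest of the response.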
Synthesis queue management: The second sentence may complete synthesis before the first sentence finishes playing. Maintain a queue of synthesized audio chunks and begin playing the next sentence immediately when the current sentence's audio ends, without gaps.
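The ordering requirement can be isolated in a small queue that accepts synthesized audio in any order but releases it strictly in sentence order. A simplified sketch (real playback also waits for the previous sentence's audio to finish before starting the next; this shows only the ordering logic):

```typescript
// Accepts synthesized audio keyed by sentence index (completion order may
// differ from sentence order) and releases it strictly in order.
class PlaybackQueue {
  private ready = new Map<number, Buffer>();
  private next = 0;

  // `play` forwards audio to the Twilio media stream.
  constructor(private play: (audio: Buffer) => void) {}

  // Called whenever sentence `index` finishes synthesis, in any order.
  onSynthesized(index: number, audio: Buffer): void {
    this.ready.set(index, audio);
    this.drain();
  }

  private drain(): void {
    while (this.ready.has(this.next)) {
      const audio = this.ready.get(this.next)!;
      this.ready.delete(this.next);
      this.play(audio);
      this.next++;
    }
  }
}
```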
Handling sentence-straddling prosody: The TTS model optimizes prosody for the submitted text segment. A sentence submitted in isolation may have different intonation than the same sentence submitted with the next sentence appended. Some teams append a placeholder for the next sentence to the current submission to improve prosody continuity — at the cost of slightly longer input (and slightly higher ElevenLabs character charges).
For Retell AI deployments: Retell implements sentence-streaming internally and exposes the response_delay parameter to control the minimum delay before sending text to TTS. Setting response_delay=0 enables maximum-speed sentence streaming. For Vapi: the chunkPlan configuration controls the equivalent behavior.

Benchmark numbers: what sub-400ms TTFB looks like in production

Real-world TTFB benchmarks from optimized voice AI deployments in 2026, measured end-to-end from caller utterance end to first audio byte at caller handset:
Optimized configuration (eleven_flash_v2_5 + sentence streaming + pre-warmed connections + GPT-4o-mini):
- p50: 320–380ms
- p90: 480–560ms
- p95: 580–700ms
- p99: 900–1,200ms

Standard configuration (eleven_turbo_v2_5 + full-response + cold connections + GPT-4o):
- p50: 750–950ms
- p90: 1,100–1,400ms
- p95: 1,300–1,800ms
- p99: 2,000–3,000ms

Vapi managed deployment (eleven_flash_v2_5 + Vapi default chunkPlan):
- p50: 420–550ms
- p90: 650–850ms
- p95: 780–1,000ms

Retell AI managed deployment (eleven_flash_v2_5 + response_delay=0):
- p50: 380–480ms
- p90: 580–720ms
- p95: 700–900ms
The gap between the optimized configuration and the standard configuration at p50 is 400–600ms. This is audible and measurable in caller experience surveys. Teams that have run A/B tests between these configurations report 15–25% higher caller satisfaction scores for the optimized configuration, and 8–12% lower early-hangup rates (callers who disconnect within the first 15 seconds of the AI speaking).
The p99 variance even in optimized configurations reflects ElevenLabs API load events, LLM inference cluster contention, and network path anomalies — factors outside your control. The correct response to high p99 is not to target p99 with configuration changes (which would require unacceptably long silence timeouts) but to detect p95+ calls in production, correlate them with provider load events, and alert on sustained p95 degradation that signals a systematic issue rather than random tail variance.
Sherlock monitors TTFB percentiles across your Twilio + ElevenLabs stack automatically, alerting when p95 rises above your configured threshold. Connect your accounts at [usesherlock.ai](https://usesherlock.ai/?utm_source=blog&utm_medium=content&utm_campaign=ttfb-guide) to enable TTFB monitoring alongside call quality and cost tracking.

Explore Sherlock for your voice stack

Frequently asked questions

What exactly is TTFB in the context of voice AI?

In voice AI, TTFB (time-to-first-byte, adapted from web performance terminology) refers to the time elapsed from the end of the caller's utterance to the moment the first audio byte of the AI's response reaches the caller's handset. This is distinct from both the TTS generation latency (the time ElevenLabs takes to produce the first audio chunk) and the LLM response latency (the time the language model takes to generate the first token). End-to-end TTFB in a Twilio + ElevenLabs deployment spans: speech recognition (if applicable), LLM inference, TTS synthesis first-chunk, network transmission to Twilio, and Twilio's media stream processing. Each stage contributes independently to the total, and optimization requires knowing which stage dominates your latency budget. For most deployments in 2026 using eleven_flash_v2_5, TTS first-chunk is the largest single contributor at 75–200ms, followed by LLM inference at 150–800ms depending on model.

What TTFB should I target for a good caller experience?

Research on telephony UX and conversational AI consistently points to a target of under 500ms end-to-end TTFB for a natural-feeling conversational interaction. At 500ms, the pause after the caller speaks and before the AI responds is perceptible but acceptable — within the range of human conversational rhythm. At 800ms, the pause is noticeably slow and callers begin to wonder if the call dropped. At 1,200ms+, a significant fraction of callers interpret the silence as a system failure and either hang up or repeat themselves, causing the AI to process a duplicate utterance. For enterprise voice AI where caller trust is paramount, target under 400ms p50 and under 700ms p95. For lower-stakes automated workflows, under 600ms p50 and under 1,000ms p95 are reasonable targets.

How do I measure TTFB across Twilio and ElevenLabs?

End-to-end TTFB measurement requires timestamps at each stage boundary. The timestamps you need: (1) caller_utterance_end — when your speech-to-text system detects end-of-utterance; (2) llm_first_token — when the LLM generates its first output token; (3) tts_request_sent — when you submit the text to ElevenLabs; (4) tts_first_chunk_received — when your server receives the first audio byte from ElevenLabs; (5) audio_first_byte_to_twilio — when your server forwards the first audio byte to Twilio's media stream. From these: LLM latency = llm_first_token - caller_utterance_end; TTS first-chunk latency = tts_first_chunk_received - tts_request_sent; network-to-Twilio = audio_first_byte_to_twilio - tts_first_chunk_received; total TTFB = audio_first_byte_to_twilio - caller_utterance_end. Log all five timestamps in every call event and compute percentiles weekly.

Why does TTFB vary so much between calls?

TTFB variance in voice AI has four main sources: LLM inference time variance (which is highly dependent on output length — longer responses take more time to generate the first token in some architectures, and inference cluster load affects latency significantly), ElevenLabs API load (which correlates with time of day and ElevenLabs' overall API traffic, peaking during US business hours), input-length sensitivity of TTS (longer text inputs require more processing before the first phoneme can be output), and network path variance (the routing between your server and ElevenLabs' API endpoints varies by milliseconds under normal conditions and by hundreds of milliseconds during BGP instability events). Of these, LLM inference variance and ElevenLabs API load are the two largest contributors to p95–p99 TTFB exceeding p50 by 3–5x.

What is the single highest-impact TTFB optimization for most deployments?

For deployments using a non-streaming LLM-to-TTS pipeline — where the full LLM response is collected before sending to ElevenLabs — switching to sentence-streaming architecture is typically the highest-impact single change. Instead of waiting for the complete LLM response, begin TTS synthesis on the first complete sentence as soon as the LLM generates the period. For a 300-character LLM response that consists of two sentences, the first sentence (approximately 120 characters) completes at roughly 40% of the total LLM inference time. Submitting this first sentence to ElevenLabs immediately means TTS synthesis begins 60% earlier than in the full-response architecture. The caller hears the first audio roughly 400–800ms sooner, which is a perceptible improvement in conversational flow. Sentence-streaming requires careful handling of sentence boundary detection in the LLM token stream and queue management for the second sentence's synthesis.


Ready to investigate your own calls?

Connect Sherlock to your voice providers in under 2 minutes. Free to start — 100 credits, no credit card.