Voice AI · 8 min read · by Jose M. Cobian · Fact-checked by The Sherlock Team

Voice AI Call Failure Rates: What's Normal, What's a Red Flag, and How to Measure Yours

Industry benchmarks for voice AI call failure rates across Twilio, ElevenLabs, and Vapi — plus how to measure your own failure rate correctly and what to do when it's too high.

TL;DR — The short answer

  1. Industry average call failure rate for production voice AI stacks is 3–8% depending on use case and provider mix — rates above 10% consistently indicate a systemic issue.

  2. Most teams measure failure rate incorrectly: they count 'failed' status calls but miss 'completed' calls with sub-5s duration and 'no-answer' timeouts that inflate the true failure count.

  3. The right denominator for failure rate calculation is all call attempts, including those that never connected — using only connected calls makes the rate look 20–40% lower than it actually is.

  4. ElevenLabs latency-related drops average 1–3% under normal load but spike to 8–15% during API load events, making model and region configuration the highest-leverage variable for TTS-specific failures.

What counts as a failed voice AI call — and what teams typically miss

The naive definition of a call failure is a call with status 'failed' in your telephony provider. This is the number most teams look at first — and it understates the real failure rate by a significant margin.
The complete failure taxonomy for a voice AI stack includes at least four categories that standard dashboards do not surface together. First: explicitly failed calls — the provider could not connect, the number was invalid, or the call was rejected. Second: connected-but-immediately-dropped calls — status shows 'completed' but duration is 0–5 seconds, before any meaningful conversation could occur. Third: no-answer timeouts — the call was initiated and rang but was not answered within the configured window. Fourth: TTS timeout drops — the call connected and the telephony provider shows a normal duration, but ElevenLabs latency exceeded the silence detection threshold and the call was dropped mid-conversation.
That fourth category is the most invisible and often the most expensive. The telephony provider bills the call as completed. The TTS provider logs a successful generation. No error code appears in either system. But the caller experienced a failure — silence, then a dropped call — with zero record of it anywhere.
A complete failure rate measurement requires pulling all four categories and summing them against total initiated calls. Teams that do this for the first time consistently discover their actual failure rate is 30–60% higher than what their dashboard was showing.
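The four-category taxonomy above can be sketched as a simple classifier. This is an illustrative sketch, not a Sherlock or Twilio API: the field names (`status`, `duration`) mirror Twilio's call resource but should be adapted to your provider, and the TTS-timeout flag must be derived separately by cross-referencing generation timestamps with the call event stream.

```python
from enum import Enum

class FailureCategory(Enum):
    EXPLICIT_FAILURE = "explicit_failure"    # provider could not connect / rejected
    CONNECTED_DROPPED = "connected_dropped"  # 'completed' but < 5s duration
    NO_ANSWER = "no_answer"                  # rang, never answered
    TTS_TIMEOUT_DROP = "tts_timeout_drop"    # TTS latency tripped the silence timeout
    SUCCESS = "success"

def classify_call(status: str, duration_s: int,
                  tts_timeout_suspected: bool = False) -> FailureCategory:
    """Map one call record onto the four failure categories.

    tts_timeout_suspected has to come from a separate cross-reference of
    TTS generation timestamps against the telephony call event stream.
    """
    if status in ("failed", "busy", "canceled"):
        return FailureCategory.EXPLICIT_FAILURE
    if status == "no-answer":
        return FailureCategory.NO_ANSWER
    if status == "completed" and duration_s < 5:
        return FailureCategory.CONNECTED_DROPPED
    if status == "completed" and tts_timeout_suspected:
        return FailureCategory.TTS_TIMEOUT_DROP
    return FailureCategory.SUCCESS
```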

Industry benchmarks by provider and use case

Benchmarks vary significantly by provider, use case, and whether the failure includes or excludes the invisible TTS timeout category. The figures below are drawn from production deployments across multiple voice AI stacks.
For Twilio outbound dialer campaigns, the industry average failure rate (explicit failures + no-answers, excluding TTS drops) is 2–5%. This is dominated by dispositioned numbers — invalid, disconnected, or DNC — and by carrier-level blocks on outbound traffic. A rate above 8% consistently indicates a list quality problem or a carrier routing issue rather than a configuration problem.
For Twilio inbound routing on a stable IVR or agent handoff setup, the expected failure rate is 0.5–2%. If you are seeing above 3% on inbound, the problem is almost always in the Twilio webhook delivery chain — your application server is not acknowledging Twilio events fast enough, causing Twilio to time out and reroute.
For ElevenLabs latency-related drops specifically — TTS generation taking long enough to trigger a silence timeout — the baseline under normal API load is 1–3%. During ElevenLabs API load events or scheduled maintenance windows, this can spike to 8–15%. If your TTS drop rate is chronically above 3% outside of load events, the cause is almost always text length variance in your agent responses (longer-than-expected inputs causing generation time to exceed silence thresholds) or model selection mismatch (using eleven_multilingual_v2 where eleven_turbo_v2_5 would reduce generation time by 40%).
For Vapi cold start failures — the initial connection and first-response latency when a new agent session spins up — the expected failure rate is 2–6% depending on the model tier and compute availability. Cold starts are most expensive in high-concurrency scenarios where many sessions are initiating simultaneously.
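The benchmark ranges above can be encoded as a quick self-check. The ranges come from this post; the "elevated" band (up to 1.6× the benchmark ceiling) is an illustrative assumption that roughly matches the red-flag cutoffs quoted above (8% for outbound, 3% for inbound), not a published threshold.

```python
# Benchmark ranges from this post, as (low_pct, high_pct) per scenario.
BENCHMARKS = {
    "twilio_outbound":      (2.0, 5.0),  # explicit failures + no-answers, excl. TTS drops
    "twilio_inbound":       (0.5, 2.0),
    "elevenlabs_tts_drops": (1.0, 3.0),  # under normal API load
    "vapi_cold_start":      (2.0, 6.0),
}

def assess(scenario: str, measured_pct: float) -> str:
    """Compare a measured failure rate against the benchmark range."""
    low, high = BENCHMARKS[scenario]
    if measured_pct <= high:
        return "within benchmark"
    if measured_pct <= high * 1.6:  # assumed "elevated" band
        return "elevated: investigate"
    return "red flag: likely systemic"
```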

How to calculate your real failure rate step by step

Step 1: Pull total initiated calls from Twilio (or your primary telephony provider) for the measurement window. This is your denominator. Use the API: GET /2010-04-01/Accounts/{AccountSid}/Calls.json?StartTime>{window_start} — do not filter by status at this stage; the denominator must include every initiated call regardless of outcome.
Step 2: Count explicit failures. Pull all calls with status 'failed', 'busy', 'no-answer', and 'canceled'. Sum these.
Step 3: Count connected-but-dropped calls. Pull all calls with status 'completed' and CallDuration < 5. These are calls that technically connected but terminated before any value exchange. Flag them as failures.
Step 4: Count TTS timeout drops. This requires cross-referencing ElevenLabs generation timestamps with your Twilio call event stream for the same call IDs. Calls where ElevenLabs generation_end timestamp is within 2 seconds of the Twilio call_end timestamp — and where the call ended without a normal termination event — are likely TTS timeout drops.
Step 5: Sum steps 2, 3, and 4. Divide by step 1 (total initiated). This is your real failure rate.
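The five steps above can be sketched as a single calculation. The record shape (`sid`, `status`, `duration` keys) is assumed to mirror Twilio's call resource, and `tts_drop_sids` is the set of call SIDs flagged in Step 4 by the timestamp cross-reference; adapt both to your actual export format.

```python
def real_failure_rate(calls: list[dict], tts_drop_sids: set[str]) -> float:
    """calls: one dict per initiated call in the measurement window.
    tts_drop_sids: call SIDs identified as TTS timeout drops (Step 4)."""
    total_initiated = len(calls)                                # Step 1: ALL attempts
    explicit = sum(c["status"] in ("failed", "busy", "no-answer", "canceled")
                   for c in calls)                              # Step 2
    dropped = sum(c["status"] == "completed" and int(c["duration"]) < 5
                  for c in calls)                               # Step 3
    tts_drops = sum(c["status"] == "completed"
                    and int(c["duration"]) >= 5                 # avoid double counting Step 3
                    and c["sid"] in tts_drop_sids
                    for c in calls)                             # Step 4
    return (explicit + dropped + tts_drops) / total_initiated   # Step 5
```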
For most teams doing this calculation for the first time, step 3 and step 4 together add 30–60% to the number they had from step 2 alone. That difference is not a rounding error — it is real failures that were invisible in the standard dashboard view.

The 3 failure types that inflate your rate artificially

Not every call in your 'failure' bucket represents a preventable problem worth debugging. Three categories inflate the measured rate without representing actionable issues.
List-quality failures account for 40–60% of outbound dialer failure rates in most deployments. These are calls to disconnected numbers, wrong numbers, and numbers registered on a DNC list. They appear in your failure rate as 'failed' or 'no-answer' calls, but they are data quality problems, not stack problems. Excluding calls to numbers flagged as invalid within 30 days of a prior failed attempt removes this category from your operational failure rate without hiding real problems.
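Excluding list-quality failures from the operational rate can be done with a date-windowed filter. This is a sketch under assumed data shapes: `invalid_flagged` maps a phone number to the datetime it was last flagged invalid, and each call record carries `to` and `start_time` fields.

```python
from datetime import datetime, timedelta

def operational_calls(calls: list[dict], invalid_flagged: dict) -> list[dict]:
    """Drop calls to numbers flagged invalid within the last 30 days,
    so list-quality failures do not pollute the operational failure rate."""
    cutoff = timedelta(days=30)
    kept = []
    for c in calls:
        flagged = invalid_flagged.get(c["to"])
        if flagged is not None and c["start_time"] - flagged < cutoff:
            continue  # data-quality failure, not a stack failure
        kept.append(c)
    return kept
```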
User-abandoned short calls look like failures on duration metrics but may be intentional. A call that lasts 8 seconds and ends normally may represent a caller who got the information they needed and hung up — not a failure. Set your 'connected-but-dropped' threshold at 5 seconds rather than 30 to avoid misclassifying intentional short calls as failures.
Carrier-level intermittent blocks produce spiky failure rates that do not correlate with any configuration change you made. If your failure rate doubles for 4 hours and then returns to baseline, the cause is almost always a carrier-level routing issue, not a Twilio configuration problem. Identify these events by timestamp correlation — a sudden spike with no corresponding deployment event is a carrier signal.

What to do when your failure rate is too high — triage order

When your failure rate exceeds the benchmark for your use case, the triage order matters. Starting with the wrong layer wastes days on non-causes.
First, check whether the spike is correlated with a deployment. If yes, the cause is almost certainly in your application layer — a webhook handler that started returning 5xx responses, a new agent configuration with an untested edge case, or a Twilio credential rotation that broke authentication.
Second, check the provider status pages. ElevenLabs API incidents cause TTS drop spikes across all customers simultaneously. Twilio infrastructure issues cause carrier-level failures across regions. If your failure rate spike matches a published incident on either provider's status page, the fix is to wait for the incident to resolve.
Third, check your Twilio webhook response times. Twilio's default timeout for your application webhook response is 15 seconds. If your server is slow — particularly if you are doing synchronous database writes or external API calls before responding — Twilio will time out and reroute the call. Your application thinks it handled the call; Twilio has already moved on.
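One way to keep webhook responses well under the timeout is to acknowledge immediately and push all slow work (database writes, external API calls) onto a background worker. A stdlib-only sketch of the pattern — the handler name and payload shape are hypothetical, and in production you would use your web framework's request object rather than a bare dict:

```python
from queue import Queue
from threading import Thread

work_queue: Queue = Queue()
processed: list[dict] = []

def worker() -> None:
    # Slow work happens here, AFTER Twilio has already received its response.
    while True:
        payload = work_queue.get()
        processed.append(payload)  # stand-in for DB writes / external calls
        work_queue.task_done()

Thread(target=worker, daemon=True).start()

def handle_twilio_webhook(form: dict) -> str:
    """Return TwiML immediately; defer everything slow to the worker.
    The webhook response must never block on I/O."""
    work_queue.put(form)
    return ('<?xml version="1.0" encoding="UTF-8"?>'
            "<Response><Say>One moment.</Say></Response>")
```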
Fourth, check your ElevenLabs response length distribution. Pull the character count of the last 1,000 TTS generations from the ElevenLabs history API. If you see a long tail — calls generating 500+ characters when the average is 120 — your LLM is occasionally producing verbose responses that cause TTS generation time to exceed your silence threshold. Add a response length cap to the system prompt.
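Once the character counts are pulled, spotting the long tail is a few lines. This sketch assumes you have already extracted the per-generation character counts into a plain list; the 500-character threshold is the one used above.

```python
import statistics

def response_length_report(char_counts: list[int]) -> dict:
    """Summarize TTS input lengths to expose a verbose-response long tail."""
    ordered = sorted(char_counts)
    return {
        "mean": statistics.mean(ordered),
        "p95": ordered[int(0.95 * len(ordered))],
        "pct_over_500": 100 * sum(n > 500 for n in ordered) / len(ordered),
    }
```

A mean near 120 with a p95 in the hundreds is the signature described above: occasional verbose LLM responses blowing past the silence threshold.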
Fifth — only after the above — check your Twilio call flow configuration. Most production failures are in the webhook response or TTS layer, not in the call flow itself. Rebuilding a Twilio Studio flow when the real problem is a 2,000-character LLM response is a common and expensive distraction.

How Sherlock tracks failure rate across providers automatically

The manual failure rate calculation described above — pulling logs from Twilio, cross-referencing with ElevenLabs generation timestamps, classifying TTS timeout drops — takes 3–4 hours per measurement window when done by hand. Done weekly, that is 12–16 hours per month that does not improve the stack; it just measures it.
Sherlock calculates your real failure rate — including the connected-but-dropped and TTS timeout categories that standard dashboards miss — automatically across your connected providers. When a pattern exceeds your configured threshold, it posts the case file directly in Slack: failure count, rate, the provider breakdown, and the failure signatures that indicate where the root cause sits.
The free tier (100 credits per Slack workspace) includes failure rate monitoring across all connected providers. If you want to check your current rate against the benchmarks in this post, connect Twilio and ElevenLabs at [usesherlock.ai](https://usesherlock.ai/?utm_source=blog&utm_medium=content&utm_campaign=failure-rate-benchmarks) — the first analysis runs in under 60 seconds.


Frequently asked questions

What counts as a voice AI call failure?

A voice AI call failure is any call attempt that did not deliver the intended value to the caller. This includes calls with status 'failed' or 'busy' in your telephony provider, but also 'completed' calls with 0s or sub-5s duration (connected but immediately dropped), 'no-answer' timeouts, and calls where the TTS engine generated a response but the audio never reached the caller due to a timeout. Most teams count only the explicitly failed status, which understates their true failure rate by 20–40%.

Is a 5% failure rate bad for voice AI?

It depends on use case and stack. For outbound dialer campaigns on Twilio, a 5% failure rate is within the normal range (industry average 2–5%). For inbound routing on a well-configured stack, 5% is elevated — the benchmark is 0.5–2%. For ElevenLabs TTS-related drops, a 5% rate indicates a configuration problem worth investigating immediately, typically text length variance or model selection mismatch. Context matters: compare your rate to the right benchmark for your specific use case.

How do I compare my failure rate to industry benchmarks?

The only valid comparison is to use the same denominator. Industry benchmarks typically use total call attempts (including no-answers and unanswered outbounds) as the denominator. If you are using only connected calls as your denominator, your rate will look lower than it is and will not match published benchmarks. Recalculate using (failed + completed-under-5s + no-answer) / (total initiated) to get a comparable figure.


Ready to investigate your own calls?

Connect Sherlock to your voice providers in under 2 minutes. Free to start — 100 credits, no credit card.