The dashboard that tells you nothing useful
Most voice AI operations teams have dashboards. They show call volume, average call duration, and a success/failure rate calculated by whoever built the integration and defined by whatever criteria made sense at the time. These metrics have one thing in common: they are all aggregates.
Aggregates are the observational equivalent of noting that a crime scene has furniture and a rug. Technically accurate. Contributing nothing toward a conclusion. An aggregate failure rate of 4% does not tell you whether you have a Twilio problem, an ElevenLabs problem, a CRM problem, or a configuration problem. It tells you that 96% of calls met some minimum threshold and 4% did not — information you could have inferred from the fact that your product is not perfect.
The problem is not that these metrics are wrong. The problem is that they are insufficient for any operational decision that matters. You cannot use an aggregate failure rate to decide which provider to escalate to, which configuration to roll back, or which calling window to investigate. For that, you need the five metrics below.
Metrics 1 and 2: Per-provider failure rate and TTS latency by configuration
Per-provider failure rate breaks your aggregate into its components. If your aggregate failure rate is 8% and the breakdown is 2% Twilio, 5% ElevenLabs, and 1% CRM, the investigation begins at ElevenLabs — not at your architecture. If Twilio's per-provider rate is 6% and everyone else is below 1%, the conversation with your account manager starts this afternoon, not next sprint.
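As a sketch of that breakdown, assuming each call record carries a `provider` field and a boolean `failed` flag (hypothetical names; substitute your own schema):

```python
from collections import Counter

def per_provider_failure_rates(calls):
    """Split an aggregate failure rate into per-provider components."""
    totals, failures = Counter(), Counter()
    for call in calls:
        totals[call["provider"]] += 1
        if call["failed"]:
            failures[call["provider"]] += 1
    return {p: failures[p] / totals[p] for p in totals}

# Illustrative data: 100 calls, 6 failures in aggregate (6%).
calls = (
    [{"provider": "twilio", "failed": f} for f in [True] + [False] * 49]
    + [{"provider": "elevenlabs", "failed": f} for f in [True] * 5 + [False] * 45]
)
rates = per_provider_failure_rates(calls)
# twilio at 2% and elevenlabs at 10%: the aggregate hides which
# provider to escalate to; the breakdown does not.
```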
The reason most teams use aggregate rates is that per-provider attribution requires correlating call IDs across multiple systems — a step that is technically non-trivial without dedicated tooling. But the effort is worth it: per-provider failure rates are the single fastest path from 'calls are failing' to 'here is who to call about it.'
Average TTS latency by agent configuration is the second metric. One configuration that is 300ms slower than the others will not move your aggregate latency by a meaningful amount — but it will erode conversion on every call it handles, and it will generate unexplained dropout events that look like TTS failures when they are actually threshold violations. Latency by configuration, not latency in aggregate, is what reveals the mis-configured agent before it creates a customer complaint.
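A minimal way to surface the slow configuration, assuming you can export `(config_id, latency_ms)` pairs from your call events (a hypothetical shape):

```python
from collections import defaultdict
from statistics import mean

def tts_latency_by_config(samples):
    """Average TTS latency per agent configuration, not in aggregate."""
    by_config = defaultdict(list)
    for config_id, latency_ms in samples:
        by_config[config_id].append(latency_ms)
    return {config: mean(values) for config, values in by_config.items()}

samples = [("agent-a", 320), ("agent-a", 340), ("agent-b", 650), ("agent-b", 610)]
averages = tts_latency_by_config(samples)
# agent-b averages 630ms against agent-a's 330ms; the combined
# average (480ms) would have hidden the mis-configured agent.
```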
Metrics 3, 4, and 5: The ones that predict revenue and quality
Cost per converted call — not cost per call — is metric three. The distinction is critical. Cost per call optimisation rewards cutting spend regardless of outcome. Cost per converted call forces you to evaluate cost and conversion simultaneously. An agent configuration that costs €0.80 per call and converts at 22% has a cost per converted call of €3.64. One that costs €0.25 and converts at 6% has a cost per converted call of €4.17. The cheaper configuration is actually more expensive at the conversion level, and cost-per-call optimisation would send you in the wrong direction.
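The arithmetic is a single division; the first configuration above works out like this:

```python
def cost_per_converted_call(cost_per_call, conversion_rate):
    """What one successful outcome actually costs."""
    if conversion_rate <= 0:
        raise ValueError("conversion_rate must be positive")
    return cost_per_call / conversion_rate

# EUR 0.80 per call at a 22% conversion rate.
per_conversion = round(cost_per_converted_call(0.80, 0.22), 2)  # 3.64
```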
Cross-provider event correlation rate measures how reliably your team can reconstruct the full sequence of events on any given call. If you can correlate 100% of calls across all providers, you have complete investigative coverage. If your correlation rate is 70%, then 30% of failures are invisible to investigation — you know something went wrong but not what, which means you cannot prevent recurrence. Any correlation rate below 100% represents an investigation blind spot.
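One way to compute the rate, assuming a mapping from call ID to the set of providers whose events were successfully linked to that call (a hypothetical shape):

```python
def correlation_rate(calls, providers):
    """Share of calls whose events were matched across every provider."""
    required = set(providers)
    correlated = sum(1 for linked in calls.values() if required <= linked)
    return correlated / len(calls)

calls = {
    "call-1": {"twilio", "elevenlabs", "crm"},
    "call-2": {"twilio", "elevenlabs", "crm"},
    "call-3": {"twilio"},  # TTS and CRM events never linked: a blind spot
}
rate = correlation_rate(calls, ["twilio", "elevenlabs", "crm"])
# rate is 2/3: one call in three cannot be investigated end to end.
```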
Transcript quality score is the fifth metric and the earliest leading indicator of conversion problems. Quality degradation shows up in transcripts weeks before it appears in conversion numbers — by the time the sales report shows a problem, the transcript data has been documenting it for 3–4 weeks. Monitoring transcript quality proactively is the difference between catching a prompt regression before it costs you a month of pipeline and discovering it in the quarterly review.
What these five metrics reveal in the first week
Teams that instrument all five metrics for the first time consistently make three discoveries within seven days. The first: a provider that is significantly underperforming relative to its cost, usually concealed by being averaged with a well-performing provider in the aggregate. This provider is almost always ElevenLabs in cases involving latency, and almost always the CRM integration in cases involving completion rates.
The second discovery: an agent configuration that costs disproportionately per call without delivering proportional conversion. Usually a verbose system prompt generating 400+ character responses on questions that could be answered in 60, or an expensive model tier being used for a call type where a faster, cheaper tier would perform identically. These configurations are typically months old and were never revisited after initial deployment.
The third discovery: a latency pattern that explains a conversion rate decline the team had attributed to pricing, seasonality, or lead quality. In every case, the path from measurement to fix takes less than an afternoon, because the measurement has made the problem legible for the first time. The problem was always there. The metrics simply made it visible.
Frequently asked questions
What metrics should I use to monitor voice AI performance?
The five metrics that actually predict voice AI operational health are: per-provider failure rate (not aggregate), average TTS latency by agent configuration, cost per converted call (not cost per call), cross-provider event correlation rate, and transcript quality score. Aggregate call volume and aggregate success rate are useful for capacity planning but useless for diagnosing what is going wrong.
What is a good TTS latency benchmark for production voice AI?
A production-ready TTS latency target for real-time voice AI on telephony is under 500ms for 95% of requests. p99 latency should remain below 800ms. Above 800ms, you begin approaching the human perception threshold for awkward pauses. Above 1,200ms, a significant percentage of callers disengage or assume the call has failed.
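As a sketch, these thresholds can be checked with a nearest-rank percentile over a window of latency samples (the sample values below are illustrative):

```python
import math

def latency_percentile(latencies_ms, pct):
    """Nearest-rank percentile: smallest sample covering pct% of the window."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

latencies = [280, 310, 330, 350, 360, 370, 400, 430, 460, 490]
p95 = latency_percentile(latencies, 95)  # 490ms, under the 500ms target
p99 = latency_percentile(latencies, 99)  # 490ms, under the 800ms target
```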
How do I calculate transcript quality score for voice AI?
Transcript quality score measures whether your AI agent's responses addressed the caller's actual intent. A simple implementation: after each call, run the transcript through an LLM with a rubric — did the agent answer the question asked? Did it stay on topic? Did it hallucinate? Score 0–100 per call, average across a time window. More sophisticated implementations use intent classification models trained on your specific call types.
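A sketch of the simple implementation, with the per-criterion LLM judgments stubbed as booleans so the scoring arithmetic is visible (the rubric names are illustrative, not a fixed standard):

```python
RUBRIC = ("answered_question", "stayed_on_topic", "no_hallucination")

def transcript_quality_score(judgments):
    """Score one call 0-100 from per-criterion pass/fail judgments.

    In practice each judgment would come from an LLM prompted with the
    transcript and one rubric question; here they are supplied directly.
    """
    return 100 * sum(judgments[criterion] for criterion in RUBRIC) / len(RUBRIC)

def rolling_quality(window):
    """Average score across a time window of calls."""
    scores = [transcript_quality_score(judgments) for judgments in window]
    return sum(scores) / len(scores)

window = [
    {"answered_question": True, "stayed_on_topic": True, "no_hallucination": True},
    {"answered_question": True, "stayed_on_topic": False, "no_hallucination": True},
]
# rolling_quality(window) is roughly 83.3: one perfect call, one that
# drifted off topic, averaged over the window.
```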
Ready to investigate your own calls?
Connect Sherlock to your voice providers in under 2 minutes. Free to start — 100 credits, no credit card.