Voice AI · 8 min read · By Jose M. Cobian · Fact-checked by The Sherlock Team

The Voice AI Stack in 2025: Twilio, ElevenLabs, Vapi, Retell — A Year in Review

2025 was the year voice AI moved from proof-of-concept to production operations. Here is what the year taught operators about running these systems at scale.

TL;DR — The short answer

  1. 2025 was the year voice AI moved from pilot to production at scale — and the operational challenges were almost entirely about the seams between providers, not the providers themselves.

  2. The dominant production stack combined Twilio for telephony, ElevenLabs for TTS, and Vapi or Retell for orchestration — each with distinct operational characteristics teams learned through costly trial and error.

  3. The single most important operational lesson from 2025: integration quality between providers matters more than individual provider quality.

  4. Teams entering 2026 without genuine investigative capability — the ability to ask what happened on any call and get a sourced answer in seconds — are carrying the same operational debt that defined 2025 for the teams that fell behind.

The year production reality arrived

At the start of 2025, most voice AI was running in controlled pilots. Developer environments. Limited beta deployments. Demos carefully designed to showcase what worked and to avoid the scenarios that did not. The narrative was still about the AI model — which LLM was most natural, which TTS voice was most convincing, which orchestration platform had the slickest developer experience.
By Q2, the first cohort of companies had voice AI handling real call volume at meaningful scale: appointment scheduling for medical practices, customer support for e-commerce operations, sales qualification for SaaS companies. What they discovered was not what the documentation had prepared them for. The AI model performed well. The telephony performed well. The TTS performed well. But the system — the specific combination of these components under real-world load, with real-world user behaviour and real-world network conditions — performed unpredictably.
2025 was the year operators stopped asking 'which AI sounds most human?' and started asking 'why did this call fail and how do I prevent it from happening again?' The shift from product evaluation to operational management was the defining transition of the year. The teams that made it early built operational advantages that compounded through the year. The teams that made it late spent most of 2025 firefighting.

What Twilio and ElevenLabs taught operators this year

Twilio taught operators the importance of call lifecycle precision. The moment Twilio reports a call as 'completed' is not necessarily the moment the downstream system considers it complete. The gap between Twilio's completion event and the CRM's confirmation — typically 200–800ms, but stretching to 2–3 seconds under load — is where a surprising number of CRM duplicate entries, billing discrepancies, and analytics inconsistencies originate. Teams that handled this gap explicitly, with idempotent writes and confirmed-receipt patterns, had significantly cleaner operational data than those that assumed Twilio's completion event was final.
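The idempotent-write, confirmed-receipt pattern can be sketched as follows. This is an illustrative sketch, not a definitive implementation: `crm_write` stands in for whatever downstream system records the call, and the in-memory dict would be Redis or a database table in production so that state survives process restarts.

```python
import time


class CallCompletionHandler:
    """Idempotent handler for Twilio call-completion events.

    Hypothetical sketch: `crm_write(call_sid, payload)` represents the
    downstream CRM write and must return True only on confirmed receipt.
    """

    def __init__(self, crm_write):
        self.crm_write = crm_write
        self._confirmed = {}  # call_sid -> timestamp of confirmed CRM write

    def handle_completed(self, call_sid, payload):
        # The completion event may arrive more than once, and the CRM's
        # confirmation can lag it by 200ms-3s; keying on the call SID makes
        # a duplicate delivery a no-op instead of a duplicate record.
        if call_sid in self._confirmed:
            return "duplicate-ignored"
        if not self.crm_write(call_sid, payload):
            # No confirmed receipt: leave the SID unmarked so a later
            # retry can attempt the write again.
            return "write-failed-will-retry"
        self._confirmed[call_sid] = time.time()
        return "written"
```

The key design choice is that only a *confirmed* write marks the SID as done — treating the provider's completion event itself as final is exactly the assumption that produced the duplicates.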
ElevenLabs taught operators about character budget management — the hard way, in most cases. Running out of TTS characters mid-month is not theoretical. It happened to multiple production deployments across 2025. The failure mode was particularly insidious: ElevenLabs returns a 422 error when the character limit is exhausted, but that error often gets swallowed by the voice AI orchestration layer and manifests as a silent call failure rather than an explicit budget alert. Several teams discovered they had been out of budget for days before noticing. The fix is straightforward — character usage monitoring with alerts at 70% and 90% of the monthly limit — but it requires knowing the failure mode exists before building the monitoring.
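The monitoring itself is simple once the failure mode is known. A minimal sketch of the threshold check, assuming usage figures are polled from the provider's usage endpoint and `alert` is a pager or Slack hook in production:

```python
def check_character_budget(used, monthly_limit, alert=print):
    """Alert at 70% and 90% of the monthly TTS character limit.

    Hypothetical sketch: `used` and `monthly_limit` would come from the
    TTS provider's usage reporting; thresholds mirror the 70%/90% alerts
    described above.
    """
    pct = used / monthly_limit * 100
    if pct >= 90:
        alert(f"CRITICAL: {pct:.0f}% of monthly character budget consumed")
        return "critical"
    if pct >= 70:
        alert(f"WARNING: {pct:.0f}% of monthly character budget consumed")
        return "warning"
    return "ok"
```

Run on a schedule (every few minutes is plenty), this turns a silent mid-month exhaustion into an explicit alert days before the budget runs out.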

What Vapi and Retell taught operators about orchestration at scale

Vapi and Retell taught operators about the difference between demo-scale and production-scale conversation state management. Both platforms shine in demos and development: fast to set up, excellent developer experience, solid documentation. Both platforms revealed edge cases in production that are absent from the documentation and visible only under concurrent load.
The most common Vapi production issue in 2025 was webhook timeout management — specifically, the interaction between Vapi's webhook retry logic and idempotency handling in downstream systems. A CRM write that times out and triggers a retry can create duplicate records if the CRM endpoint is not idempotent. This was almost universally a downstream implementation issue, not a Vapi bug, but the failure manifested in Vapi's logs in ways that initially pointed in the wrong direction.
Retell's most common production issue was audio quality degradation under high concurrent load, manifesting as codec negotiation failures that appeared sporadically at normal load but predictably at 50+ concurrent calls. The diagnosis required correlating Retell session logs with Twilio call quality metrics — a cross-provider correlation that most teams did not have tooling for and therefore took days to trace to root cause.
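The correlation step that took teams days can be sketched in a few lines. This is a hypothetical illustration with made-up record shapes: real pipelines would join on a shared call SID where one exists, falling back to matching on destination number within a time window as shown here.

```python
from datetime import datetime, timedelta


def correlate(orchestrator_sessions, telephony_calls, window_s=5):
    """Join orchestrator session logs to telephony quality metrics.

    Hypothetical record shapes: each record carries a destination number
    (`to`) and a `started_at` datetime; telephony records also carry a
    quality score (`mos`). Records match when the numbers agree and the
    start times fall within `window_s` seconds of each other.
    """
    matches = []
    for session in orchestrator_sessions:
        for call in telephony_calls:
            same_number = session["to"] == call["to"]
            drift = abs((session["started_at"] - call["started_at"]).total_seconds())
            if same_number and drift <= window_s:
                matches.append({**session, "mos": call["mos"]})
    return matches
```

With a join like this in place, "codec failures appear sporadically" becomes "codec failures cluster on calls whose quality score dropped at 50+ concurrency" — the difference between days of tracing and a single query.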

The operational lesson that defined the year

If there is a single sentence that captures what 2025 taught voice AI operators, it is this: the quality of the integration between providers matters more than the quality of any individual provider.
This is not obvious before you have run a production voice AI system. The marketing around AI voice platforms is almost entirely about the AI: the naturalness of the conversation, the latency of the response, the breadth of language support. These are real differentiators at the evaluation stage. But at the operational stage — when you are running thousands of calls per day and need to maintain 97%+ success rates — the AI performance is almost never the variable. The integration seams are.
Teams that built cross-provider observability early in 2025 had a structural advantage: when something went wrong, they knew where it went wrong, which meant they could fix it quickly and prevent recurrence. Teams that relied on individual-provider monitoring spent disproportionate time on incidents that required correlation across providers to diagnose — the ElevenLabs latency that was caused by a Twilio silence configuration, the CRM failure that was triggered by a Vapi retry pattern.

Why 2026 must look different

The voice AI teams that will perform well in 2026 are the ones entering the year with genuine investigative capability — not dashboards, but the ability to ask in plain English what happened on any given call and receive a sourced answer in seconds. The ability to catch cost overruns before the invoice arrives. The ability to identify quality degradation before a customer reports it. The ability to onboard new providers — LiveKit, Daily.co, the next ElevenLabs competitor — without rebuilding the observability layer from scratch.
The AI models have matured faster than the operations tooling. That is why so much of 2025's engineering time went to debugging rather than building — because debugging without the right tools is labour-intensive work, regardless of how good the AI is. 2026 is the year that gap closes, driven by the cohort of teams that made the operational investment in 2025 and demonstrated what becomes possible when investigation is fast and root cause is always documented.
The case has been building all year. The evidence is now overwhelming.

Explore Sherlock for your voice stack

Frequently asked questions

What was the dominant voice AI stack in 2025?

The most common production voice AI stack in 2025 combined Twilio for telephony, ElevenLabs for text-to-speech, and either Vapi or Retell for AI voice orchestration. Twilio remained the default telephony choice due to its extensive geographic coverage and programmable flexibility. ElevenLabs dominated TTS due to voice quality. Vapi and Retell split the orchestration market roughly by use case — Vapi for developer-heavy teams, Retell for teams wanting more out-of-the-box configuration.

What were the biggest operational lessons from voice AI deployments in 2025?

The three biggest operational lessons: (1) the quality of integration between providers matters more than the quality of any individual provider — most production failures were integration failures, not provider failures; (2) ElevenLabs character budget management needs to be an active operational practice, not a set-and-forget configuration; (3) conversation state management under concurrent load requires explicit architectural attention — in-memory state that works in development fails predictably at production scale.

What did Vapi and Retell teach operators about conversation state in 2025?

Both platforms revealed that stateless AI voice agent configurations handle happy-path scenarios gracefully and handle production edge cases poorly when session state is not explicitly managed. The specific failure mode: concurrent calls sharing server processes where session context is stored in application memory rather than a persistent, session-keyed store. Both platforms now offer explicit session state management options, but teams that relied on default configurations discovered this failure mode through production incidents rather than documentation.
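The architectural fix can be sketched briefly. This is an illustrative sketch using SQLite as the persistent store; any session-keyed, out-of-process store (Redis, Postgres) serves the same purpose — the point is that concurrent calls served by different processes all read the same state for the same session id, which in-memory state cannot guarantee.

```python
import json
import sqlite3


class SessionStore:
    """Conversation state keyed by session id in a persistent store.

    Hypothetical sketch: in production the database path would point at
    shared storage so every worker process sees the same sessions.
    """

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS sessions (id TEXT PRIMARY KEY, state TEXT)"
        )

    def save(self, session_id, state):
        # INSERT OR REPLACE makes repeated saves for the same session safe.
        self.db.execute(
            "INSERT OR REPLACE INTO sessions VALUES (?, ?)",
            (session_id, json.dumps(state)),
        )
        self.db.commit()

    def load(self, session_id):
        row = self.db.execute(
            "SELECT state FROM sessions WHERE id = ?", (session_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None
```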


Ready to investigate your own calls?

Connect Sherlock to your voice providers in under 2 minutes. Free to start — 100 credits, no credit card.