Tutorials · 9 min read · by Jose M. Cobian · Fact-checked by The Sherlock Team

How to Cut Voice AI Costs by 40% (2026 Guide)

Voice AI costs spiral fast across ElevenLabs, Vapi, and Twilio. This practical guide covers the 5 biggest cost traps and exactly how to fix each one.

TL;DR — The short answer

  1. The average voice AI deployment wastes 30–40% of its budget on five fixable cost traps: uncapped response length, wrong model tier, failed-call billing, missing TTS caching, and provider-layer blindness.
  2. Switching from cost-per-call to cost-per-converted-call as your primary metric exposes expensive, low-converting configurations within the first 30 days — often saving more than any provider renegotiation.
  3. Sherlock Calls surfaces per-call cost breakdowns across Twilio, ElevenLabs, Vapi, and Retell automatically — replacing the manual CSV correlation that most teams spend 2–3 hours on per investigation.

Why voice AI costs spiral — and why the invoice never tells you where

Voice AI is not one cost. It is four or five costs stacked on top of each other, billed by different providers, in different units, on different billing cycles. Twilio charges per minute. ElevenLabs charges per character. Your LLM provider charges per token. Your orchestration platform — Vapi, Retell, or a custom pipeline — charges per minute on top of everything else. None of these invoices reference each other.
The result is that your total cost per call is invisible in any single dashboard. A Vapi deployment advertised at $0.05/min actually costs $0.18–$0.33/min once you include Twilio telephony ($0.01–$0.02/min), the LLM layer ($0.006–$0.06/min depending on model), and ElevenLabs TTS ($0.04–$0.12/min depending on character volume and model). A Retell deployment starts at $0.07/min but reaches $0.13–$0.31/min in production with all components.
Most teams discover the true per-call cost only when the monthly invoices arrive — and by then, the overspend has been compounding for weeks. The five cost traps described below account for the majority of avoidable voice AI spend. Each is fixable with a configuration change, not a provider migration.

Cost trap 1: uncapped agent response length is your biggest hidden expense

ElevenLabs bills per character. Every character your AI agent speaks is a character you pay for. This means your system prompt — specifically, how it constrains response length — is a direct cost control lever.
A concise agent response of 80 characters costs roughly $0.024 at ElevenLabs' Scale tier. A verbose response of 340 characters — which LLMs produce by default when not constrained — costs $0.102. That is a 4.25x cost difference per turn for the TTS layer alone. Over a 4-turn conversation, the verbose agent costs $0.408 in TTS versus $0.096 for the constrained agent. Multiply by 1,000 calls per week and you are looking at $312/week in avoidable TTS spend — roughly $1,250/month from a single prompt configuration.
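The arithmetic above can be sketched as a quick estimator. The ~$0.30 per 1,000 characters Scale-tier rate and the turn and call counts are taken from the example; treat them as illustrative assumptions to swap for your own numbers:

```python
# Sketch: estimating weekly TTS spend at ElevenLabs' Scale tier
# (~$0.30 per 1,000 characters, an assumed rate for illustration).
SCALE_TIER_PER_CHAR = 0.30 / 1000  # ~$0.0003 per character

def tts_cost(chars_per_turn: int, turns: int, calls: int) -> float:
    """Total TTS cost in dollars for `calls` conversations of `turns` turns each."""
    return chars_per_turn * turns * calls * SCALE_TIER_PER_CHAR

concise = tts_cost(80, turns=4, calls=1000)   # constrained agent: $96.00/week
verbose = tts_cost(340, turns=4, calls=1000)  # unconstrained default: $408.00/week
print(f"avoidable TTS spend: ${verbose - concise:.2f}/week")  # $312.00/week
```

Re-running the function with your own average response length and call volume gives the payoff of any prompt-length change before you ship it.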
The fix: Add an explicit character or sentence limit to your agent's system prompt. A directive such as "Keep every response under 2 sentences and 120 characters; if the answer requires more detail, ask the caller if they want you to elaborate" cuts TTS cost per turn by 50–70% without degrading caller experience — callers on the phone prefer short, direct answers.
Also audit your LLM's tendency to generate disclaimers, pleasantries, and filler phrases. Phrases like "That's a great question!" or "I'd be happy to help you with that" cost characters and add nothing to the caller's experience. Strip them in the system prompt.

Cost trap 2: using the wrong model tier for the wrong job

ElevenLabs offers multiple model tiers with different latency and quality profiles — and they all cost the same in credits. Flash v2.5 generates audio in under 75ms. Turbo v2.5 generates at around 300ms with richer emotional depth. Multilingual v2, the highest-quality option, can take 400–700ms under normal load and over 1,000ms under peak API load.
The cost trap is not the credit price — it is the indirect cost of latency-induced failures. When ElevenLabs takes 900ms to generate audio on a call with Twilio's default 5-second silence timeout, you lose 900ms of the caller's patience budget. If the caller paused for 2 seconds, you are at 2,900ms — dangerously close to a timeout that drops the call. The dropped call costs you the full telephony minute, the full TTS character charge, and the conversion opportunity. You paid for everything and got nothing.
The fix: Use Flash v2.5 as your default for real-time conversational agents. Reserve Turbo v2.5 for outbound calls with longer monologues where quality matters more than speed. Never use Multilingual v2 in latency-sensitive call flows unless you have widened Twilio's silence timeout to 8–10 seconds and accepted the caller experience trade-off.
On the LLM side, the same principle applies. If your orchestration platform lets you choose the LLM, use the smallest model that meets your accuracy requirements. Claude 3.5 Haiku or GPT-4o Mini at $0.006/min is 10x cheaper than Claude Opus or GPT-4o at $0.06/min — and for structured conversational flows with clear intents, the quality difference is negligible.

Cost trap 3: paying full price for failed calls

Every voice AI provider bills for usage, not for outcomes. Twilio bills for the call minute whether or not your agent spoke. ElevenLabs bills for characters generated whether or not the audio reached the caller's ear. Vapi bills its platform fee whether or not the conversation converted.
This means failed calls cost nearly as much as successful ones — but deliver zero value. If your deployment has a 5% silent failure rate (calls that completed from the provider's perspective but failed from the caller's perspective), you are burning 5% of your entire voice AI budget on dead weight. At $5,000/month total spend, that is $250/month on calls where the caller heard silence, got dropped early, or experienced a broken handoff.
The insidious part: because no individual provider logs these as failures, the waste is invisible in every dashboard. Twilio shows "completed." ElevenLabs shows "success." Your budget shows "on track." Only cross-provider correlation — matching Twilio call duration against ElevenLabs generation timestamps — reveals which "successful" calls were actually failures.
The fix: Implement a daily count-compare check. Pull total calls from Twilio, total TTS generations from ElevenLabs, and total converted outcomes from your CRM. If calls significantly exceed conversions beyond your expected drop-off rate, you have a silent failure problem inflating your costs. Sherlock Calls runs this correlation automatically and posts a cost-impact estimate in Slack when the gap exceeds your configured threshold.
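A rough sketch of that count-compare check might look like the following. The counts would come from the Twilio and ElevenLabs APIs and your CRM; here they are plain inputs, and the 20% expected drop-off threshold is an assumed placeholder to tune for your funnel:

```python
# Daily count-compare check: flag a likely silent-failure problem when calls
# without a matching TTS generation, or the call-to-conversion gap, exceed
# what your normal drop-off rate would predict.
def silent_failure_check(twilio_calls: int, tts_generations: int,
                         conversions: int, expected_dropoff: float = 0.20):
    """Return a list of alert strings; empty means the counts look healthy."""
    missing_tts = twilio_calls - tts_generations  # calls with no audio generated
    expected_conversions = twilio_calls * (1 - expected_dropoff)
    alerts = []
    if missing_tts > 0:
        alerts.append(f"{missing_tts} calls produced no TTS audio")
    if conversions < expected_conversions:
        alerts.append(f"conversions ({conversions}) below expected "
                      f"({expected_conversions:.0f})")
    return alerts

# 1,000 calls, only 960 TTS generations, 700 conversions -> both alerts fire.
print(silent_failure_check(twilio_calls=1000, tts_generations=960, conversions=700))
```

Wiring the output into a Slack webhook or a daily cron job turns this from a one-off audit into the standing check the section describes.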

Cost trap 4: missing TTS caching for repetitive content

If your voice AI agent says "Thank you for calling Acme Corp, how can I help you today?" at the start of every call, you are paying ElevenLabs to generate the same 58 characters thousands of times. At scale, this is one of the simplest and largest cost reductions available.
The math: A greeting of 58 characters, spoken on 1,000 calls per week, consumes 58,000 characters weekly — roughly 232,000 characters per month. At ElevenLabs' Scale tier (~$0.30 per 1,000 characters), that is $69.60/month for a single sentence that sounds identical every time. Cache the audio file and serve it directly from your CDN or media server: the ElevenLabs cost for that sentence drops to a one-time $0.02.
Caching applies to any content that is identical or nearly identical across calls:
  • Greetings and closings — "Thank you for calling," "Is there anything else I can help with?"
  • Legal disclosures — "This call may be recorded for quality assurance purposes."
  • Hold messages and IVR prompts — "Please hold while I transfer you."
  • Appointment confirmations — The template portion: "Your appointment is confirmed for" (dynamic date/time still generated live).
Teams that implement TTS caching for their top 10 most-repeated phrases typically see a 15–25% reduction in total ElevenLabs character consumption. For high-volume deployments (5,000+ calls/week), the savings can exceed $500/month.
The fix: Identify your top 20 most-spoken phrases by frequency. Pre-generate the audio for any phrase that is spoken identically on more than 100 calls/month. Serve cached audio for those phrases and route only dynamic, caller-specific content through the live TTS API.
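One minimal way to implement the cache, assuming a hypothetical `generate_tts` helper standing in for the live ElevenLabs call (the real SDK call and audio format will differ):

```python
# Sketch of a TTS cache for repeated phrases: identical text is synthesised
# once, then served from disk on every subsequent call at zero TTS cost.
import hashlib
import tempfile
from pathlib import Path

CACHE_DIR = Path(tempfile.mkdtemp(prefix="tts_cache_"))

def generate_tts(text: str) -> bytes:
    # Placeholder for the paid, per-character TTS API call.
    return f"<audio:{text}>".encode()

def speak(text: str) -> bytes:
    """Return audio for `text`, hitting the paid TTS API only on a cache miss."""
    key = hashlib.sha256(text.encode()).hexdigest()
    cached = CACHE_DIR / f"{key}.mp3"
    if cached.exists():
        return cached.read_bytes()   # free: served from cache
    audio = generate_tts(text)       # paid: live generation
    cached.write_bytes(audio)
    return audio

greeting = "Thank you for calling Acme Corp, how can I help you today?"
first = speak(greeting)   # generates and caches
second = speak(greeting)  # cache hit, no TTS charge
```

In production you would serve the cached files from a CDN or media server as the section suggests; hashing the exact text keeps the template portion cached while caller-specific content still routes to the live API.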

Cost trap 5: provider-layer blindness — you cannot cut what you cannot see

The thread connecting every cost trap above is the same: you cannot optimise what you cannot measure at the call level. Monthly invoices tell you total spend. They do not tell you which agent configuration costs 4x more per call than the others. They do not tell you which calling window has a 12% failure rate inflating your telephony and TTS bill simultaneously. They do not tell you that 23% of your ElevenLabs spend last month went to calls that lasted under 8 seconds.
Per-call cost attribution requires correlating billing events across providers. The Twilio CallSid must be matched to the ElevenLabs history item ID and the Vapi or Retell call ID — by timestamp, since no provider stores another provider's identifiers. This correlation is the prerequisite for every optimisation described in this guide.
Done manually, it takes 2–3 hours per investigation: download CSVs from each provider, align timestamps (accounting for the 200–500ms drift between providers), match records, compute per-call totals. Most teams do this once after a surprising invoice and then never again — because the manual process does not scale.
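A stripped-down version of that manual matching step might look like this. The record shapes are illustrative (real CSV exports carry many more fields), and the 500ms window reflects the drift range mentioned above:

```python
# Sketch: pair each Twilio call with the nearest ElevenLabs generation event
# whose timestamp falls inside the cross-provider clock-drift window.
from datetime import datetime, timedelta

DRIFT = timedelta(milliseconds=500)  # assumed cross-provider clock drift

def correlate(twilio_calls, tts_events):
    """Map each CallSid to the closest history item ID within the drift window."""
    matches = {}
    for call in twilio_calls:
        candidates = [e for e in tts_events
                      if abs(e["created_at"] - call["start_time"]) <= DRIFT]
        if candidates:
            best = min(candidates,
                       key=lambda e: abs(e["created_at"] - call["start_time"]))
            matches[call["call_sid"]] = best["history_item_id"]
    return matches

calls = [{"call_sid": "CA123", "start_time": datetime(2026, 1, 5, 9, 0, 0)}]
events = [{"history_item_id": "h1",
           "created_at": datetime(2026, 1, 5, 9, 0, 0, 300_000)}]  # +300ms
print(correlate(calls, events))  # {'CA123': 'h1'}
```

Even this toy version shows why the manual process does not scale: every new provider adds another timestamp column to align, and unmatched records still need human judgment.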
Sherlock Calls performs this cross-provider cost correlation automatically. Connect your Twilio, ElevenLabs, and Vapi or Retell accounts, and ask in Slack: "Which calls cost more than $0.50 this week?" or "What is the average cost per converted call for Agent X?" The per-call breakdown — telephony, TTS, LLM, and platform fees — is returned in seconds, with the specific cost drivers identified.
The free tier covers 100 investigations per workspace. Start with the question that matters most: what is your cost per converted call, not your cost per call? The answer to that question will show you exactly where to cut.

Explore Sherlock for your voice stack

Frequently asked questions

What is the average cost per minute for a voice AI call?

The effective cost per minute for a production voice AI call ranges from $0.13 to $0.35 when you add all provider layers: telephony ($0.01–$0.02/min), LLM inference ($0.006–$0.06/min), TTS generation ($0.04–$0.12/min), and orchestration platform fees ($0.05–$0.08/min). Most teams underestimate total cost by 40–60% because they only look at the platform's advertised base rate.

How can I reduce ElevenLabs costs without switching providers?

Three changes have the largest impact. First, switch from eleven_multilingual_v2 to Flash v2.5 for conversations where ultra-low latency matters more than tonal nuance (same credit cost, under 75ms versus 400–700ms latency, which reduces silence-timeout-induced retries). Second, cap your agent's maximum response length in the system prompt to under 150 characters per turn. Third, set a monthly spending limit in the ElevenLabs dashboard to catch runaway agents before the invoice arrives.

Why is my Vapi bill higher than the advertised $0.05/min?

Vapi's $0.05/min is the platform fee only. Your actual per-minute cost includes the LLM provider (OpenAI, Anthropic), the TTS provider (ElevenLabs, Deepgram), and the telephony provider (Twilio). In production, total cost typically lands between $0.18 and $0.33/min depending on model choices. Check the Vapi cost breakdown in your dashboard to see which layer is the largest contributor.

Does caching reduce TTS costs in voice AI?

Yes. For voice AI applications with repetitive content — greetings, IVR menus, standard disclosures, hold messages — caching pre-generated TTS audio can reduce character consumption by 40–60%. The ROI is highest for applications where 20–30% of spoken content is identical across calls, such as appointment reminders or payment confirmations.

What is the cheapest voice AI platform in 2026?

On a pure per-minute basis, Retell AI offers the lowest effective rate at approximately $0.07/min for the base platform, scaling to $0.13–$0.31/min with all components. However, the cheapest platform is not always the most cost-effective — the platform that produces the fewest failed calls and highest conversion rate typically delivers the lowest cost per converted call, which is the metric that matters.

How do I track voice AI costs per call across multiple providers?

You need to correlate billing events from each provider using the call's unique identifiers — Twilio CallSid, ElevenLabs history item ID, and your orchestration platform's call ID — and sum the charges per call. Most teams do this manually with CSV exports and spreadsheets. Sherlock Calls automates this correlation and surfaces per-call cost breakdowns across your entire stack.

Should I use ElevenLabs Flash or Turbo for voice AI agents?

Both cost the same (1 credit per 2 characters). Flash v2.5 generates audio in under 75ms, while Turbo v2.5 takes around 300ms but produces higher-quality, more emotionally nuanced speech. For real-time conversational agents where silence-detection timeouts are a risk, Flash v2.5 is the safer choice. For outbound calls with longer monologues — such as product demos — Turbo v2.5's quality advantage is worth the latency trade-off.


Ready to investigate your own calls?

Connect Sherlock to your voice providers in under 2 minutes. Free to start — 100 credits, no credit card.