Tutorials · 10 min read · by Jose M. Cobian

Voice Ops Cost Monitoring: The Complete Setup Guide (2026)

Learn how to set up cost monitoring for voice operations across Twilio, ElevenLabs, and Vapi — with practical alerts that catch spend anomalies before the invoice arrives.

TL;DR — The short answer

  • Voice AI costs are metered across 3–5 providers simultaneously — Twilio, ElevenLabs, Vapi, your LLM, and your carrier — and no single provider dashboard shows total per-call cost.

  • The most dangerous cost anomalies are not big line items but small per-call overages that compound silently: a verbose AI agent burning 4x the expected TTS characters costs more per month than an outage.

  • Effective cost monitoring requires three layers: per-provider usage tracking, cross-provider per-call cost aggregation, and anomaly detection against rolling baselines — not just monthly invoice review.

Why voice AI teams keep getting surprised by the invoice

The voice AI cost problem is not that individual services are expensive. Twilio inbound at $0.0085 per minute is cheap. ElevenLabs Turbo at roughly $0.30 per 1,000 characters is reasonable. Vapi's $0.05-per-minute platform fee is transparent. The problem is that these costs are additive, metered independently, and invisible to each other.
A single 2-minute voice AI call can touch five billing meters: Twilio for the phone line ($0.017), your speech-to-text provider for transcription ($0.02), your LLM for the conversational logic ($0.02–$0.20 depending on model and token count), ElevenLabs for text-to-speech ($0.04–$0.12 depending on response length), and Vapi or Retell for orchestration ($0.10). The all-in cost ranges from $0.15 for a short, efficient interaction to $0.50+ for a longer call with verbose responses — and no single provider's dashboard shows that total.
This fragmentation creates a specific failure mode: each provider's spend looks normal in isolation while the aggregate is 2–3x the budget. Your Twilio invoice is flat because call volume did not change. Your ElevenLabs bill spiked 40% — but you only check that dashboard monthly. Your Vapi bill tracks call volume, which looks fine. The aggregate cost per converted call, the metric that actually matters, increased from $0.35 to $0.85 over three weeks, and nobody noticed until the end-of-month invoice review.
Real production incidents confirm this pattern. A Vapi customer discovered that their ElevenLabs monthly character quota had been exhausted — calls had been failing silently for days with no audio because ElevenLabs returned 401 errors that the orchestration layer swallowed without surfacing. The team found out from customer complaints, not from any alert. In another common scenario, a development environment left connected to production Twilio credentials generated thousands of dollars in test call charges over a weekend before anyone noticed Monday morning.
The pattern is always the same: the cost anomaly was visible in the data, but nobody was looking at the right data at the right time.

The five metrics you need to track (and where to find them)

Cost monitoring for voice operations requires tracking five specific metrics across your provider stack. Each metric catches a different class of cost anomaly.
1. Total daily spend per provider. The baseline metric. Pull it from Twilio's Usage Records API (GET /2010-04-01/Accounts/{sid}/Usage/Records/Daily), ElevenLabs' /v1/user/subscription endpoint (current character count vs. limit), and Vapi's usage dashboard or API. Track each provider's daily cost as a time series. Alert when any provider's daily spend exceeds 1.5x the 7-day rolling average for that day of the week (weekday vs. weekend patterns matter — call volume is typically 30–50% lower on weekends).
2. Cost per call (all-in). This is the metric most teams do not track and the one that matters most. To calculate it, you need to correlate each Twilio call (identified by CallSid) with the corresponding ElevenLabs TTS session (identified by history_item_id) and any LLM API calls. Sum the per-provider costs for each call. Track the distribution — mean, p50, p90, p99. A rising p90 means your expensive calls are getting more expensive, even if the average looks stable. Alert when p90 cost per call exceeds 2x the 7-day rolling p90.
3. TTS character consumption rate. ElevenLabs and similar TTS providers meter by characters, not by minutes. A concise AI agent that responds in 50–80 characters per turn uses roughly 300 characters per minute of conversation. A verbose agent — or one whose underlying LLM is hallucinating long, repetitive answers — can consume 800–1,200 characters per minute. Track characters consumed per call and per minute of call time. Alert when per-minute consumption exceeds 1.5x the baseline. This catches LLM verbosity drift before it becomes a billing event.
4. Failed call rate and cost of failures. Failed calls still cost money. Twilio bills for any connected call regardless of outcome. ElevenLabs charges for characters consumed even if the audio was never played to the caller. If your failed call rate is 5% and those calls average 8 seconds each, you are paying for telephony minutes and TTS generation that delivered zero value. Track the cost of failed calls as a separate line item. Alert when failed-call cost exceeds 10% of total daily spend.
5. API error rate by provider. API errors from any provider in the voice chain can trigger retries, which trigger additional costs. A Twilio 11200 error causes the call to fail and be retried. An ElevenLabs rate limit (429) causes the orchestration layer to retry the TTS request — sometimes multiple times — each consuming characters. Track error rates per provider per hour. Alert when the error rate exceeds 2x the baseline for that provider. The cost impact is often 3–5x the error rate itself because of retry amplification.
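The rolling-baseline comparison behind metric 1 can be sketched in a few lines. This is an illustrative sketch, not a production implementation: `spend_anomaly`, the `(weekday, spend)` record shape, and the 1.5x threshold default are assumptions layered on the approach described above, and it presumes you already persist daily totals per provider.

```python
from statistics import mean

def spend_anomaly(daily_totals, today_spend, weekday, threshold=1.5):
    """Flag today's spend if it exceeds threshold x the average spend
    for the same day of the week (captures weekday/weekend patterns).

    daily_totals: list of (weekday, spend) tuples from prior days --
    roughly 4 weeks of history gives each weekday ~4 samples.
    Returns (is_anomalous, baseline_average).
    """
    same_day = [spend for (day, spend) in daily_totals if day == weekday]
    if not same_day:
        return False, 0.0  # no history yet; nothing to compare against
    baseline = mean(same_day)
    return today_spend > threshold * baseline, baseline

# Example: Mondays (weekday 0) averaged $40/day; today, a Monday, hit $75.
history = [(0, 38.0), (0, 41.5), (0, 40.5), (1, 22.0), (1, 25.0)]
flag, baseline = spend_anomaly(history, 75.0, weekday=0)
```

In this example the Monday baseline is $40.00, so any Monday above $60.00 trips the alert.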

The DIY approach: building cost alerts with provider APIs and webhooks

If you want to build cost monitoring yourself, here is the minimum viable implementation. It requires access to each provider's API, a time-series store (even a database table works), and a notification mechanism (Slack webhook, PagerDuty, or email).
Twilio usage tracking. Set up a daily cron job that calls the Twilio Usage Records API for the previous 24 hours. Filter for the calls category and its calls-inbound and calls-outbound subcategories. Store the daily totals. Twilio also supports Usage Triggers — preconfigured thresholds that fire a webhook when your account-level spend crosses a dollar amount. Create triggers at 50%, 75%, and 90% of your monthly budget. The limitation: Usage Triggers are account-level only, so they will not alert you when a single phone number or agent generates disproportionate cost.
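A minimal sketch of that daily pull, assuming the Usage Records Daily REST endpoint and the `requests` library. The function names (`fetch_daily_usage`, `total_spend`) are illustrative, and in real use the account SID and auth token would come from a secrets store rather than environment variables or literals.

```python
import requests

def fetch_daily_usage(account_sid: str, auth_token: str, date: str) -> list:
    """Fetch one day's call usage records from Twilio's REST API.

    date is an ISO date string (YYYY-MM-DD); Twilio authenticates with
    HTTP basic auth using the account SID and auth token.
    """
    url = (f"https://api.twilio.com/2010-04-01/Accounts/{account_sid}"
           f"/Usage/Records/Daily.json")
    resp = requests.get(
        url,
        params={"Category": "calls", "StartDate": date, "EndDate": date},
        auth=(account_sid, auth_token),
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["usage_records"]

def total_spend(records: list) -> float:
    """Sum the price field across usage records (Twilio returns prices
    as strings); this is the daily total you store as a time series."""
    return sum(float(record["price"]) for record in records)
```

Run `total_spend(fetch_daily_usage(...))` from cron once a day and append the result to your time-series store, then feed it into the rolling-average comparison.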
ElevenLabs quota monitoring. Call GET /v1/user/subscription every hour. Extract character_count (used) and character_limit (total). Calculate the burn rate: characters consumed in the last 24 hours divided by 24 gives you an hourly consumption rate. Project when you will hit the limit at the current burn rate. Alert at three thresholds: (1) 70% of monthly limit consumed, (2) 90% of monthly limit consumed, (3) projected exhaustion date is less than 5 days away. The 70% alert is more important than it sounds — ElevenLabs quota exhaustion does not produce a graceful degradation. API calls return a 401 error, and if your orchestration layer does not handle this specific error code, calls proceed with no audio. Your callers hear silence.
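The burn-rate projection can be sketched as follows. The `character_count` and `character_limit` inputs mirror the fields in the /v1/user/subscription response; the helper names and the 24-hour consumption figure are assumptions — you would derive the latter from your own hourly samples of the endpoint.

```python
def days_until_exhaustion(character_count: int,
                          character_limit: int,
                          chars_last_24h: int) -> float:
    """Project days remaining at the current burn rate."""
    remaining = character_limit - character_count
    if chars_last_24h <= 0:
        return float("inf")  # no recent consumption; no projection
    return remaining / chars_last_24h

def quota_alerts(character_count: int,
                 character_limit: int,
                 chars_last_24h: int) -> list:
    """Return the alert messages due at the three thresholds described
    above: 70% used, 90% used, and projected exhaustion under 5 days."""
    used_pct = character_count / character_limit
    alerts = []
    if used_pct >= 0.90:
        alerts.append("90% of monthly character limit consumed")
    elif used_pct >= 0.70:
        alerts.append("70% of monthly character limit consumed")
    days_left = days_until_exhaustion(character_count, character_limit,
                                      chars_last_24h)
    if days_left < 5:
        alerts.append("projected quota exhaustion in under 5 days")
    return alerts
```

For example, 720,000 of 1,000,000 characters used with 80,000 consumed in the last 24 hours trips both the 70% alert and the exhaustion-projection alert (3.5 days of headroom).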
Per-call cost aggregation. This is the hard part. For every Twilio call, log the CallSid, duration, and direction (which determines the per-minute rate). For every ElevenLabs TTS generation tied to that call, log the characters consumed. Join these records by timestamp (Twilio call start time plus or minus 2 seconds against ElevenLabs generation time) to calculate all-in cost per call. Store the results and compute daily statistics.
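The timestamp join described above can be sketched like this, assuming in-memory lists of call and TTS records. The record shapes, rate constants, and function name are illustrative; real code would read rates from your actual pricing tier and join on persisted CallSid / history_item_id logs.

```python
from datetime import datetime, timedelta

INBOUND_PER_MIN = 0.0085   # assumed Twilio inbound rate, USD/min
TTS_PER_1K_CHARS = 0.30    # assumed ElevenLabs Turbo-tier rate, USD

def join_call_costs(calls, tts_sessions, window_s=2):
    """Attach TTS cost to each call whose start time falls within
    window_s seconds of a TTS generation, then sum the all-in cost.

    calls: dicts with call_sid, start (datetime), duration_s
    tts_sessions: dicts with time (datetime), characters
    """
    results = []
    for call in calls:
        chars = sum(
            session["characters"] for session in tts_sessions
            if abs((session["time"] - call["start"]).total_seconds())
               <= window_s
        )
        telephony = (call["duration_s"] / 60) * INBOUND_PER_MIN
        tts = chars / 1000 * TTS_PER_1K_CHARS
        results.append({"call_sid": call["call_sid"],
                        "all_in_cost": round(telephony + tts, 4)})
    return results
```

A 2-minute inbound call joined to a 600-character generation costs $0.017 in telephony plus $0.18 in TTS under these assumed rates, giving an all-in figure of $0.197 to feed into your p50/p90/p99 statistics.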
The challenge with this approach is maintenance. Every API change, every new provider added to your stack, every change in pricing tier requires updating the monitoring code. Teams that build this in-house typically maintain it well for the first 2–3 months, then it drifts as priorities shift — which is precisely when cost anomalies start going undetected.

The multi-provider problem: costs scattered across five dashboards

The fundamental obstacle to voice ops cost monitoring is not technical difficulty — it is fragmentation. A typical production voice AI deployment involves billing relationships with three to five separate providers, each with its own dashboard, its own billing cycle, its own usage API, and its own definition of what constitutes a billable event.
Twilio bills per minute, rounded up to the nearest minute, with different rates for inbound vs. outbound, domestic vs. international, and toll-free vs. local numbers. ElevenLabs bills per character with tier-based pricing that resets monthly. Vapi bills per minute of orchestration time with a base platform fee. Your LLM provider (OpenAI, Anthropic) bills per token with different rates for input and output tokens. Your carrier or SIP trunk bills per minute with different rates by destination.
No single dashboard aggregates these. The person responsible for voice operations cost — whether that is an engineering manager, a finance lead, or a VP of operations — has to log into each dashboard separately, export usage data, normalize the units (minutes vs. characters vs. tokens vs. seconds), align the billing periods (which may not match), and manually calculate the total.
This process happens, at best, once a month during invoice review. Which means cost anomalies have a detection latency of 2–4 weeks. A misconfigured agent that is 3x more expensive per call than it should be runs for 3 weeks before anyone looks at the numbers. At 200 calls per day, that is 4,200 over-budget calls — potentially thousands of dollars in excess spend — before the anomaly is even detected, let alone resolved.
The teams that avoid this failure mode are the ones that automate the aggregation. Whether through a custom internal dashboard, a spreadsheet that pulls from provider APIs weekly, or a tool like Sherlock Calls that correlates provider data continuously, the common factor is that the aggregation happens on a daily cadence or faster — not on a monthly invoice review cycle.

How Sherlock Calls Heartbeat catches cost anomalies automatically

Sherlock Calls Heartbeat was built specifically for this class of problem: autonomous, scheduled monitoring across all your connected voice providers, with cost anomalies as a first-class finding category.
Here is how it works in practice. Heartbeat runs on a schedule you configure — every hour, every 2 hours, every 4 hours, every 8 hours, or daily. On each run, it connects to every provider in your stack (Twilio, ElevenLabs, Vapi, and others you have connected), pulls recent usage data, and runs an autonomous investigation using the same AI engine that powers Sherlock's on-demand investigations. It compares current metrics against a rolling baseline computed from the last 7 days of historical findings — average finding count per day, average severity score per provider and category. When it detects a deviation — a cost spike, an unusual error rate, a latency increase, a volume change — it classifies the finding by severity: critical, high, medium, low, or info.
Cost spikes are a specific finding category in the Heartbeat system. A finding might look like: "ElevenLabs character consumption rate increased 2.4x over the last 6 hours compared to the 7-day baseline. At the current burn rate, the monthly character limit will be exhausted in 4 days. The increase correlates with a new agent deployment at 14:00 UTC." That finding carries a severity of high, which triggers an immediate Slack DM to team members whose interests include costs — without waiting for the daily digest.
The escalation logic is severity-driven and role-aware. Critical findings — like a provider outage causing retry storms that are burning through budget — trigger immediate DMs to all workspace admins plus a post in the configured team Slack channel. High-severity findings go as immediate DMs to relevant team members matched by their role and stated interests. A finance lead who listed "costs" as an interest during onboarding receives cost-related findings immediately. An engineer who listed "errors" receives error-related findings. Medium and low findings are batched into a personalized daily digest, tailored to each team member's role — the finance lead's morning briefing emphasizes spend trends, the engineering manager's emphasizes error rates and performance.
Quiet hours are configurable per workspace. If your team does not want 2 AM alerts for medium-severity cost observations, set quiet hours — Heartbeat still runs and records findings during those hours, but only critical and high-severity alerts are delivered. Everything else waits for the morning digest. The monitoring never stops; only the notification timing adjusts.
The per-run and per-day credit budgets ensure Heartbeat itself does not generate runaway costs. You set the maximum credits each run can consume and the maximum credits per day across all runs. If the budget is reached, the run completes gracefully with the findings gathered so far. No surprise charges from your monitoring tool — which would be ironic in a cost-monitoring context.

Setting up your first cost alerts: a practical checklist

Whether you use a DIY approach or a tool like Sherlock Calls, here are the specific alerts every voice AI operation should have in place by the end of this week.
Alert 1: Daily spend exceeds 1.5x the 7-day average. Scope: per provider (Twilio, ElevenLabs, Vapi separately). Why: catches sudden anomalies — retry storms, misconfigurations, traffic spikes from marketing campaigns. Implementation: pull daily totals from each provider's API, compare against stored 7-day average for the same day of the week. Notification: Slack message to your voice operations channel.
Alert 2: ElevenLabs character budget at 70% consumed. Scope: account-level. Why: ElevenLabs quota exhaustion causes silent call failures with no graceful degradation — your callers hear nothing. You need lead time to either reduce consumption or upgrade your tier. Implementation: hourly poll of /v1/user/subscription. Notification: DM to the engineering lead and the finance lead.
Alert 3: Cost per call p90 exceeds 2x the baseline. Scope: all-in cost (Twilio + TTS + LLM). Why: catches the slow-burn cost anomalies that per-provider alerts miss — an LLM generating longer responses increases TTS cost without changing Twilio cost. Implementation: requires per-call cost aggregation by correlating Twilio CallSids with ElevenLabs session IDs by timestamp. Notification: daily summary to the voice ops lead.
Alert 4: Failed-call cost exceeds 10% of daily spend. Scope: all providers combined. Why: failed calls that still incur cost (Twilio minutes billed, TTS characters consumed for audio that was never heard) represent pure waste. If waste exceeds 10% of total spend, something is systematically broken. Implementation: filter for calls with duration under 10 seconds or explicit error codes, sum their provider costs. Notification: Slack alert to engineering.
Alert 5: TTS characters per minute of call time exceeds 1.5x baseline. Scope: ElevenLabs or equivalent TTS provider. Why: catches LLM verbosity drift — the most common cause of gradual TTS cost increases. An AI agent whose average response grew from 80 characters to 200 characters over two weeks will consume 2.5x the TTS budget with no change in call volume or call duration. Implementation: log characters consumed per call alongside call duration, compute the ratio daily. Notification: weekly trend report to the product owner and engineering lead.
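Alert 5's ratio check is simple to sketch. `verbosity_drift` is a hypothetical helper that takes per-call (characters consumed, call minutes) pairs and a stored baseline rate; the 1.5x threshold matches the alert definition above.

```python
def verbosity_drift(calls, baseline_chars_per_min, threshold=1.5):
    """Detect LLM verbosity drift via TTS characters per call minute.

    calls: list of (characters_consumed, duration_minutes) tuples for
    the period being checked (e.g. one day).
    Returns (is_drifting, observed_chars_per_min).
    """
    total_chars = sum(chars for chars, _ in calls)
    total_mins = sum(mins for _, mins in calls)
    if total_mins == 0:
        return False, 0.0  # no call time; nothing to measure
    rate = total_chars / total_mins
    return rate > threshold * baseline_chars_per_min, rate

# Example: a concise agent baselines ~300 chars/min; today's calls
# averaged 500 chars/min, so the alert fires.
flag, rate = verbosity_drift([(900, 2), (600, 1)], 300)
```

Computed daily and compared against the trailing baseline, this one ratio catches the response-length creep that neither call volume nor call duration will show.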
For teams that want these alerts running today without building the monitoring infrastructure, Sherlock Calls Heartbeat covers all five alert types out of the box. Connect your providers, set your schedule (hourly is recommended for cost-sensitive operations), and the first health check runs automatically. Findings are delivered to Slack with severity-based routing — critical and high findings reach the right person in minutes, everything else appears in the personalized morning digest. Start the free tier at usesherlock.ai.

Explore Sherlock for your voice stack

Frequently asked questions

How much does a typical voice AI call cost across all providers?

A single voice AI call touches multiple billing meters simultaneously. Twilio charges $0.0085/min inbound and $0.014/min outbound. ElevenLabs charges roughly $0.30 per 1,000 characters on Turbo tiers. Vapi adds a $0.05/min platform fee. A typical 2-minute call costs $0.15–$0.40 all-in, but verbose AI responses or long hold times can push individual calls above $1.00 without any single line item appearing unusual.

What are the most common causes of unexpected voice AI bill spikes?

The three most common causes are: (1) LLM hallucinations generating excessively long TTS responses that burn through ElevenLabs character budgets 3–5x faster than expected, (2) retry storms where failed calls are automatically redialed without backoff, doubling or tripling Twilio minutes, and (3) development or staging environments left connected to production API keys, generating real charges on test traffic.

Does Twilio have built-in spend alerts?

Twilio offers usage triggers that can fire HTTP webhooks when your account reaches a spend threshold. You can configure them via the REST API or Console under Account > Usage Triggers. However, these triggers operate at the account level — they cannot alert on per-number, per-agent, or per-campaign spend, which is where most anomalies actually occur in voice AI deployments.

How do I monitor ElevenLabs character usage before hitting the limit?

Call the ElevenLabs /v1/user/subscription endpoint to check your current character count and limit. There is no built-in alerting — if your quota is exhausted, API calls return a 401 error that orchestration layers like Vapi may swallow silently, resulting in calls with no audio. You need an external monitor polling this endpoint and alerting at 70% and 90% thresholds.

Can I set up cost alerts per voice AI agent or campaign?

Not natively in most providers. Twilio usage triggers are account-level. ElevenLabs tracks usage per API key, not per agent. To get per-agent cost visibility, you need to tag calls at the application layer — logging the agent ID alongside the Twilio CallSid and ElevenLabs session ID — and then aggregating costs in your own system or using a tool like Sherlock Calls that correlates across providers automatically.

What is the fastest way to detect a voice AI cost anomaly?

Compare today's spend-per-hour against a 7-day rolling baseline. Any hour where spend exceeds 2x the baseline average for that hour-of-day warrants an alert. This catches both sudden spikes (retry storms, misconfigurations) and gradual drift (increasingly verbose LLM responses). Sherlock Calls Heartbeat runs this comparison automatically on a configurable schedule and posts findings to Slack.

How does Sherlock Calls Heartbeat monitor voice spend?

Heartbeat is an autonomous scheduled health check that runs on a configurable cadence — hourly, every 4 hours, or daily. It connects to your Twilio, ElevenLabs, and Vapi accounts, pulls recent usage data, compares it against historical baselines, and classifies any anomalies by severity. Cost spikes are a specific finding category. Critical and high-severity findings are delivered as immediate Slack DMs to relevant team members; lower-severity observations go into a personalized daily digest.


Ready to investigate your own calls?

Connect Sherlock to your voice providers in under 2 minutes. Free to start — 100 credits, no credit card.