What is Voice Activity Detection (VAD) in AI calling?

VAD (Voice Activity Detection) is the system that determines when a caller has finished speaking. It distinguishes between speech, silence, and background noise to know when to trigger the AI's response. Poor VAD causes AI to interrupt callers mid-sentence or pause too long before responding.

How AI Calling Actually Works: The Complete Technical Guide (STT → LLM → TTS)

Last Updated: May 26, 2026 | 20-minute read

Who this is for: Sales engineers evaluating AI calling platforms, developers building voice agents, and curious practitioners who want to understand what's happening under the hood when an AI "listens" and responds on a phone call. No machine learning PhD required — but some comfort with APIs and systems thinking is helpful.

The 7-Step Journey of a Single AI Call

When a prospect picks up an AI-generated call, here's what happens in the 800ms-2,400ms before they hear the AI speak:

1. TELEPHONY LAYER    → Call routed via SIP/PSTN to your number
2. AUDIO CAPTURE      → Raw audio stream extracted from phone call
3. VAD               → "Has the human stopped speaking?"
4. STT               → Convert speech audio → text transcript
5. LLM INFERENCE     → Generate AI response text from transcript
6. TTS SYNTHESIS     → Convert response text → speech audio
7. AUDIO PLAYBACK    → Stream audio back through phone call

Each step introduces latency. The sum of all seven is what the prospect experiences as the "pause" before the AI responds. This is the core engineering challenge of AI calling.

Layer 1: Telephony — How the Call Gets Routed

Before any AI can process a call, the phone infrastructure must connect. AI calling platforms use two protocols:

SIP (Session Initiation Protocol)

The industry standard for VoIP calls. An AI calling platform connects to the phone network through a SIP trunk — a digital pipeline from their servers to the PSTN (Public Switched Telephone Network).

Key components:

SIP Trunk Provider: Twilio, Telnyx, Vonage, or Bandwidth. This is where per-minute telephony costs originate ( $0.005-$ 0.03/min).
DID (Direct Inward Dial): The actual phone number that appears on caller ID. Platforms provision DIDs in bulk for simultaneous outbound campaigns.
Call pacing: Predictive dialers regulate how many calls initiate per second to avoid overwhelming agent capacity or triggering carrier spam filters.

WebRTC (Web Real-Time Communication)

An alternative protocol used when calls are browser-originated (click-to-call, web widgets). Handles audio encoding natively in the browser, removing the need for SIP infrastructure. Lower latency within controlled environments; less reliable for outbound at scale.

The Carrier Spam Filter Problem

Mobile carriers (AT&T, Verizon, T-Mobile) use STIR/SHAKEN call authentication to flag calls from numbers with poor reputation. AI calling campaigns from new or unregistered numbers get labeled "Scam Likely" — immediately reducing connect rates by 30-60%.

What good platforms do: Register numbers with carriers in advance, use number warming (low-volume calls before campaigns), and rotate DID pools to avoid reputation damage.

Layer 2: Audio Capture & Encoding

Once connected, the phone call's audio stream is captured and encoded. This happens differently depending on the telephony protocol:

SIP Calls: Audio arrives as G.711 μ-law (PCMU) or G.722 wideband codec. Platform converts to PCM 16kHz mono before passing to STT.
WebRTC Calls: Audio arrives as Opus-encoded stream. Platform decodes before STT processing.

Why this matters: Low-quality audio encoding (8kHz narrowband, high compression) degrades STT accuracy by 15-30%. Platforms that use wideband audio (G.722 or Opus) produce cleaner transcripts.

Layer 3: Voice Activity Detection (VAD) — The Unsung Hero

VAD is the technology that answers the question: "Has the human stopped talking, and should I respond now?"

This sounds simple. It is not. VAD must distinguish:

Speech vs. silence vs. background noise
End of utterance vs. mid-sentence pause ("I was thinking... yeah, go ahead")
Interruption intent vs. filler word ("um, yeah, uh-huh")

VAD Implementations

VAD Type	Latency	Accuracy	Used By
Energy-based VAD	0-50ms	70-80%	Legacy systems
Silero VAD	30-80ms	88-93%	Most modern platforms
WebRTC VAD	10-30ms	82-87%	Browser-based agents
LLM-guided VAD	200-500ms	94-97%	Premium platforms

The Three VAD Failure Modes

1. Premature cutoff: VAD triggers response before the human finishes. Result: AI interrupts. This is the #1 complaint in AI calling.

2. Silence timeout too long: VAD waits too long after silence to trigger response. Result: Awkward 2-3 second pause. Feels robotic.

3. Background noise false positive: AC noise, traffic, or music triggers VAD on silence. Result: AI speaks while human is paused thinking. Double interruption.

Best practice: Tunable end-of-utterance timeout. Aggressive (300ms) for quick-turn conversations; conservative (800ms) for complex, thoughtful responses.

Layer 4: Speech-to-Text (STT) — Converting Voice to Words

STT converts the raw audio stream into text that the LLM can process. This is one of the fastest-evolving components in the stack.

Major STT Providers in 2026

Provider	Model	Accuracy (English)	Latency	Best For
Deepgram	Nova-3	95-97%	150-300ms	Real-time streaming
OpenAI	Whisper Large v3	96-98%	400-900ms	Batch, high accuracy
Google	Chirp 2	94-96%	200-400ms	Multilingual
AssemblyAI	Universal-1	93-96%	200-350ms	Speaker diarization
Microsoft	Azure Fast Transcription	93-95%	150-250ms	Azure ecosystem
Speechmatics	Flow	91-94%	180-320ms	Accented speech

Streaming vs. Batch STT

There are two modes:

Batch STT: Audio is recorded, then sent to STT after the user finishes speaking. More accurate, but adds 200-600ms latency.

Streaming STT: Audio is transcribed in real-time as the user speaks. The transcript is continuously updated and passed to the LLM the moment VAD detects end-of-speech. Reduces total latency by 200-500ms at the cost of occasional mid-utterance transcription errors.

Most production AI calling platforms use streaming STT. The latency advantage outweighs the small accuracy trade-off.

STT Accuracy on Phone Calls

Phone audio is not podcast quality. STT must handle:

8kHz narrowband audio (standard PSTN)
Background noise (offices, cars, street)
Heavy accents and non-native English
Domain-specific terminology ("EBITDA", "HIPAA", "lead qualification")

Custom vocabulary and model fine-tuning can increase domain-specific accuracy by 8-15%. This is especially valuable in healthcare, legal, and financial services.

Layer 5: LLM Inference — The Brain of the AI Agent

The STT output (text transcript of what the prospect said) is passed to a Large Language Model, which generates the AI's response.

How the LLM Receives Context

Every AI calling platform sends a structured "prompt" to the LLM at each turn of the conversation. This prompt contains:

[SYSTEM PROMPT]
You are a friendly SDR for [Company]. Your job is to qualify leads for [Product].
- Ask about budget, timeline, and decision-making authority
- If the prospect mentions [keyword], route to human immediately
- Do not make claims about pricing without checking the knowledge base

[CONVERSATION HISTORY]
Human: "Yeah, I'm interested in learning more."
AI: "Great! Can I ask what's your main challenge with [problem] right now?"
Human: [current utterance - just transcribed]

[KNOWLEDGE BASE - RAG CONTEXT]
[Relevant product/company information retrieved for this call]

The LLM processes this and generates the next AI response. LLM inference is usually the largest latency contributor in the entire pipeline.

LLM Latency (Time to First Token)

Model	Provider	TTFT (p50)	Cost/Min (est.)	Intelligence
GPT-4o-mini	OpenAI	400-600ms	$0.01-0.03	Good
GPT-4o	OpenAI	600-1,200ms	$0.08-0.15	Excellent
Claude 3.5 Haiku	Anthropic	300-500ms	$0.005-0.01	Good
Claude 3.5 Sonnet	Anthropic	500-900ms	$0.06-0.12	Excellent
Llama 3.3 70B	Groq	150-250ms	$0.02-0.05	Good
Gemini 2.0 Flash	Google	250-450ms	$0.01-0.03	Good

Why Groq (Llama on dedicated hardware) is popular for low-latency AI calling: 150-250ms TTFT is 3-5x faster than cloud LLM providers. The trade-off is slightly lower reasoning capability than GPT-4o.

Streaming LLM Responses

Just as streaming STT reduces latency, streaming LLM output reduces TTS latency. Instead of waiting for the full response to generate, the TTS system begins synthesizing audio as soon as the first sentence token arrives.

With streaming:

LLM begins outputting: "That's a great question—"
TTS begins synthesizing "That's" simultaneously
By the time LLM outputs the full response, "That's a great question—" is already buffered for playback

This technique reduces perceived latency by 200-400ms.

RAG (Retrieval-Augmented Generation) in AI Calling

RAG allows the AI to "look up" information during a call rather than relying solely on what's in the system prompt. The knowledge base might contain:

Product pricing and features
Case studies and proof points
Objection handling playbooks
Prospect's CRM history

RAG architecture adds 50-200ms of latency (vector database lookup) but dramatically reduces hallucination rates and enables real-time personalization.

Layer 6: Text-to-Speech (TTS) — Making the AI Sound Human

The LLM's text response is converted to audio by a TTS engine. This is where the "voice" of your AI agent is determined.

Major TTS Providers in 2026

Provider	Model	Quality	Latency	Cost/1M chars	Emotion Control
ElevenLabs	Turbo v3	⭐⭐⭐⭐⭐	80-150ms	$11-18	Yes
OpenAI	TTS-1-HD	⭐⭐⭐⭐	100-200ms	$30	Limited
PlayHT	PlayDialog	⭐⭐⭐⭐	70-120ms	$8-15	Yes
Cartesia	Sonic	⭐⭐⭐⭐⭐	50-90ms	$15-25	Yes
Azure	Neural TTS	⭐⭐⭐	100-250ms	$4-8	Limited
Google	WaveNet	⭐⭐⭐	80-180ms	$4-16	Minimal

What Makes a Voice Sound Human

Human speech contains:

Prosody: Natural rhythm, stress, and intonation
Micro-pauses: Brief hesitations between phrases
Emphasis variation: Key words spoken slightly louder/slower
Emotional coloring: Warmth, enthusiasm, curiosity

Modern neural TTS models (ElevenLabs Turbo v3, Cartesia Sonic) can reproduce all four. Models from 2021-2022 (Google Standard, Polly) cannot — they sound flat and mechanical.

Voice Cloning for AI Calling

Most premium TTS providers offer voice cloning — creating a custom voice from 30-60 seconds of audio. This allows:

AI agent sounds like a specific salesperson
Consistent brand voice across all AI calls
Localized voices for different markets

Ethical consideration: Voice cloning for AI calling without disclosure may violate the FCC's 2024 AI voice rules. Always disclose when an AI voice is being used in marketing contexts.

The Full Latency Chain: Where Your Milliseconds Go

Here's the complete breakdown of a typical AI calling turn, from when the prospect finishes speaking to when they hear the AI's first word:

Prospect finishes speaking
         │
         ▼
[VAD End-of-Utterance Detection]         50-200ms
         │
         ▼
[Audio Processing & Encoding]            10-30ms
         │
         ▼
[STT Transcription (streaming)]          150-350ms
         │
         ▼
[LLM Inference (first token, streaming)] 300-1,200ms
         │
         ▼
[TTS Synthesis (first chunk, streaming)] 50-150ms
         │
         ▼
[Network Transmission to Phone]          30-100ms
         │
         ▼
Prospect hears first word
         │
TOTAL:   590ms – 2,030ms

The practical threshold: Under 800ms feels like a natural conversation. 800ms-1,400ms feels slightly delayed but acceptable. Over 1,400ms feels robotic and triggers hang-ups.

How AI Calling Platforms Optimize Latency

Technique 1: Parallel Processing

Start TTS as soon as the first LLM sentence token arrives. Don't wait for the full response.

Technique 2: Speculative Responses

Pre-compute likely responses for common utterances ("yes", "I understand", "tell me more"). Cache TTS audio for instant playback on these high-frequency turns.

Technique 3: Edge Inference

Run LLM inference on servers geographically close to the caller. Moving from US West to a regional edge server reduces network latency by 30-80ms for APAC or European callers.

Technique 4: Dynamic VAD Tuning

Shorter end-of-utterance timeout (250ms) for quick-turn conversations. Longer (700ms) for complex questions where callers naturally pause while thinking.

Technique 5: STT Confidence Thresholding

Begin LLM processing when STT transcript confidence crosses 85% — don't wait for 100% certainty. Re-run if the final transcript differs from the preliminary.

What Happens When Things Go Wrong

Scenario 1: STT Mishears a Word

Prospect says: "We need better compliance with HIPAA." STT outputs: "We need better compliance with hip-hop."

LLM response might: Respond sensibly to context, ignoring the mis-transcription, or say something confusing if "hip-hop" triggers a nonsensical branch.

Fix: Confidence scoring + custom vocabulary for domain terms.

Scenario 2: LLM Hallucination

Prospect asks: "What does your platform cost?" LLM responds with pricing it invented that doesn't match reality.

Fix: RAG retrieval for pricing data, LLM guardrails ("Do not answer pricing questions. Say: 'I'll connect you with the right person for pricing specifics.'")

Scenario 3: Double Interruption

AI and prospect both start speaking at the same moment. VAD triggers response. Prospect hears garbled audio from competing streams.

Fix: Full-duplex audio handling with audio mixing + priority logic (human speech always wins).

Scenario 4: Call Drops Mid-Conversation

SIP connection drops at 45 seconds. AI session state is lost. No callback, no CRM record, no follow-up.

Fix: Session state persistence every 10 seconds. Automatic retry logic with saved conversation context.

The Emerging Architecture: Multimodal AI Calling

The next generation of AI calling is moving beyond audio-only. In 2026, early platforms are experimenting with:

Real-time web lookup: AI browses the prospect's company website mid-call to personalize
Live CRM enrichment: Pulling prospect's deal history, prior calls, and notes as the call progresses
Emotion detection: Analyzing vocal tone to detect frustration, interest, or uncertainty
Predictive objection handling: Classifying objections in real-time and surfacing relevant responses from a playbook

These capabilities add latency but enable more sophisticated conversations. Expect them to become standard in enterprise AI calling by 2027.

Frequently Asked Questions

What technology does AI calling use?

AI calling uses a four-layer stack: Telephony (SIP/WebRTC) to route calls, Speech-to-Text (STT) to transcribe speech, a Large Language Model (LLM) to generate responses, and Text-to-Speech (TTS) to speak the response. These run sequentially and must complete in under 1-2 seconds for natural conversation.

Why do AI calls have a delay before responding?

The delay comes from four sequential processing steps: STT transcription (150-400ms), LLM inference (400-1,200ms), TTS synthesis (100-300ms), and network transmission (80-200ms). Best-in-class platforms achieve under 800ms through streaming STT, edge LLM inference, and pre-buffered TTS.

What is the best LLM for AI calling?

For low-latency outbound sales calls, Llama 3.3 70B on Groq (150-250ms TTFT) or Claude 3.5 Haiku (300-500ms) offer the best speed-quality trade-off. For complex enterprise calls requiring high reasoning, GPT-4o or Claude 3.5 Sonnet deliver better conversation quality at the cost of 600-1,200ms additional latency.

How accurate is speech recognition in AI calling?

Modern STT systems like Deepgram Nova-3 achieve 95-97% word accuracy on clear English audio. Accuracy drops to 88-93% with background noise, heavy accents, or narrow-band phone audio. Custom vocabulary fine-tuning can recover 5-8% accuracy on domain-specific terms.

Can AI calling be done without a cloud LLM?

Yes. Some platforms support on-premise or private cloud LLM deployment using open-source models like Llama 3 or Mistral. This reduces per-call LLM cost but requires significant infrastructure investment and sacrifices the latest model capabilities.

Technical specifications cited in this article are based on published API documentation, community benchmarks, and first-party testing as of May 2026. Specific latency numbers will vary based on infrastructure geography, audio quality, and LLM model selection. This article is updated as the technology evolves.