Last Updated: May 26, 2026 | 20-minute read
Who this is for: Sales engineers evaluating AI calling platforms, developers building voice agents, and curious practitioners who want to understand what's happening under the hood when an AI "listens" and responds on a phone call. No machine learning PhD required — but some comfort with APIs and systems thinking is helpful.
The 7-Step Journey of a Single AI Call
When a prospect picks up an AI-generated call, here's what happens in the 800ms-2,400ms before they hear the AI speak:
1. TELEPHONY LAYER → Call routed via SIP/PSTN to your number
2. AUDIO CAPTURE → Raw audio stream extracted from phone call
3. VAD → "Has the human stopped speaking?"
4. STT → Convert speech audio → text transcript
5. LLM INFERENCE → Generate AI response text from transcript
6. TTS SYNTHESIS → Convert response text → speech audio
7. AUDIO PLAYBACK → Stream audio back through phone call
Each step introduces latency. The sum of all seven is what the prospect experiences as the "pause" before the AI responds. This is the core engineering challenge of AI calling.
Layer 1: Telephony — How the Call Gets Routed
Before any AI can process a call, the phone infrastructure must connect. AI calling platforms use two protocols:
SIP (Session Initiation Protocol)
The industry standard for VoIP calls. An AI calling platform connects to the phone network through a SIP trunk — a digital pipeline from their servers to the PSTN (Public Switched Telephone Network).
Key components:
- SIP Trunk Provider: Twilio, Telnyx, Vonage, or Bandwidth. This is where per-minute telephony costs originate (0.03/min).
- DID (Direct Inward Dial): The actual phone number that appears on caller ID. Platforms provision DIDs in bulk for simultaneous outbound campaigns.
- Call pacing: Predictive dialers regulate how many calls initiate per second to avoid overwhelming agent capacity or triggering carrier spam filters.
WebRTC (Web Real-Time Communication)
An alternative protocol used when calls are browser-originated (click-to-call, web widgets). Handles audio encoding natively in the browser, removing the need for SIP infrastructure. Lower latency within controlled environments; less reliable for outbound at scale.
The Carrier Spam Filter Problem
Mobile carriers (AT&T, Verizon, T-Mobile) use STIR/SHAKEN call authentication to flag calls from numbers with poor reputation. AI calling campaigns from new or unregistered numbers get labeled "Scam Likely" — immediately reducing connect rates by 30-60%.
What good platforms do: Register numbers with carriers in advance, use number warming (low-volume calls before campaigns), and rotate DID pools to avoid reputation damage.
Layer 2: Audio Capture & Encoding
Once connected, the phone call's audio stream is captured and encoded. This happens differently depending on the telephony protocol:
- SIP Calls: Audio arrives as G.711 μ-law (PCMU) or G.722 wideband codec. Platform converts to PCM 16kHz mono before passing to STT.
- WebRTC Calls: Audio arrives as Opus-encoded stream. Platform decodes before STT processing.
Why this matters: Low-quality audio encoding (8kHz narrowband, high compression) degrades STT accuracy by 15-30%. Platforms that use wideband audio (G.722 or Opus) produce cleaner transcripts.
Layer 3: Voice Activity Detection (VAD) — The Unsung Hero
VAD is the technology that answers the question: "Has the human stopped talking, and should I respond now?"
This sounds simple. It is not. VAD must distinguish:
- Speech vs. silence vs. background noise
- End of utterance vs. mid-sentence pause ("I was thinking... yeah, go ahead")
- Interruption intent vs. filler word ("um, yeah, uh-huh")
VAD Implementations
| VAD Type | Latency | Accuracy | Used By |
|---|---|---|---|
| Energy-based VAD | 0-50ms | 70-80% | Legacy systems |
| Silero VAD | 30-80ms | 88-93% | Most modern platforms |
| WebRTC VAD | 10-30ms | 82-87% | Browser-based agents |
| LLM-guided VAD | 200-500ms | 94-97% | Premium platforms |
The Three VAD Failure Modes
1. Premature cutoff: VAD triggers response before the human finishes. Result: AI interrupts. This is the #1 complaint in AI calling.
2. Silence timeout too long: VAD waits too long after silence to trigger response. Result: Awkward 2-3 second pause. Feels robotic.
3. Background noise false positive: AC noise, traffic, or music triggers VAD on silence. Result: AI speaks while human is paused thinking. Double interruption.
Best practice: Tunable end-of-utterance timeout. Aggressive (300ms) for quick-turn conversations; conservative (800ms) for complex, thoughtful responses.
Layer 4: Speech-to-Text (STT) — Converting Voice to Words
STT converts the raw audio stream into text that the LLM can process. This is one of the fastest-evolving components in the stack.
Major STT Providers in 2026
| Provider | Model | Accuracy (English) | Latency | Best For |
|---|---|---|---|---|
| Deepgram | Nova-3 | 95-97% | 150-300ms | Real-time streaming |
| OpenAI | Whisper Large v3 | 96-98% | 400-900ms | Batch, high accuracy |
| Chirp 2 | 94-96% | 200-400ms | Multilingual | |
| AssemblyAI | Universal-1 | 93-96% | 200-350ms | Speaker diarization |
| Microsoft | Azure Fast Transcription | 93-95% | 150-250ms | Azure ecosystem |
| Speechmatics | Flow | 91-94% | 180-320ms | Accented speech |
Streaming vs. Batch STT
There are two modes:
Batch STT: Audio is recorded, then sent to STT after the user finishes speaking. More accurate, but adds 200-600ms latency.
Streaming STT: Audio is transcribed in real-time as the user speaks. The transcript is continuously updated and passed to the LLM the moment VAD detects end-of-speech. Reduces total latency by 200-500ms at the cost of occasional mid-utterance transcription errors.
Most production AI calling platforms use streaming STT. The latency advantage outweighs the small accuracy trade-off.
STT Accuracy on Phone Calls
Phone audio is not podcast quality. STT must handle:
- 8kHz narrowband audio (standard PSTN)
- Background noise (offices, cars, street)
- Heavy accents and non-native English
- Domain-specific terminology ("EBITDA", "HIPAA", "lead qualification")
Custom vocabulary and model fine-tuning can increase domain-specific accuracy by 8-15%. This is especially valuable in healthcare, legal, and financial services.
Layer 5: LLM Inference — The Brain of the AI Agent
The STT output (text transcript of what the prospect said) is passed to a Large Language Model, which generates the AI's response.
How the LLM Receives Context
Every AI calling platform sends a structured "prompt" to the LLM at each turn of the conversation. This prompt contains:
[SYSTEM PROMPT]
You are a friendly SDR for [Company]. Your job is to qualify leads for [Product].
- Ask about budget, timeline, and decision-making authority
- If the prospect mentions [keyword], route to human immediately
- Do not make claims about pricing without checking the knowledge base
[CONVERSATION HISTORY]
Human: "Yeah, I'm interested in learning more."
AI: "Great! Can I ask what's your main challenge with [problem] right now?"
Human: [current utterance - just transcribed]
[KNOWLEDGE BASE - RAG CONTEXT]
[Relevant product/company information retrieved for this call]
The LLM processes this and generates the next AI response. LLM inference is usually the largest latency contributor in the entire pipeline.
LLM Latency (Time to First Token)
| Model | Provider | TTFT (p50) | Cost/Min (est.) | Intelligence |
|---|---|---|---|---|
| GPT-4o-mini | OpenAI | 400-600ms | $0.01-0.03 | Good |
| GPT-4o | OpenAI | 600-1,200ms | $0.08-0.15 | Excellent |
| Claude 3.5 Haiku | Anthropic | 300-500ms | $0.005-0.01 | Good |
| Claude 3.5 Sonnet | Anthropic | 500-900ms | $0.06-0.12 | Excellent |
| Llama 3.3 70B | Groq | 150-250ms | $0.02-0.05 | Good |
| Gemini 2.0 Flash | 250-450ms | $0.01-0.03 | Good |
Why Groq (Llama on dedicated hardware) is popular for low-latency AI calling: 150-250ms TTFT is 3-5x faster than cloud LLM providers. The trade-off is slightly lower reasoning capability than GPT-4o.
Streaming LLM Responses
Just as streaming STT reduces latency, streaming LLM output reduces TTS latency. Instead of waiting for the full response to generate, the TTS system begins synthesizing audio as soon as the first sentence token arrives.
With streaming:
- LLM begins outputting: "That's a great question—"
- TTS begins synthesizing "That's" simultaneously
- By the time LLM outputs the full response, "That's a great question—" is already buffered for playback
This technique reduces perceived latency by 200-400ms.
RAG (Retrieval-Augmented Generation) in AI Calling
RAG allows the AI to "look up" information during a call rather than relying solely on what's in the system prompt. The knowledge base might contain:
- Product pricing and features
- Case studies and proof points
- Objection handling playbooks
- Prospect's CRM history
RAG architecture adds 50-200ms of latency (vector database lookup) but dramatically reduces hallucination rates and enables real-time personalization.
Layer 6: Text-to-Speech (TTS) — Making the AI Sound Human
The LLM's text response is converted to audio by a TTS engine. This is where the "voice" of your AI agent is determined.
Major TTS Providers in 2026
| Provider | Model | Quality | Latency | Cost/1M chars | Emotion Control |
|---|---|---|---|---|---|
| ElevenLabs | Turbo v3 | ⭐⭐⭐⭐⭐ | 80-150ms | $11-18 | Yes |
| OpenAI | TTS-1-HD | ⭐⭐⭐⭐ | 100-200ms | $30 | Limited |
| PlayHT | PlayDialog | ⭐⭐⭐⭐ | 70-120ms | $8-15 | Yes |
| Cartesia | Sonic | ⭐⭐⭐⭐⭐ | 50-90ms | $15-25 | Yes |
| Azure | Neural TTS | ⭐⭐⭐ | 100-250ms | $4-8 | Limited |
| WaveNet | ⭐⭐⭐ | 80-180ms | $4-16 | Minimal |
What Makes a Voice Sound Human
Human speech contains:
- Prosody: Natural rhythm, stress, and intonation
- Micro-pauses: Brief hesitations between phrases
- Emphasis variation: Key words spoken slightly louder/slower
- Emotional coloring: Warmth, enthusiasm, curiosity
Modern neural TTS models (ElevenLabs Turbo v3, Cartesia Sonic) can reproduce all four. Models from 2021-2022 (Google Standard, Polly) cannot — they sound flat and mechanical.
Voice Cloning for AI Calling
Most premium TTS providers offer voice cloning — creating a custom voice from 30-60 seconds of audio. This allows:
- AI agent sounds like a specific salesperson
- Consistent brand voice across all AI calls
- Localized voices for different markets
Ethical consideration: Voice cloning for AI calling without disclosure may violate the FCC's 2024 AI voice rules. Always disclose when an AI voice is being used in marketing contexts.
The Full Latency Chain: Where Your Milliseconds Go
Here's the complete breakdown of a typical AI calling turn, from when the prospect finishes speaking to when they hear the AI's first word:
Prospect finishes speaking
│
▼
[VAD End-of-Utterance Detection] 50-200ms
│
▼
[Audio Processing & Encoding] 10-30ms
│
▼
[STT Transcription (streaming)] 150-350ms
│
▼
[LLM Inference (first token, streaming)] 300-1,200ms
│
▼
[TTS Synthesis (first chunk, streaming)] 50-150ms
│
▼
[Network Transmission to Phone] 30-100ms
│
▼
Prospect hears first word
│
TOTAL: 590ms – 2,030ms
The practical threshold: Under 800ms feels like a natural conversation. 800ms-1,400ms feels slightly delayed but acceptable. Over 1,400ms feels robotic and triggers hang-ups.
How AI Calling Platforms Optimize Latency
Technique 1: Parallel Processing
Start TTS as soon as the first LLM sentence token arrives. Don't wait for the full response.
Technique 2: Speculative Responses
Pre-compute likely responses for common utterances ("yes", "I understand", "tell me more"). Cache TTS audio for instant playback on these high-frequency turns.
Technique 3: Edge Inference
Run LLM inference on servers geographically close to the caller. Moving from US West to a regional edge server reduces network latency by 30-80ms for APAC or European callers.
Technique 4: Dynamic VAD Tuning
Shorter end-of-utterance timeout (250ms) for quick-turn conversations. Longer (700ms) for complex questions where callers naturally pause while thinking.
Technique 5: STT Confidence Thresholding
Begin LLM processing when STT transcript confidence crosses 85% — don't wait for 100% certainty. Re-run if the final transcript differs from the preliminary.
What Happens When Things Go Wrong
Scenario 1: STT Mishears a Word
Prospect says: "We need better compliance with HIPAA." STT outputs: "We need better compliance with hip-hop."
LLM response might: Respond sensibly to context, ignoring the mis-transcription, or say something confusing if "hip-hop" triggers a nonsensical branch.
Fix: Confidence scoring + custom vocabulary for domain terms.
Scenario 2: LLM Hallucination
Prospect asks: "What does your platform cost?" LLM responds with pricing it invented that doesn't match reality.
Fix: RAG retrieval for pricing data, LLM guardrails ("Do not answer pricing questions. Say: 'I'll connect you with the right person for pricing specifics.'")
Scenario 3: Double Interruption
AI and prospect both start speaking at the same moment. VAD triggers response. Prospect hears garbled audio from competing streams.
Fix: Full-duplex audio handling with audio mixing + priority logic (human speech always wins).
Scenario 4: Call Drops Mid-Conversation
SIP connection drops at 45 seconds. AI session state is lost. No callback, no CRM record, no follow-up.
Fix: Session state persistence every 10 seconds. Automatic retry logic with saved conversation context.
The Emerging Architecture: Multimodal AI Calling
The next generation of AI calling is moving beyond audio-only. In 2026, early platforms are experimenting with:
- Real-time web lookup: AI browses the prospect's company website mid-call to personalize
- Live CRM enrichment: Pulling prospect's deal history, prior calls, and notes as the call progresses
- Emotion detection: Analyzing vocal tone to detect frustration, interest, or uncertainty
- Predictive objection handling: Classifying objections in real-time and surfacing relevant responses from a playbook
These capabilities add latency but enable more sophisticated conversations. Expect them to become standard in enterprise AI calling by 2027.
Frequently Asked Questions
What technology does AI calling use?
AI calling uses a four-layer stack: Telephony (SIP/WebRTC) to route calls, Speech-to-Text (STT) to transcribe speech, a Large Language Model (LLM) to generate responses, and Text-to-Speech (TTS) to speak the response. These run sequentially and must complete in under 1-2 seconds for natural conversation.
Why do AI calls have a delay before responding?
The delay comes from four sequential processing steps: STT transcription (150-400ms), LLM inference (400-1,200ms), TTS synthesis (100-300ms), and network transmission (80-200ms). Best-in-class platforms achieve under 800ms through streaming STT, edge LLM inference, and pre-buffered TTS.
What is the best LLM for AI calling?
For low-latency outbound sales calls, Llama 3.3 70B on Groq (150-250ms TTFT) or Claude 3.5 Haiku (300-500ms) offer the best speed-quality trade-off. For complex enterprise calls requiring high reasoning, GPT-4o or Claude 3.5 Sonnet deliver better conversation quality at the cost of 600-1,200ms additional latency.
How accurate is speech recognition in AI calling?
Modern STT systems like Deepgram Nova-3 achieve 95-97% word accuracy on clear English audio. Accuracy drops to 88-93% with background noise, heavy accents, or narrow-band phone audio. Custom vocabulary fine-tuning can recover 5-8% accuracy on domain-specific terms.
Can AI calling be done without a cloud LLM?
Yes. Some platforms support on-premise or private cloud LLM deployment using open-source models like Llama 3 or Mistral. This reduces per-call LLM cost but requires significant infrastructure investment and sacrifices the latest model capabilities.
Technical specifications cited in this article are based on published API documentation, community benchmarks, and first-party testing as of May 2026. Specific latency numbers will vary based on infrastructure geography, audio quality, and LLM model selection. This article is updated as the technology evolves.