Deep Dive

Voice AI: Under the Hood

Understand exactly how AI phone agents and voice assistants work - from the moment you speak to when you hear a response.

~500ms total latency · 6 pipeline steps · 95%+ accuracy
The Pipeline

The Complete Voice AI Pipeline

Every voice AI interaction flows through these six stages. Understanding each step helps you optimize for speed, accuracy, and natural conversation.

~0ms

1. Audio Input

User speaks into microphone

Details

The system captures raw audio from the user's microphone. This audio waveform contains the user's speech along with any background noise.

Example

Audio stream: 16kHz sample rate, 16-bit depth
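For a sense of scale, the stream described above is cheap to move around. A quick back-of-envelope in Python:

```python
# Raw bandwidth of the audio stream above:
# 16 kHz sample rate, 16-bit (2-byte) samples, mono.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # 16-bit depth
CHANNELS = 1

bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
print(bytes_per_second)  # 32000 bytes/s, roughly 31 KiB/s
```

At about 32 KB per second of speech, network transport is rarely the bottleneck; the model inference downstream is.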
~100-300ms

2. Speech-to-Text

Convert audio to text (ASR)

Details

Automatic Speech Recognition (ASR) models like Whisper or Deepgram convert the audio waveform into text. Modern models handle accents, background noise, and multiple speakers.

Example

Audio → "I need to book an appointment for tomorrow at 3pm"
~200-500ms

3. Understanding

LLM processes the request

Details

The Large Language Model analyzes the transcribed text to understand intent, extract entities (dates, times, names), and determine the appropriate response or action.

Example

Intent: book_appointment, Time: 3pm, Date: tomorrow
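The understanding step's output is easiest to work with as structured data rather than free text. A minimal sketch - the `ParsedRequest` shape here is illustrative, not a standard:

```python
from dataclasses import dataclass, field

# Hypothetical structured form of the understanding step's output:
# an intent plus extracted entities, ready to drive a function call.
@dataclass
class ParsedRequest:
    intent: str
    entities: dict = field(default_factory=dict)

parsed = ParsedRequest(
    intent="book_appointment",
    entities={"time": "15:00", "date": "tomorrow"},
)
print(parsed.intent)  # book_appointment
```

Downstream stages can then branch on `parsed.intent` instead of re-reading the transcript.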
~100-500ms

4. Action (Optional)

Execute function calls if needed

Details

If the user's request requires external data or actions, the LLM generates function calls. These might query databases, check availability, or update records.

Example

call: check_availability({date: "2024-01-16", time: "15:00"})
~100-300ms

5. Response Generation

Generate natural language response

Details

The LLM generates a contextually appropriate response, incorporating any data retrieved from function calls. The response is crafted for spoken delivery.

Example

"I've found an opening tomorrow at 3pm. Shall I book that for you?"
~100-200ms

6. Text-to-Speech

Convert response to audio

Details

Neural TTS systems like ElevenLabs or Play.ht convert the text response into natural-sounding speech with appropriate intonation, pacing, and emotion.

Example

Text → Natural audio waveform (streamed)
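Streaming is the key trick here: playback can begin as soon as the first chunk arrives, rather than after the full waveform is synthesized. A toy sketch, with a stub generator standing in for a TTS engine:

```python
# Stub TTS: yields text fragments in place of audio chunks, to show
# the streaming shape. A real engine would yield encoded audio.
def synthesize_chunks(text: str, chunk_words: int = 3):
    words = text.split()
    for i in range(0, len(words), chunk_words):
        yield " ".join(words[i:i + chunk_words])

chunks = []
for chunk in synthesize_chunks("I've found an opening tomorrow at 3pm."):
    chunks.append(chunk)  # play(chunk) would start the speaker here
print(chunks[0])  # I've found an
```

The user hears the first fragment while later fragments are still being generated, which is how the perceived latency stays low.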

Total Round-Trip Time: ~500-800ms

Modern voice AI achieves sub-second response times through parallel processing, streaming, and optimized models. This feels natural in conversation - similar to the brief pause when talking to another person.
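Note that the per-stage budgets above do not naively add up to 500-800ms. Summing them shows why the overlap matters:

```python
# Per-stage latency budgets from the pipeline above (min, max) in ms.
stage_budgets_ms = {
    "speech_to_text": (100, 300),
    "understanding":  (200, 500),
    "action":         (100, 500),  # optional step
    "response":       (100, 300),
    "text_to_speech": (100, 200),
}
lo = sum(a for a, _ in stage_budgets_ms.values())
hi = sum(b for _, b in stage_budgets_ms.values())
print(lo, hi)  # 600 1800
```

Run strictly in sequence, the worst case is well above the quoted round-trip; streaming STT during speech and starting TTS before the response is fully generated is what pulls the real total back under a second.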

Step 2

Speech-to-Text (ASR)

Automatic Speech Recognition converts audio waveforms into text. This is one of the most computationally intensive steps, but modern models have achieved remarkable accuracy.

Streaming transcription

Text appears as you speak, not after

Noise cancellation

Works in cars, offices, public spaces

Accent adaptation

Handles regional accents and non-native speakers

Custom vocabulary

Learn industry-specific terms and names
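One lightweight way to approximate a custom vocabulary, if your STT provider doesn't expose one, is post-correcting known domain terms in the transcript. The correction table below is illustrative, not from any vendor:

```python
# Illustrative corrections for terms an ASR model tends to mishear.
CORRECTIONS = {
    "eleven labs": "ElevenLabs",
    "deep gram": "Deepgram",
}

def apply_vocabulary(transcript: str) -> str:
    fixed = transcript
    for wrong, right in CORRECTIONS.items():
        fixed = fixed.replace(wrong, right)
    return fixed

print(apply_vocabulary("we use deep gram for transcription"))
# we use Deepgram for transcription
```

Provider-side custom vocabularies (keyword boosting at decode time) are more robust; this post-hoc pass is a fallback.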

Popular STT Providers Compared

| Feature | Whisper | Deepgram | Google STT |
| --- | --- | --- | --- |
| Real-time streaming (process audio as user speaks) |  |  |  |
| Accuracy on clear audio (word error rate; lower is better) | ~5% WER | ~4% WER | ~6% WER |
| Noise handling (performance in noisy environments) | Good | Excellent | Good |
| Language support (number of supported languages) | 99+ | 36 | 100+ |
| Speaker diarization (identify different speakers) |  |  |  |
| Latency (streaming; time to first word) | ~200ms | ~100ms | ~300ms |
Step 4

Function Calling: Taking Action

This is where voice AI becomes truly powerful. Instead of just generating text, the LLM can decide to call external functions - booking appointments, checking inventory, updating records, or any custom action.

How It Works

  1. LLM identifies that the user's intent requires an action
  2. Model selects the appropriate function from available tools
  3. Extracts parameters from conversation context
  4. System executes the function and returns the result
  5. LLM incorporates the result into its response

Important: Function calls add latency. Design your system to minimize unnecessary calls and parallelize when possible.
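The five steps above can be sketched as a small loop. Here `llm()` and `check_availability()` are stubs standing in for a real model and backend:

```python
import json

# Stub LLM: first emits a function call, then a final reply once a
# function result is present in the conversation.
def llm(messages):
    if not any(m["role"] == "function" for m in messages):
        return {"function_call": {
            "name": "check_availability",
            "arguments": {"date": "2024-01-16", "time": "15:00"}}}
    return {"content": "I've found an opening tomorrow at 3pm. "
                       "Shall I book that for you?"}

def check_availability(date, time):
    return {"available": True}  # stub for a real availability check

TOOLS = {"check_availability": check_availability}
messages = [{"role": "user", "content": "Book me tomorrow at 3pm"}]

while True:
    out = llm(messages)
    if "function_call" not in out:
        final_reply = out["content"]  # step 5: result woven into reply
        break
    call = out["function_call"]
    result = TOOLS[call["name"]](**call["arguments"])  # step 4: execute
    messages.append({"role": "function", "content": json.dumps(result)})

print(final_reply)
```

Each loop iteration is one model round-trip, which is exactly where the extra latency the note above warns about comes from.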

Example Function Call (JSON)

LLM Output
```json
{
  "function_call": {
    "name": "book_appointment",
    "arguments": {
      "date": "2024-01-16",
      "time": "15:00",
      "service": "consultation",
      "customer_name": "Sarah",
      "notes": "First-time customer"
    }
  }
}
```

The LLM generates this structured output when it determines the user wants to book an appointment. Your system parses this and executes the actual booking.
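A sketch of that parse-and-execute step, with `book_appointment` as a stand-in for your real booking code:

```python
import json

# The LLM's structured output, as in the example above.
llm_output = '''{
  "function_call": {
    "name": "book_appointment",
    "arguments": {"date": "2024-01-16", "time": "15:00",
                  "service": "consultation", "customer_name": "Sarah",
                  "notes": "First-time customer"}
  }
}'''

# Stub standing in for the actual booking system.
def book_appointment(date, time, service, customer_name, notes=""):
    return f"Booked {service} for {customer_name} on {date} at {time}"

call = json.loads(llm_output)["function_call"]
handlers = {"book_appointment": book_appointment}
confirmation = handlers[call["name"]](**call["arguments"])
print(confirmation)
# Booked consultation for Sarah on 2024-01-16 at 15:00
```

In production you would validate the arguments against a schema before executing, since the model can emit malformed or missing fields.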

Step 6

Text-to-Speech (TTS)

Modern neural TTS has evolved far beyond robotic voices. Today's systems produce speech that's nearly indistinguishable from humans, with natural intonation, emotion, and pacing.

Voice cloning

Create a custom voice from just minutes of audio samples

Emotional expression

Adjust tone for empathy, enthusiasm, urgency

Streaming output

Audio starts playing before full sentence is generated

TTS Providers Compared

| Feature | ElevenLabs | Play.ht | Amazon Polly |
| --- | --- | --- | --- |
| Voice naturalness (how human-like the voice sounds) | Excellent | Very Good | Good |
| Voice cloning (create custom voices from samples) |  |  |  |
| Emotional control (adjust tone and emotion) |  |  |  |
| Streaming support (start playing before full generation) |  |  |  |
| Latency (time to first audio byte) | ~150ms | ~100ms | ~200ms |
Real Example

A Complete Conversation Flow

Watch how all the pieces work together in a real interaction. This appointment rescheduling takes under 1.5 seconds total processing time.

Total: ~1151ms

"Hi, I need to reschedule my appointment from Friday to next Monday."

Audio captured at 16kHz

Deepgram processes the audio stream in real time.

"Hi, I need to reschedule my appointment from Friday to next Monday."
Intent detected:

reschedule_appointment

Entities extracted:
  • Current date: Friday
  • New date: next Monday

System checks availability and updates booking:

reschedule_booking(from: "Friday", to: "Monday") → Success

LLM crafts a natural, conversational response:

"I've rescheduled your appointment from Friday to Monday at the same time. You'll receive a confirmation email shortly. Is there anything else I can help with?"

ElevenLabs streams natural audio response.

Audio streaming begins immediately

Total Time: ~1.15 seconds

From the user finishing their sentence to hearing the AI's response. This includes a database lookup and update. The user experiences a natural conversational pause.

Where Things Can Go Wrong

Understanding failure modes helps you build more robust voice AI systems.

STT Misrecognition

Background noise, accents, or mumbling can cause transcription errors that cascade through the system.

Mitigation: Implement confirmation for critical actions, use custom vocabularies, allow corrections.

Latency Spikes

Network issues, model overload, or slow function calls can cause awkward delays that break conversation flow.

Mitigation: Use filler phrases, stream all audio, set timeouts, have fallback responses.

Context Loss

Long conversations can exceed context limits, causing the AI to "forget" earlier parts of the conversation.

Mitigation: Summarize context, store key facts externally, use RAG for conversation history.
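One simple version of context summarization: once the history exceeds a turn budget, fold the older turns into a single summary message. `summarize()` is a stub here; a production system would call an LLM or an external store:

```python
MAX_TURNS = 6  # illustrative budget; real limits are token-based

# Stub: a real implementation would ask an LLM to summarize.
def summarize(turns):
    return f"Summary of earlier conversation ({len(turns)} turns)"

def trim_history(history):
    if len(history) <= MAX_TURNS:
        return history
    old, recent = history[:-MAX_TURNS], history[-MAX_TURNS:]
    # Replace the old turns with one synthetic summary message.
    return [{"role": "system", "content": summarize(old)}] + recent

history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
trimmed = trim_history(history)
print(len(trimmed))  # 7: one summary plus the six most recent turns
```

Key facts (names, booked times) should also be written to external storage, since a lossy summary can drop them.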

Interruption Handling

Users interrupting the AI mid-sentence is natural but technically challenging to handle smoothly.

Mitigation: Use barge-in detection, stop audio playback, maintain conversation state.

Function Call Failures

External systems can fail, timeout, or return unexpected results that the AI must handle gracefully.

Mitigation: Set timeouts, have fallback responses, let AI explain issues naturally.
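A minimal timeout-with-fallback guard, assuming the external call can run on a worker thread:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

# Simulates an overloaded booking backend.
def slow_backend():
    time.sleep(2)
    return "Success"

def call_with_timeout(fn, timeout_s=0.5,
                      fallback="I'm having trouble reaching the booking "
                               "system. Can I try again in a moment?"):
    # Run the external call on a worker thread and cap the wait.
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        try:
            return future.result(timeout=timeout_s)
        except TimeoutError:
            return fallback  # a reply the agent can speak naturally

result = call_with_timeout(slow_backend)
print(result)  # the fallback, since the backend took 2s
```

The fallback string doubles as the "let AI explain issues naturally" mitigation: the agent speaks it instead of going silent.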

Unnatural TTS

Even good TTS can mispronounce names, numbers, or domain terms, breaking the illusion of natural speech.

Mitigation: Use phonetic hints, test edge cases, consider hybrid approaches.
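A simple form of phonetic hinting is rewriting troublesome tokens before they reach the TTS engine. The mappings below are illustrative assumptions, not vendor features:

```python
# Illustrative spoken-form rewrites for tokens TTS often mangles.
PHONETIC_HINTS = {
    "3pm": "three P M",
    "Dr.": "Doctor",
    "St.": "Street",
}

def apply_hints(text: str) -> str:
    for token, spoken in PHONETIC_HINTS.items():
        text = text.replace(token, spoken)
    return text

spoken = apply_hints("Dr. Lee can see you at 3pm")
print(spoken)  # Doctor Lee can see you at three P M
```

Many engines also accept SSML or per-word pronunciation overrides, which is more precise than string rewriting.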

Ready to Build Voice AI?

We build custom voice AI solutions that handle real conversations. From appointment booking to customer support, we create systems that sound natural and work reliably.