Building Voice Interfaces That Feel Instant
A deep dive into the latency budget of voice AI pipelines, why WebRTC changes everything, and how I built a production voice agent on celestino.ai that responds in under a second.
A two-second delay in a voice interface does not feel slow. It feels broken. The user does not think "this is loading." They think "this is not working." And then they leave.
I learned this the hard way. When I built the first version of the voice agent for celestino.ai, my pipeline clocked in at around 1.8 seconds end-to-end. On paper, that seemed reasonable. In practice, every single test user paused, repeated themselves, or just started talking over the response. The interface was technically functional and experientially dead.
This post is about what I learned fixing it. Not just the engineering, but the systems thinking behind why voice latency is a fundamentally different problem than page load time, and what it takes to build voice interfaces that feel like conversation instead of command-and-response.
The Latency Budget: Where Every Millisecond Goes
Human conversation operates on tight timing. Psycholinguistic research puts the average gap between conversational turns at roughly 200 milliseconds, and pauses as short as 300ms can already feel unnatural. Beyond 1.5 seconds, the experience degrades rapidly: users repeat themselves, assume the system failed, or disengage entirely.
Now consider what a traditional voice AI pipeline has to do in that window:
- Speech-to-Text (STT): Capture audio, run automatic speech recognition. Budget: 100-500ms.
- LLM Inference: Send the transcript to a language model, generate a response. Budget: 350ms-1s+.
- Text-to-Speech (TTS): Convert the generated text back to audio. Budget: 75-200ms.
Add those up and you are looking at 525ms to 1.7 seconds in the best case. That does not include network hops, queuing, or the silence detection that determines when the user has actually finished speaking. In practice, a naive cascading pipeline lands somewhere between 2 and 4 seconds. That is not a voice interface. That is a walkie-talkie.
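The arithmetic above is worth making explicit. A minimal sketch of how sequential stage budgets compound, using the illustrative ranges from the list (not measurements):

```typescript
// Per-stage latency budgets in milliseconds. Illustrative ranges from
// the list above, not measurements.
type Budget = { best: number; worst: number };

const stages: Budget[] = [
  { best: 100, worst: 500 },  // STT
  { best: 350, worst: 1000 }, // LLM
  { best: 75, worst: 200 },   // TTS
];

// A strictly sequential pipeline pays the full sum of every stage.
function sequentialLatency(budgets: Budget[]): Budget {
  return budgets.reduce(
    (acc, b) => ({ best: acc.best + b.best, worst: acc.worst + b.worst }),
    { best: 0, worst: 0 }
  );
}

const total = sequentialLatency(stages);
// total.best === 525, total.worst === 1700 -- before network, queuing,
// or endpointing delay is even counted.
```

The sum is the floor for a pipeline where every stage waits for the previous one to finish, which is why the rest of this post is about breaking that sequencing.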
The latency budget for voice is not about making each component faster in isolation. It is about rethinking the entire pipeline so that stages overlap, predictions run ahead of certainty, and the user never perceives a gap.
The WebRTC Revolution
For years, voice AI meant server-side processing. Audio goes up to a server, gets transcribed, processed, synthesized, and comes back down. Every step adds a network hop, and every hop adds latency.
WebRTC changes the game. Originally designed for peer-to-peer video calling, WebRTC provides a battle-tested transport layer that runs over UDP with built-in congestion control, packet loss concealment, and adaptive bitrate. When OpenAI launched WebRTC support for their Realtime API in late 2025, it eliminated the architectural middleman.
The difference is stark. In a traditional WebSocket architecture, the flow looks like this:
Client -> Your Backend -> OpenAI -> Your Backend -> Client
Every message traverses your server. That is a double-hop penalty. With WebRTC, the client connects directly to OpenAI's media edge:
Client -> OpenAI Media Edge -> Client
First partial text responses arrive in 150-250ms, and the first audible synthesized phonemes in 220-400ms. That is conversation speed.
But there is a deeper architectural shift happening here. The OpenAI Realtime API is not just a faster pipe. It is a speech-to-speech model. It does not decompose voice into text, reason over text, and recompose text into voice. It operates on audio natively, which means it sidesteps the entire ASR-to-LLM-to-TTS chain. The latency reduction is not incremental. It is structural.
The trade-off? Cost and control. Speech-to-speech models charge roughly 10x what a cascading pipeline costs, partly because the model re-processes accumulated context on every turn. And when something breaks, you cannot inspect a transcript or debug LLM reasoning. The pipeline is opaque.
For production voice agents where you need observability, tool use, and cost efficiency, the cascading pipeline is still the architecture to beat. You just have to make it fast.
LiveKit Voice Pipelines: Production-Grade Architecture
When I rebuilt the voice agent for celestino.ai, I chose LiveKit's Agents SDK. The reason was pragmatic: LiveKit gives you the cascading pipeline architecture with the transport advantages of WebRTC, plus production-grade abstractions for the hard problems, namely turn detection, interruption handling, and streaming orchestration.
Here is the core of how the agent initializes its inference stack:
```typescript
const stt = new inference.STT({
  model: "elevenlabs/scribe_v2_realtime",
  language: "en",
});

const llmModel = new inference.LLM({
  model: "google/gemini-2.5-flash",
});

const tts = new inference.TTS({
  model: "elevenlabs/eleven_flash_v2_5",
  voice: "cjVigY5qzO86Huf0OWal",
  language: "en",
});
```
Each component is chosen for speed at its stage. ElevenLabs Scribe v2 is a streaming-first ASR model. Gemini 2.5 Flash is optimized for low time-to-first-token. ElevenLabs Flash v2.5 is a TTS model built specifically for realtime synthesis. You do not win on latency by picking the most accurate model at each stage. You win by picking the fastest model that clears your quality threshold.
Turn Detection: The Hardest Problem
The most latency-sensitive decision in a voice pipeline is not inference speed. It is knowing when the user has stopped talking.
Get it wrong in one direction, and you cut the user off mid-sentence. Get it wrong in the other, and you add hundreds of milliseconds of dead air after every utterance. Both feel terrible.
The naive approach is pure Voice Activity Detection (VAD): once silence exceeds a threshold, trigger the response. But humans pause mid-thought. They hesitate. They take a breath before the second half of a compound sentence. VAD alone cannot distinguish between "I am done talking" and "I am thinking about what to say next."
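To make that failure mode concrete, here is a minimal sketch of pure threshold-based endpointing. The frame-level speech flags are assumed to come from a VAD, and the 700ms threshold is arbitrary:

```typescript
// Naive endpointing: commit the turn once trailing silence exceeds a
// fixed threshold. Frame flags are assumed to come from a VAD running
// on 20ms audio frames; the threshold value is an arbitrary example.
const FRAME_MS = 20;
const SILENCE_THRESHOLD_MS = 700;

function naiveEndOfTurn(speechFlags: boolean[]): boolean {
  let trailingSilenceMs = 0;
  // Walk backwards from the most recent frame, counting silence.
  for (let i = speechFlags.length - 1; i >= 0; i--) {
    if (speechFlags[i]) break;
    trailingSilenceMs += FRAME_MS;
  }
  return trailingSilenceMs >= SILENCE_THRESHOLD_MS;
}
```

The problem is visible in the signature: the function sees only silence duration. It fires identically after a finished sentence and after a mid-thought breath.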
For celestino.ai, I layer two systems. First, Silero VAD provides raw voice activity signals:
```typescript
const silero = await import("@livekit/agents-plugin-silero");
const vad = await silero.VAD.load();
```
Then, a transformer-based turn detector adds semantic understanding on top. LiveKit's multilingual turn detector is a custom language model that evaluates whether a transcript fragment represents a completed thought:
```typescript
const livekitPlugin = await import("@livekit/agents-plugin-livekit");
const turnDetection = new livekitPlugin.turnDetector.MultilingualModel();
```
The turn detector runs inference in roughly 50ms. That is fast enough to operate in the gap between VAD detecting silence and the system committing to a response. Combined, these two layers let the agent distinguish between a pause and a period.
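The decision logic of layering the two signals can be sketched roughly like this. The `EndOfTurnModel` stands in for a semantic model like LiveKit's turn detector; the probability threshold and silence bounds here are hypothetical, not the SDK's internals:

```typescript
// Sketch: semantic turn detection layered over raw VAD silence.
// All thresholds are illustrative, not LiveKit's actual values.
interface TurnSignal {
  silenceMs: number;  // trailing silence reported by the VAD
  transcript: string; // partial transcript accumulated so far
}

// Hypothetical semantic model: returns P(the thought is complete).
type EndOfTurnModel = (transcript: string) => Promise<number>;

async function shouldRespond(
  signal: TurnSignal,
  model: EndOfTurnModel,
  minSilenceMs = 300,
  maxSilenceMs = 5000
): Promise<boolean> {
  if (signal.silenceMs < minSilenceMs) return false; // still talking
  if (signal.silenceMs >= maxSilenceMs) return true; // hard cap: respond anyway
  // In the ambiguous window, semantics breaks the tie: does the
  // transcript read like a period or like a comma?
  const pComplete = await model(signal.transcript);
  return pComplete > 0.8;
}
```

The hard cap matters: even if the semantic model keeps saying "incomplete," the agent must eventually respond rather than leave the user hanging.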
Interruption Handling: Respecting the User
Real conversations involve interruptions. If a user starts speaking while the agent is mid-response, the agent needs to stop immediately, not finish its sentence and then listen.
The voice options I configure make this explicit:
```typescript
const session = new voice.AgentSession({
  stt,
  llm: llmModel,
  tts,
  vad,
  turnDetection,
  voiceOptions: {
    minEndpointingDelay: 1000,
    maxEndpointingDelay: 5000,
    minInterruptionDuration: 800,
    minInterruptionWords: 2,
    preemptiveGeneration: true,
  },
});
```
The minInterruptionDuration of 800ms and minInterruptionWords of 2 prevent false interruptions from background noise or brief acknowledgments like "uh-huh." But when a genuine interruption comes, the agent yields immediately. This is a human-centric design decision: the system should never talk over the user.
Optimization Techniques That Actually Matter
Beyond architecture, there are specific techniques that shave critical milliseconds from the pipeline.
Streaming Everything
The single biggest optimization is never waiting for a complete result before starting the next stage. Streaming ASR feeds partial transcripts to the LLM. The LLM streams tokens to TTS. TTS streams audio chunks to the client. Each stage begins before the previous one ends. Switching any single component to batch processing, where it waits for the full input, can double your end-to-end latency.
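The overlap is easiest to see with async generators, where each stage consumes its upstream incrementally and yields downstream immediately. This is a toy sketch with fake transforms standing in for real STT, LLM, and TTS clients:

```typescript
// Sketch of stage overlap via async generators. Each stage starts
// producing output before its upstream is finished. The transforms are
// toys: a real LLM streams tokens, a real TTS streams audio chunks.
async function* llmStage(transcripts: AsyncIterable<string>) {
  for await (const t of transcripts) {
    for (const word of t.split(" ")) yield word; // fake "token" stream
  }
}

async function* ttsStage(tokens: AsyncIterable<string>) {
  for await (const token of tokens) {
    yield `audio<${token}>`; // fake audio chunk per token
  }
}

async function* sttSource() {
  yield "hello there"; // fake partial transcript from streaming ASR
}

async function run(): Promise<string[]> {
  const out: string[] = [];
  // The composition is the point: ttsStage pulls from llmStage pulls
  // from sttSource, and nothing waits for a complete upstream result.
  for await (const chunk of ttsStage(llmStage(sttSource()))) out.push(chunk);
  return out;
}
```

The first audio chunk leaves the pipeline as soon as the first token exists, not after the full response is generated.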
Speculative Prefetch (Preemptive Generation)
Notice the preemptiveGeneration: true flag in the session config. This is one of the most impactful optimizations available. When enabled, the agent begins LLM and TTS inference as soon as a user transcript arrives, before the turn detector has confirmed the user is done speaking.
If the user was indeed done, you have saved potentially hundreds of milliseconds. If the user continues speaking, the speculative result is discarded and regenerated with the complete input. You pay a cost in wasted compute, but the perceived latency improvement is dramatic.
This is the same principle behind speculative execution in CPUs and speculative decoding in LLMs. Bet on the likely outcome. Pay the cheap cost of being wrong occasionally to gain the expensive benefit of being right most of the time.
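The mechanic behind preemptive generation can be sketched with an `AbortController`: start inference on every partial transcript, and cancel the in-flight speculation whenever a newer transcript supersedes it. The `generate` function here is a stand-in for an LLM call, not the SDK's implementation:

```typescript
// Sketch of speculative generation with cancellation. `generate` fakes
// an LLM call with a timer; everything here is illustrative.
async function generate(prompt: string, signal: AbortSignal): Promise<string> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => resolve(`reply to: ${prompt}`), 50);
    signal.addEventListener("abort", () => {
      clearTimeout(timer);
      reject(new Error("superseded")); // wasted compute, by design
    });
  });
}

class SpeculativeResponder {
  private controller: AbortController | null = null;

  // Called on every partial or final transcript. The newest transcript
  // always wins; stale speculation is discarded.
  onTranscript(prompt: string): Promise<string> {
    this.controller?.abort();
    this.controller = new AbortController();
    return generate(prompt, this.controller.signal);
  }
}
```

If the turn detector later confirms end-of-turn, the speculative result is already hundreds of milliseconds old, and that head start is exactly the perceived latency win.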
Regional Deployment
Physics is non-negotiable. A round trip from Miami to a server in us-east-1 takes roughly 30ms. A round trip to eu-west-1 takes 120ms. For a pipeline that makes multiple sequential network calls, those extra 90ms per hop compound quickly.
Deploy your agent servers in the same region as your users, and co-locate them with your inference providers. LiveKit's cloud infrastructure helps here by routing through their global edge network, but your LLM and TTS endpoints matter just as much.
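The compounding is simple multiplication, but it is worth seeing the numbers. Using the rough RTT figures from above and assuming three sequential network calls (one each for STT, LLM, and TTS):

```typescript
// Illustrative: per-hop RTT compounds across sequential calls. The RTT
// figures are the rough examples from the text, not measurements.
const RTT_MS: Record<string, number> = {
  "us-east-1": 30,
  "eu-west-1": 120,
};

function pipelineNetworkCost(rttMs: number, sequentialCalls: number): number {
  return rttMs * sequentialCalls;
}

const near = pipelineNetworkCost(RTT_MS["us-east-1"], 3); // 90ms
const far = pipelineNetworkCost(RTT_MS["eu-west-1"], 3);  // 360ms
// A 90ms-per-hop difference becomes 270ms of pure network penalty --
// more than half the entire best-case pipeline budget.
```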
Connection Warmup
On celestino.ai, the LiveKit room connection is established the moment the user clicks the voice button, not when they start speaking. The token endpoint returns immediately:
```typescript
useEffect(() => {
  (async () => {
    const resp = await fetch(
      `/api/token?roomName=${targetRoom}&participantName=User`
    );
    const data = await resp.json();
    setToken(data.token);
    setUrl(data.url);
  })();
}, [roomId]);
```
By the time the user has granted microphone permissions and started talking, the WebRTC connection is already live, the agent process is already running, and the first audio frame can flow without setup delay.
The UX Layer: What Makes Voice Feel Right
Latency optimization is necessary but not sufficient. A voice interface that responds in 400ms but provides no feedback during those 400ms still feels broken. The UX layer is what bridges the gap between measured latency and perceived latency.
Visual Feedback During Processing
On celestino.ai, the interface provides continuous visual state through an animated orb that responds to audio in real time:
```typescript
const { state, audioTrack: agentAudioTrack } = useVoiceAssistant();
// States: 'listening' | 'thinking' | 'speaking' | 'idle'
```
When the user is speaking, the orb reacts to their voice amplitude. When the agent is thinking, it shifts to a processing animation. When the agent speaks, the orb syncs with the agent's audio output. There is never a moment where the interface appears frozen or unresponsive.
This is not decoration. It is functional communication. The visual feedback tells the user "I heard you, I am working on it" in the gap before audio begins, and that gap goes from feeling like dead air to feeling like a natural conversational pause.
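The state-to-visual mapping can be as small as a switch. The state names mirror the `useVoiceAssistant` states above; the animation names are hypothetical placeholders for whatever your renderer does:

```typescript
// Sketch: map assistant state plus live amplitude to an orb treatment.
// Animation names are hypothetical; only the state union comes from the
// hook above.
type AgentState = "listening" | "thinking" | "speaking" | "idle";

function orbAnimation(state: AgentState, userAmplitude: number): string {
  switch (state) {
    case "listening":
      return `pulse(${userAmplitude.toFixed(2)})`; // follows the user's voice
    case "thinking":
      return "shimmer"; // "I heard you, I am working on it"
    case "speaking":
      return "sync"; // follows the agent's audio output
    default:
      return "rest"; // idle
  }
}
```

The key property is that every state produces some motion: there is no branch that renders a static interface.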
Graceful Degradation
Not every user can use voice. Not every environment is appropriate for it. The celestino.ai interface supports a full text chat alongside voice, with messages synced between both modes via LiveKit data channels:
```typescript
room.on(RoomEvent.DataReceived, (payload: Uint8Array) => {
  const data = JSON.parse(new TextDecoder().decode(payload));
  if (data.type === 'chat_update' && data.message) {
    onMessageReceived(data.message);
  }
});
```
When the agent speaks, the transcript appears in the chat panel. When the user types, the text goes through the same LLM pipeline. The voice interface is an enhancement, not a requirement. This is reliability in practice: the system works well in ideal conditions and still works in degraded ones.
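The envelope carried over the data channel is symmetric on both sides. A sketch of the encode/decode pair, with the `chat_update` type and field names matching the handler above (the `role` values are an assumption about the message shape):

```typescript
// Sketch of the data-channel message envelope. The `chat_update` type
// mirrors the receive handler; the message fields are assumed.
interface ChatUpdate {
  type: "chat_update";
  message: { role: "user" | "assistant"; text: string };
}

function encodeUpdate(update: ChatUpdate): Uint8Array {
  return new TextEncoder().encode(JSON.stringify(update));
}

function decodeUpdate(payload: Uint8Array): ChatUpdate {
  return JSON.parse(new TextDecoder().decode(payload));
}
```

On the sending side, the encoded payload would go out through the room's participant publish API (in livekit-client, `room.localParticipant.publishData`), and the receive handler shown above decodes it identically.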
Handling Ambient Noise
Voice interfaces that work in quiet rooms are demos. Voice interfaces that work in coffee shops are products. Background noise cancellation runs at the input layer:
```typescript
const ncModule = await import("@livekit/noise-cancellation-node");
const noiseCancellation = ncModule.BackgroundVoiceCancellation();
```
Combined with the transcript filtering that ignores low-signal audio (stray sounds, non-English fragments, sub-two-character noise), the agent maintains conversational coherence even in imperfect acoustic environments. This is what I mean by hardened AI: systems that perform reliably in the conditions real users actually encounter.
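The transcript filter itself is small. A sketch of the heuristics described (the exact rules in production may differ; these are illustrative):

```typescript
// Sketch of low-signal transcript filtering: drop sub-two-character
// fragments and non-speech noise before they reach the LLM. The
// heuristics are illustrative, not the exact production rules.
function isUsableTranscript(text: string): boolean {
  const trimmed = text.trim();
  if (trimmed.length < 2) return false;        // stray sounds, single chars
  if (!/[a-zA-Z]/.test(trimmed)) return false; // punctuation-only noise
  return true;
}
```

Filtering at the transcript layer rather than the audio layer is the second line of defense: whatever noise slips past cancellation gets one more chance to be dropped before it pollutes the conversation context.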
Celestino.ai: The Living Case Study
The voice agent on celestino.ai ties all of these ideas together. It is a conversational AI that knows about my work, my projects, and my perspective, powered by a RAG pipeline that retrieves relevant context from a Supabase vector store before generating each response.
The architecture: LiveKit Agents SDK running a TypeScript agent process. ElevenLabs for both STT and TTS. Gemini 2.5 Flash for inference, specifically chosen for its low time-to-first-token in voice mode. Silero VAD plus LiveKit's transformer-based turn detector. Preemptive generation enabled. Noise cancellation active. Chat history persisted to Supabase so conversations survive reconnection.
The frontend: LiveKit React components handling the WebRTC connection, a Web Audio API-based analyzer driving the visual feedback, and a dual-mode interface that supports both voice and text seamlessly.
The result is a voice interface that typically responds in under a second. Not because any single component is uniquely fast, but because every component is chosen for speed, every stage streams into the next, and the UX layer masks whatever latency remains.
What to Measure
If you are building voice interfaces, here are the metrics that matter:
- Time-to-First-Byte (TTFB): How long from end-of-user-speech to the first audio byte of the response. Target: under 500ms.
- End-to-End Latency: Full round trip from user utterance to completed agent response. Target: under 1.5 seconds for most turns.
- Interruption Success Rate: When the user interrupts, how quickly does the agent stop? Target: under 300ms.
- Turn Detection Accuracy: How often does the system correctly identify end-of-turn versus mid-utterance pause? Track false positives (cutting user off) and false negatives (unnecessary silence).
- Fallback Rate: How often do users switch from voice to text mid-session? A high rate signals UX or reliability problems.
Measure these in production, not in controlled tests. The gap between lab conditions and real-world acoustic environments is where voice interfaces fail.
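A minimal sketch of tracking the first metric in production: timestamp end-of-user-speech, timestamp the first agent audio frame, and report the percentile. The event hooks here are hypothetical stand-ins for whatever your pipeline emits:

```typescript
// Sketch of production TTFB tracking. Hook names are hypothetical;
// wire them to your pipeline's end-of-turn and first-audio events.
class TtfbTracker {
  private endOfSpeechAt: number | null = null;
  readonly samplesMs: number[] = [];

  onEndOfUserSpeech(now: number = Date.now()): void {
    this.endOfSpeechAt = now;
  }

  onFirstAgentAudio(now: number = Date.now()): void {
    if (this.endOfSpeechAt === null) return; // no turn in flight
    this.samplesMs.push(now - this.endOfSpeechAt);
    this.endOfSpeechAt = null;
  }

  // Track the tail, not the mean: averages hide the slow turns that
  // users actually remember.
  p95(): number {
    const sorted = [...this.samplesMs].sort((a, b) => a - b);
    if (sorted.length === 0) return 0;
    const idx = Math.min(sorted.length - 1, Math.floor(sorted.length * 0.95));
    return sorted[idx];
  }
}
```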
The Takeaway
Building voice interfaces that feel instant is not about finding one silver bullet optimization. It is a systems problem. You need the right transport layer (WebRTC), the right pipeline architecture (streaming cascaded or speech-to-speech), the right turn detection (semantic, not just acoustic), aggressive speculation (preemptive generation), and a UX layer that turns measured latency into perceived responsiveness.
The voice agent on celestino.ai is my proof-of-concept that this is achievable today, with production-grade open source tooling, without a research team or custom ASICs. The infrastructure is here. The question is no longer "can we build voice interfaces that feel instant?" It is "are we willing to do the systems work to make them reliable?"
I think the answer matters. Voice is the most natural human interface. When it works, it disappears. When it does not, nothing else about your product matters. Build it right or do not build it at all.
