Building Voice Interfaces That Feel Instant
A two-second delay in a voice interface does not feel slow.
It feels broken. The user does not think "this is loading."
They think "this is not working." And then they leave.
I learned this building the voice agent for
celestino.ai. My first pipeline clocked in at 1.8 seconds end-to-end.
On paper, reasonable. In practice, every single test user
paused, repeated themselves, or started talking over the
response. Technically functional. Experientially dead.
This post is what I learned fixing it -- the systems thinking
behind why voice latency is a fundamentally different problem
than page load time, and what it takes to build voice
interfaces that feel like conversation.
The Latency Budget: Where Every Millisecond Goes
Human conversation operates on tight timing.
Psycholinguistic research shows that the average gap between
conversational turns is roughly 200 milliseconds. Pauses
as short as 300ms feel unnatural. Beyond 1.5 seconds, users
repeat themselves, assume the system failed, or disengage.
Now consider what a traditional voice AI pipeline must do in
that window:
| Stage | What It Does | Budget |
|-------|-------------|--------|
| STT | Capture audio, run speech recognition | 100-500ms |
| LLM Inference | Generate a response from transcript | 350ms-1s+ |
| TTS | Convert text back to audio | 75-200ms |
Add those up: 525ms in the best case, 1.7 seconds at the
high end. That does not include network hops, queuing, or the
silence detection that determines when the user has finished
speaking.
In practice, a naive cascading pipeline lands between 2 and
4 seconds. That is not a voice interface. That is a
walkie-talkie.
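For concreteness, the table's arithmetic can be spelled out in a few lines (values copied from the table, in milliseconds):

```typescript
// Per-stage latency budget from the table above, in milliseconds.
const budget = [
  { stage: "STT", min: 100, max: 500 },
  { stage: "LLM", min: 350, max: 1000 },
  { stage: "TTS", min: 75, max: 200 },
];

// Sequential pipeline: stage latencies add up end-to-end.
const bestCase = budget.reduce((sum, s) => sum + s.min, 0);  // 525ms
const worstCase = budget.reduce((sum, s) => sum + s.max, 0); // 1700ms
```

And that is before a single network hop is counted.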
The latency budget for voice is not about making each
component faster in isolation. It is about rethinking the
pipeline so that stages overlap, predictions run ahead of
certainty, and the user never perceives a gap.
The WebRTC Shift
For years, voice AI meant server-side processing. Audio goes
up to a server, gets transcribed, processed, synthesized, and
comes back down. Every step adds a network hop, and every hop
adds latency.
WebRTC changes this. Originally designed for peer-to-peer
video calling, WebRTC provides a battle-tested transport layer
running over UDP with built-in congestion control, packet loss
concealment, and adaptive bitrate. When OpenAI launched WebRTC
support for their Realtime API in late 2024, it eliminated
the architectural middleman.
The difference is stark. Traditional WebSocket architecture:
```
Client -> Your Backend -> OpenAI -> Your Backend -> Client
```
Every message traverses your server. Double-hop penalty. With
WebRTC, the client connects directly to the media edge:
```
Client -> OpenAI Media Edge -> Client
```
First partial text responses arrive in 150-250ms. First
audible synthesized phonemes in 220-400ms. That is
conversation-speed.
But there is a deeper shift here. The OpenAI Realtime API is
not just a faster pipe. It is a speech-to-speech model.
It does not decompose voice into text, reason over text, and
recompose text into voice. It operates on audio natively,
sidestepping the entire ASR-to-LLM-to-TTS chain. The latency
reduction is not incremental. It is structural.
The trade-off? Cost and control. Speech-to-speech models
charge roughly 10x what a cascading pipeline costs, partly
because the model re-processes accumulated context on every
turn. And when something breaks, you cannot inspect a
transcript or debug LLM reasoning. The pipeline is opaque.
For production voice agents where you need observability,
tool use, and cost efficiency, the cascading pipeline is still
the architecture to beat. You just have to make it fast.
LiveKit Voice Pipelines: Production Architecture
When I rebuilt the voice agent for celestino.ai, I chose
LiveKit's Agents SDK. The reason was pragmatic: LiveKit gives
you the cascading pipeline architecture with WebRTC transport
advantages, plus production-grade abstractions for the hard
problems -- turn detection, interruption handling, and
streaming orchestration.
Here is the core inference stack:
```typescript
const stt = new inference.STT({
  model: "elevenlabs/scribe_v2_realtime",
  language: "en",
});

const llmModel = new inference.LLM({
  model: "google/gemini-2.5-flash",
});

const tts = new inference.TTS({
  model: "elevenlabs/eleven_flash_v2_5",
  voice: "cjVigY5qzO86Huf0OWal",
  language: "en",
});
```
Each component is chosen for speed at its stage.
ElevenLabs Scribe v2 is a streaming-first ASR model. Gemini
2.5 Flash is optimized for low time-to-first-token. ElevenLabs
Flash v2.5 is a TTS model built specifically for realtime
synthesis. You do not win on latency by picking the most
accurate model at each stage. You win by picking the
fastest model that clears your quality threshold.
Turn Detection: The Hardest Problem
The most latency-sensitive decision in a voice pipeline is
not inference speed. It is knowing when the user has stopped
talking.
Get it wrong in one direction: you cut the user off mid-
sentence. Wrong in the other: you add hundreds of
milliseconds of dead air after every utterance. Both feel
terrible.
The naive approach is pure Voice Activity Detection (VAD):
once silence exceeds a threshold, trigger the response. But
humans pause mid-thought. They hesitate. They take a breath
before the second half of a compound sentence. VAD alone
cannot distinguish "I am done talking" from "I am thinking
about what to say next."
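The naive approach looks something like this (a minimal sketch, with illustrative thresholds rather than tuned values):

```typescript
// Pure energy-based endpointing: a run of quiet frames triggers the turn.
// Both thresholds are illustrative assumptions, not production values.
const SILENCE_RMS = 0.01;   // below this RMS, a frame counts as silence
const END_OF_TURN_MS = 700; // this much continuous silence ends the turn

function detectEndOfTurn(frameRms: number[], frameMs: number): boolean {
  let silenceMs = 0;
  for (const rms of frameRms) {
    // Any loud frame resets the silence counter -- which is exactly why
    // a mid-thought pause gets mistaken for the end of the utterance.
    silenceMs = rms < SILENCE_RMS ? silenceMs + frameMs : 0;
    if (silenceMs >= END_OF_TURN_MS) return true;
  }
  return false;
}
```

The failure mode is built in: the function cannot tell a hesitation from a full stop.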
For celestino.ai, I layer two systems. First, Silero VAD
provides raw voice activity signals:
```typescript
const silero = await import("@livekit/agents-plugin-silero");
const vad = await silero.VAD.load();
```
Then, a transformer-based turn detector adds semantic
understanding on top. LiveKit's multilingual turn detector
evaluates whether a transcript fragment represents a
completed thought:
```typescript
const livekitPlugin = await import("@livekit/agents-plugin-livekit");
const turnDetection = new livekitPlugin.turnDetector.MultilingualModel();
```
The turn detector runs inference in roughly 50ms -- fast
enough to operate in the gap between VAD detecting silence
and the system committing to a response. Combined, these two
layers let the agent distinguish between a pause and a period.
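The combination can be sketched as a single decision function (hypothetical logic and thresholds, not LiveKit's internals): VAD supplies the silence duration, the semantic model supplies an end-of-turn probability, and a completed thought commits faster than an ambiguous pause.

```typescript
// Hypothetical two-layer turn decision. Thresholds are illustrative.
function shouldCommitTurn(silenceMs: number, endOfTurnProb: number): boolean {
  if (silenceMs < 200) return false;      // still within a natural gap
  if (endOfTurnProb >= 0.85) return true; // semantically complete: respond now
  return silenceMs >= 1000;               // hedging pause: wait for a long timeout
}
```

The payoff is asymmetric latency: "What's the weather today?" commits after ~200ms of silence, while "What's the weather in..." waits out the full timeout.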
Interruption Handling: Respecting the User
Real conversations involve interruptions. If a user starts
speaking while the agent is mid-response, the agent needs to
stop immediately -- not finish its sentence and then listen.
The voice options make this explicit:
```typescript
const session = new voice.AgentSession({
  stt,
  llm: llmModel,
  tts,
  vad,
  turnDetection,
  voiceOptions: {
    minEndpointingDelay: 1000,
    maxEndpointingDelay: 5000,
    minInterruptionDuration: 800,
    minInterruptionWords: 2,
    preemptiveGeneration: true,
  },
});
```
The minInterruptionDuration of 800ms and
minInterruptionWords of 2 prevent false interruptions from
background noise or brief "uh-huh" acknowledgments. But when
a genuine interruption comes, the agent yields immediately.
Human-centric design: the system should never talk over the
user.
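The gating logic those two thresholds imply is simple (a hypothetical sketch, not the SDK's implementation): overlapping speech only counts as an interruption once it is both long enough and wordy enough.

```typescript
// Hypothetical interruption gate mirroring minInterruptionDuration (800ms)
// and minInterruptionWords (2) from the session config above.
function isGenuineInterruption(speechMs: number, wordCount: number): boolean {
  return speechMs >= 800 && wordCount >= 2;
}
```

A cough fails the word check, "uh-huh" fails both, but "wait, actually" held for a second stops the agent cold.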
Optimization Techniques That Actually Matter
Beyond architecture choices, specific techniques shave
critical milliseconds from the pipeline.
Streaming Everything
The single biggest optimization: never wait for a complete
result before starting the next stage. Streaming ASR feeds
partial transcripts to the LLM. The LLM streams tokens to
TTS. TTS streams audio chunks to the client. Each stage
begins before the previous one ends. Switching any single
component to batch processing can double your end-to-end
latency.
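The overlap is easiest to see with lazy generators as toy stand-ins for the real STT/LLM/TTS streams: pulling the first audio chunk consumes only the first partial transcript, so synthesis starts before recognition finishes.

```typescript
// Toy pipeline: each stage yields as soon as it has a partial result.
// These are illustrative stand-ins, not real inference calls.
function* sttPartials(words: string[]): Generator<string> {
  let transcript = "";
  for (const w of words) {
    transcript = transcript ? `${transcript} ${w}` : w;
    yield transcript; // partial transcript after each word
  }
}
function* llmTokens(partials: Iterable<string>): Generator<string> {
  for (const p of partials) yield `reply-to:${p}`; // a token per partial
}
function* ttsChunks(tokens: Iterable<string>): Generator<string> {
  for (const t of tokens) yield `audio[${t}]`; // an audio chunk per token
}

const pipeline = ttsChunks(llmTokens(sttPartials(["hello", "there", "agent"])));
// The first chunk exists after only "hello" has been recognized:
const firstChunk = pipeline.next().value;
```

Because generators pull lazily, the first audio chunk is produced before the later words are ever consumed -- the same shape a real streaming pipeline has, with network streams in place of arrays.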
Speculative Prefetch (Preemptive Generation)
Notice the preemptiveGeneration: true flag in the session
config -- one of the most impactful optimizations available.
When enabled, the agent begins LLM and TTS inference as soon
as a user transcript arrives, before the turn detector has
confirmed the user is done speaking.
If the user was done, you have saved hundreds of
milliseconds. If the user continues, the speculative result
is discarded and regenerated with the complete input. You pay
a cost in wasted compute, but the perceived latency
improvement is dramatic.
This is the same principle behind speculative execution in
CPUs and speculative decoding in LLMs: bet on the likely
outcome, and pay the cheap cost of being wrong occasionally to
gain the expensive benefit of being right most of the time.
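The bookkeeping can be sketched as a small class (hypothetical, not the LiveKit API): each new transcript starts a cancellable generation, and a later transcript aborts the in-flight one.

```typescript
// Hypothetical speculative-generation bookkeeping. A generation started on
// a partial transcript is aborted if the user keeps talking.
class SpeculativeTurn {
  private pending: { transcript: string; controller: AbortController } | null = null;
  started = 0;
  discarded = 0;

  onTranscript(
    transcript: string,
    generate: (text: string, signal: AbortSignal) => void
  ): void {
    if (this.pending) {
      // User kept talking: the speculative result is wasted compute.
      this.pending.controller.abort();
      this.discarded++;
    }
    const controller = new AbortController();
    this.pending = { transcript, controller };
    this.started++;
    generate(transcript, controller.signal);
  }

  onTurnCommitted(): void {
    // Turn detector confirmed end of turn: keep the in-flight generation.
    this.pending = null;
  }
}
```

If the bet pays off, the response was already streaming when the turn committed; if not, only compute was lost, never correctness.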
Regional Deployment
Physics is non-negotiable. A round trip from Miami to
us-east-1 takes ~30ms. A round trip to eu-west-1 takes
~120ms. For a pipeline making multiple sequential network
calls, those extra 90ms per hop compound fast.
Deploy your agent servers in the same region as your users,
co-located with your inference providers. LiveKit's cloud
infrastructure routes through their global edge network, but
your LLM and TTS endpoints matter just as much.
Connection Warmup
On celestino.ai, the LiveKit room connection is established
the moment the user clicks the voice button -- not when they
start speaking:
```typescript
useEffect(() => {
  (async () => {
    const resp = await fetch(
      `/api/token?roomName=${targetRoom}&participantName=User`
    );
    const data = await resp.json();
    setToken(data.token);
    setUrl(data.url);
  })();
}, [targetRoom]);
```
By the time the user grants microphone permissions and starts
talking, the WebRTC connection is already live, the agent
process is already running, and the first audio frame flows
without setup delay.
The UX Layer: Perceived vs. Measured Latency
Latency optimization is necessary but not sufficient. A voice
interface that responds in 400ms but provides no feedback
during those 400ms still feels broken. The UX layer bridges
measured latency and perceived latency.
Visual Feedback During Processing
On celestino.ai, the interface provides continuous visual
state through an animated orb that responds to audio in real
time:
```typescript
const { state, audioTrack: agentAudioTrack } = useVoiceAssistant();
// States: 'listening' | 'thinking' | 'speaking' | 'idle'
```
When the user speaks, the orb reacts to their voice
amplitude. When the agent thinks, it shifts to a processing
animation. When the agent speaks, the orb syncs with its
audio output. There is never a moment where the interface
appears frozen.
This is not decoration. It is functional communication. The
visual feedback tells the user "I heard you, I am working on
it" in the gap before audio begins. That gap goes from
feeling like dead air to feeling like a natural
conversational pause.
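The amplitude that drives the orb reduces to one pure function (a minimal sketch; the real frontend reads samples from a Web Audio API analyser node):

```typescript
// RMS amplitude of one frame of PCM samples in [-1, 1], used to scale
// the orb while the user or the agent is speaking.
function rmsAmplitude(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumOfSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumOfSquares += samples[i] * samples[i];
  }
  return Math.sqrt(sumOfSquares / samples.length);
}
```

Silence maps to 0, a full-scale tone to 1, so the value feeds directly into a CSS scale transform.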
Graceful Degradation
Not every user can use voice. Not every environment is
appropriate for it. The celestino.ai interface supports full
text chat alongside voice, with messages synced between both
modes via LiveKit data channels:
```typescript
room.on(RoomEvent.DataReceived, (payload: Uint8Array) => {
  const data = JSON.parse(new TextDecoder().decode(payload));
  if (data.type === 'chat_update' && data.message) {
    onMessageReceived(data.message);
  }
});
```
When the agent speaks, the transcript appears in the chat
panel. When the user types, the text goes through the same
LLM pipeline. The voice interface is an enhancement, not a
requirement. Reliability in practice: the system works well
in ideal conditions and still works in degraded ones.
Handling Ambient Noise
Voice interfaces that work in quiet rooms are demos. Voice
interfaces that work in coffee shops are products. Noise
cancellation runs at the input layer:
```typescript
const ncModule = await import("@livekit/noise-cancellation-node");
const noiseCancellation = ncModule.BackgroundVoiceCancellation();
```
Combined with transcript filtering that ignores low-signal
audio (stray sounds, non-English fragments, sub-two-character
noise), the agent maintains conversational coherence even in
imperfect acoustic environments -- the conditions real users
actually encounter.
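The transcript filter reduces to a small predicate (an illustrative sketch with hypothetical thresholds; language detection is a separate, heavier check):

```typescript
// Hypothetical transcript filter mirroring the rules described above:
// drop fragments that are too short or carry no alphabetic signal.
function isUsableTranscript(text: string): boolean {
  const trimmed = text.trim();
  if (trimmed.length < 2) return false;        // sub-two-character noise
  if (!/[a-zA-Z]/.test(trimmed)) return false; // no letters: stray sounds
  return true;
}
```

Rejected fragments never reach the LLM, so a chair scrape or a cough cannot derail the conversation state.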
Celestino.ai: The Living Case Study
The voice agent on celestino.ai ties all of these ideas
together: a conversational AI powered by a RAG pipeline that
retrieves relevant context from a Supabase vector store
before generating each response.
The stack:
- Transport: LiveKit Agents SDK, TypeScript agent process
- STT: ElevenLabs Scribe v2 (streaming-first ASR)
- LLM: Gemini 2.5 Flash (low time-to-first-token)
- TTS: ElevenLabs Flash v2.5 (realtime synthesis)
- Turn detection: Silero VAD + LiveKit transformer model
- Optimizations: Preemptive generation, noise
cancellation, Supabase chat history persistence
- Frontend: LiveKit React components, Web Audio API
analyzer, dual-mode voice + text interface
The result: a voice interface that typically responds in
under a second. Not because any single component is
uniquely fast, but because every component is chosen for
speed, every stage streams into the next, and the UX layer
masks whatever latency remains.
What to Measure
If you are building voice interfaces, track these metrics:
| Metric | What It Measures | Target |
|--------|-----------------|--------|
| TTFB | End-of-speech to first audio byte | Under 500ms |
| End-to-End Latency | Full round trip | Under 1.5s |
| Interruption Response | Time to stop on interruption | Under 300ms |
| Turn Detection Accuracy | End-of-turn vs. mid-pause | Track FP and FN rates |
| Fallback Rate | Voice-to-text switches mid-session | Lower is better |
Measure these in production, not in controlled tests. The gap
between lab conditions and real-world acoustic environments is
where voice interfaces fail.
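Tracking the first metric is mostly bookkeeping. A minimal sketch (hypothetical helper, with a deliberately rough percentile): timestamp the end of user speech, timestamp the first audio byte, record the difference.

```typescript
// Hypothetical production TTFB tracker: pairs an end-of-speech mark with
// the next first-audio-byte mark and records the gap in milliseconds.
class TtfbTracker {
  private endOfSpeechAt: number | null = null;
  samples: number[] = [];

  markEndOfSpeech(nowMs: number): void {
    this.endOfSpeechAt = nowMs;
  }

  markFirstAudioByte(nowMs: number): void {
    if (this.endOfSpeechAt === null) return; // no open turn: ignore
    this.samples.push(nowMs - this.endOfSpeechAt);
    this.endOfSpeechAt = null;
  }

  // Rough p95 by sorting; fine for per-session sample counts.
  p95(): number {
    const sorted = [...this.samples].sort((a, b) => a - b);
    if (sorted.length === 0) return 0;
    return sorted[Math.min(sorted.length - 1, Math.floor(sorted.length * 0.95))];
  }
}
```

Track the p95, not the mean: one four-second outlier per session is what users remember.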
The Takeaway
Building voice interfaces that feel instant is not about one
optimization. It is a systems problem:
- Transport: WebRTC, not WebSockets
- Pipeline: Streaming cascaded (or speech-to-speech if
budget allows)
- Turn detection: Semantic, not just acoustic
- Speculation: Preemptive generation -- bet on the likely
outcome
- UX: Visual feedback that turns measured latency into
perceived responsiveness
The voice agent on celestino.ai is proof that this is
achievable today, with production-grade open source tooling,
without a research team or custom ASICs. The question is no
longer "can we build voice interfaces that feel instant?" It
is "are we willing to do the systems work to make them
reliable?"
Voice is the most natural human interface. When it works, it
disappears. When it does not, nothing else about your product
matters.
Talk to my AI and experience the sub-second response yourself.