A team built a voice agent by routing audio through their server: browser microphone to server via WebSocket, server calls a speech-to-text API, sends the transcript to an LLM, sends the response to a text-to-speech API, then streams audio back to the browser over the same WebSocket. Total round-trip: 2.8 seconds. The user asked a simple question and waited nearly 3 seconds in silence before hearing a response. It felt like talking to someone on a satellite phone.
The architecture was the problem. Every audio frame made a round trip through the server. TCP's guaranteed delivery meant dropped packets caused head-of-line blocking. The three-hop pipeline (STT, LLM, TTS) serialized latency instead of overlapping it. WebRTC and the OpenAI Realtime API solve this by eliminating the server proxy entirely -- audio goes directly from the browser to OpenAI's media edge over UDP.
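A quick back-of-envelope makes the serialization problem concrete. The per-hop figures below are illustrative assumptions, not measurements, chosen to roughly match the 2.8-second round trip described above:

```typescript
// Illustrative only: hypothetical per-hop latencies in ms, not benchmarks.
const hops = { stt: 800, llm: 1200, tts: 800 };

// Serialized pipeline: each stage waits for the previous one to finish,
// so the user experiences the sum of all three.
const serializedMs = Object.values(hops).reduce((sum, ms) => sum + ms, 0);
// serializedMs -> 2800

// With streaming and overlap, total latency approaches the slowest stage
// (plus per-stage startup cost) rather than the sum.
const overlappedFloorMs = Math.max(...Object.values(hops));
// overlappedFloorMs -> 1200
```

The gap between the sum and the max is the latency a naive proxy architecture throws away.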
WebRTC (Web Real-Time Communication) is the protocol that powers video calls in your browser. It uses UDP, which means packets arrive as fast as the network allows -- no waiting for TCP retransmission, no HTTP overhead, no buffering.
For voice AI, this means each audio frame is delivered as soon as it arrives: a lost packet is simply skipped rather than retransmitted, so a momentary glitch never stalls the rest of the stream.
The alternative -- WebSockets over TCP -- adds overhead from guaranteed delivery, head-of-line blocking, and the extra hop through your server. For real-time audio, that overhead is the difference between "instant" and "broken."
The OpenAI Realtime API uses a "Control Plane" architecture. Your server does not proxy audio. It authenticates the session and hands the client a short-lived token to connect directly.
+------------+  1. Request token       +--------------+
|            | ----------------------> |  Your Server |
|            | <---------------------- |  (API key    |
|  Browser   |  2. Ephemeral key       |   stored     |
|            |                         |   server-    |
|            |                         |   side)      |
|            |                         +--------------+
|            |
|            |  3. WebRTC connect      +--------------+
|            | ----------------------> |   OpenAI     |
|            | <======================>|   Realtime   |
|            |  4. Bidirectional       |   Media Edge |
+------------+     audio stream        +--------------+
The critical insight: your API key never touches the browser. The server mints an ephemeral key with limited scope and expiration, sends it to the client, and the client uses it to establish the WebRTC peer connection directly with OpenAI.
The server endpoint is lightweight. Its only job is authentication and token minting.
// app/api/realtime/token/route.ts
import { NextResponse } from 'next/server';

export async function POST(request: Request) {
  const { model, voice, instructions } = await request.json();

  // Mint an ephemeral session with the OpenAI REST API
  const response = await fetch('https://api.openai.com/v1/realtime/sessions', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: model || 'gpt-4o-realtime-preview',
      voice: voice || 'verse',
      instructions: instructions || 'You are a helpful assistant.',
      input_audio_transcription: {
        model: 'whisper-1',
      },
      tools: [
        {
          type: 'function',
          name: 'search_knowledge',
          description: 'Search the knowledge base',
          parameters: {
            type: 'object',
            properties: {
              query: { type: 'string' },
            },
            required: ['query'],
          },
        },
      ],
    }),
  });

  if (!response.ok) {
    // Never forward OpenAI error details to the client verbatim
    return NextResponse.json({ error: 'Session mint failed' }, { status: 500 });
  }

  const session = await response.json();

  // session.client_secret.value is the ephemeral key
  return NextResponse.json({
    ephemeralKey: session.client_secret.value,
  });
}
The ephemeral key expires after a short window. This is the security model -- even if it leaks, the blast radius is limited to a single session.
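If you cache the ephemeral key on the client at all, guard against reusing a stale one. The sessions response includes an expiry timestamp alongside the key (`client_secret.expires_at`, a Unix timestamp in seconds; verify the exact field name against the current API reference). A minimal sketch, with `isKeyFresh` being a hypothetical helper name:

```typescript
// Sketch: decide whether a cached ephemeral key is still usable.
// expiresAtSec is assumed to be a Unix timestamp in seconds, as returned
// in client_secret.expires_at; marginSec leaves room for connection setup.
function isKeyFresh(expiresAtSec: number, marginSec = 5): boolean {
  return expiresAtSec - marginSec > Date.now() / 1000;
}
```

If the key is stale, request a fresh one from your token endpoint before attempting the WebRTC handshake.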
The client creates a WebRTC peer connection and connects using the ephemeral key. This is the bare-metal integration:
// hooks/useRealtimeVoice.ts
'use client';

import { useRef, useState, useCallback } from 'react';

export function useRealtimeVoice() {
  const pcRef = useRef<RTCPeerConnection | null>(null);
  const streamRef = useRef<MediaStream | null>(null);
  const [isConnected, setIsConnected] = useState(false);
  const [transcript, setTranscript] = useState('');

  const connect = useCallback(async () => {
    // 1. Get ephemeral key from your server
    const tokenRes = await fetch('/api/realtime/token', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: 'gpt-4o-realtime-preview',
        voice: 'verse',
        instructions: 'You are a helpful voice assistant.',
      }),
    });
    const { ephemeralKey } = await tokenRes.json();

    // 2. Create WebRTC peer connection
    const pc = new RTCPeerConnection();
    pcRef.current = pc;

    // 3. Set up audio playback -- the remote track is the model's voice
    pc.ontrack = (event) => {
      const audio = new Audio();
      audio.srcObject = event.streams[0];
      audio.play();
    };

    // 4. Capture the user's microphone
    const stream = await navigator.mediaDevices.getUserMedia({
      audio: {
        echoCancellation: true,
        noiseSuppression: true,
        sampleRate: 24000,
      },
    });
    streamRef.current = stream;
    stream.getTracks().forEach((track) => {
      pc.addTrack(track, stream);
    });

    // 5. Create a data channel for events (transcripts, tool calls)
    const dc = pc.createDataChannel('oai-events');
    dc.onmessage = (event) => {
      const data = JSON.parse(event.data);
      if (data.type === 'response.audio_transcript.delta') {
        setTranscript((prev) => prev + data.delta);
      }
      if (data.type === 'conversation.item.input_audio_transcription.completed') {
        console.log('User said:', data.transcript);
      }
    };

    // 6. SDP offer/answer exchange
    const offer = await pc.createOffer();
    await pc.setLocalDescription(offer);

    const sdpResponse = await fetch(
      'https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview',
      {
        method: 'POST',
        headers: {
          'Authorization': `Bearer ${ephemeralKey}`,
          'Content-Type': 'application/sdp',
        },
        body: offer.sdp,
      }
    );
    const answerSdp = await sdpResponse.text();
    await pc.setRemoteDescription({
      type: 'answer',
      sdp: answerSdp,
    });

    setIsConnected(true);
  }, []);

  const disconnect = useCallback(() => {
    // Stop microphone tracks so the browser's capture indicator turns off
    streamRef.current?.getTracks().forEach((track) => track.stop());
    streamRef.current = null;
    pcRef.current?.close();
    pcRef.current = null;
    setIsConnected(false);
  }, []);

  return { connect, disconnect, isConnected, transcript };
}
In production, you will want to add: connection state monitoring (ICE gathering, connection failed), audio level visualization, graceful reconnection on network changes, and tool call handling via the data channel.
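For the reconnection piece, a bounded exponential backoff schedule is the usual pattern. A small sketch with a hypothetical helper name (`backoffDelays`) and illustrative defaults, not values prescribed by OpenAI:

```typescript
// Sketch: exponential backoff schedule for reconnection attempts.
// baseMs doubles each attempt and is capped at maxMs; both defaults
// are illustrative assumptions.
function backoffDelays(attempts: number, baseMs = 500, maxMs = 30_000): number[] {
  return Array.from({ length: attempts }, (_, i) =>
    Math.min(baseMs * 2 ** i, maxMs)
  );
}
// backoffDelays(4) -> [500, 1000, 2000, 4000]
```

Pair this with the browser's `online` event and `RTCPeerConnection.connectionState` so you only retry when the network is plausibly back.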
The WebRTC data channel carries structured events alongside the audio stream. This is how you receive transcripts, handle tool calls, and send configuration updates.
Key event types:
// Events you receive:
'response.audio_transcript.delta' // Partial model response text
'response.audio_transcript.done' // Complete model response text
'conversation.item.input_audio_transcription.completed' // User transcript
'response.function_call_arguments.done' // Tool call with arguments
// Events you send:
'conversation.item.create' // Inject context or tool results
'response.create' // Trigger a new response
'input_audio_buffer.clear' // Clear pending audio
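Since every message on the channel is a JSON object keyed by `type`, a small dispatcher keeps the `onmessage` handler from growing into a wall of `if` statements. This is a sketch, not an official SDK pattern; `makeDispatcher` is a hypothetical helper:

```typescript
// Sketch: route incoming data-channel messages by their "type" field.
type RealtimeEvent = { type: string } & Record<string, unknown>;

function makeDispatcher(
  handlers: Record<string, (event: RealtimeEvent) => void>
) {
  return (raw: string) => {
    const event = JSON.parse(raw) as RealtimeEvent;
    // Unhandled event types are silently ignored
    handlers[event.type]?.(event);
  };
}

// Usage: dc.onmessage = (e) => dispatch(e.data);
const dispatch = makeDispatcher({
  'response.audio_transcript.delta': (e) => {
    // append e.delta to the running transcript
  },
});
```

Adding support for a new event type then means adding one entry to the handler map rather than another branch in the handler.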
Handling tool calls requires a round-trip through the data channel:
dc.onmessage = (event) => {
  const data = JSON.parse(event.data);

  if (data.type === 'response.function_call_arguments.done') {
    const { call_id, name, arguments: args } = data;
    const parsedArgs = JSON.parse(args);

    // Execute the tool
    executeToolCall(name, parsedArgs).then((result) => {
      // Send the result back via the data channel
      dc.send(JSON.stringify({
        type: 'conversation.item.create',
        item: {
          type: 'function_call_output',
          call_id,
          output: JSON.stringify(result),
        },
      }));

      // Trigger the model to continue responding
      dc.send(JSON.stringify({ type: 'response.create' }));
    });
  }
};
Notice the conversation statefulness here. The call_id links the tool result back to the specific invocation. The response.create event tells the model to continue -- without it, the model waits indefinitely after the tool result. This is conversation state management at the protocol level.
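Because the two outgoing messages always travel together, it can help to build them in one place. A minimal sketch, with `toolResultMessages` as a hypothetical helper whose payload shapes mirror the handler above:

```typescript
// Sketch: build the two data-channel messages that complete a tool call.
// Returns serialized strings ready for dc.send().
function toolResultMessages(callId: string, result: unknown): string[] {
  return [
    JSON.stringify({
      type: 'conversation.item.create',
      item: {
        type: 'function_call_output',
        call_id: callId, // links the result to the original invocation
        output: JSON.stringify(result),
      },
    }),
    // Without this, the model waits indefinitely after the tool result
    JSON.stringify({ type: 'response.create' }),
  ];
}
```

Sending them as a pair makes it hard to forget the `response.create` that unblocks the model.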
| Factor | WebRTC | WebSocket |
|--------|--------|-----------|
| Client-side (browser) | Best choice | Acceptable |
| Server-side (phone/SIP) | Not applicable | Required |
| Latency | Lower (UDP, direct) | Higher (TCP, proxied) |
| Audio quality control | Built-in (SRTP, DTLS) | Manual |
| Network resilience | ICE, TURN fallback | Reconnect logic needed |
| Implementation complexity | Higher | Lower |
| Security model | Ephemeral keys | Standard API keys |
Use WebRTC when the user is in a browser or mobile app and latency matters. Use WebSocket when the agent runs server-side -- handling phone calls, SIP integrations, or batch processing.
The OpenAI Realtime API is a speech-to-speech model. Audio goes in, audio comes out, all from one model. This is fundamentally different from the pipeline approach (STT + LLM + TTS) covered in the next lesson.
| Dimension | Speech-to-Speech | Pipeline (STT + LLM + TTS) |
|-----------|-----------------|---------------------------|
| Latency | Lower (one model, one hop) | Higher (three hops, but overlappable) |
| Prosody | Better (model "hears" tone) | Depends on TTS quality |
| Architecture | Simpler | More moving parts |
| Model flexibility | OpenAI only | Mix any providers |
| Per-stage control | None | Full (swap STT, LLM, TTS independently) |
| Voice options | Limited | Extensive (dedicated TTS providers) |
| Cost | Higher per minute | Lower with careful provider selection |
For celestino.ai, I chose the pipeline approach (LiveKit + Gemini + ElevenLabs) because I wanted control over each stage. But if you want the fastest path to a working voice agent and are comfortable with OpenAI pricing, the Realtime API is genuinely impressive.
Build a minimal WebRTC voice agent:
1. Create an RTCPeerConnection, capture the microphone, and complete the SDP offer/answer exchange.
2. Handle response.audio_transcript.delta events and display the transcript in real time.

The Realtime API gives you a single model that does everything. But what if you want to choose your own STT, your own LLM, and your own TTS? Next, we cover LiveKit Voice Pipelines -- building modular voice agents where every stage is independently swappable, tunable, and measurable.