A team built a voice agent by routing audio through their server: browser microphone to server via WebSocket, server calls a speech-to-text API, sends the transcript to an LLM, sends the response to a text-to-speech API, then streams audio back to the browser over the same WebSocket. Total round-trip: 2.8 seconds. The user asked a simple question and waited nearly 3 seconds in silence before hearing a response. It felt like talking to someone on a satellite phone.
The architecture was the problem. Every audio frame made a round trip through the server. TCP's guaranteed delivery meant dropped packets caused head-of-line blocking. The three-hop pipeline (STT, LLM, TTS) serialized latency instead of overlapping it. WebRTC and the OpenAI Realtime API solve this by eliminating the server proxy entirely -- audio goes directly from the browser to OpenAI's media edge over UDP.
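A quick back-of-envelope makes the serialization problem concrete. The per-hop figures below are illustrative assumptions, not measurements, chosen to roughly match the 2.8-second round trip described above:

```typescript
// Illustrative only: hypothetical per-hop latencies in ms, not benchmarks.
const hops = { stt: 800, llm: 1200, tts: 800 };

// Serialized pipeline: each stage waits for the previous one to finish,
// so the user experiences the sum of all three.
const serializedMs = Object.values(hops).reduce((sum, ms) => sum + ms, 0);
// serializedMs -> 2800

// With streaming and overlap, total latency approaches the slowest stage
// (plus per-stage startup cost) rather than the sum.
const overlappedFloorMs = Math.max(...Object.values(hops));
// overlappedFloorMs -> 1200
```

The gap between the sum and the max is the latency a naive proxy architecture throws away.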
WebRTC (Web Real-Time Communication) is the protocol that powers video calls in your browser. It uses UDP, which means packets arrive as fast as the network allows -- no waiting for TCP retransmission, no HTTP overhead, no buffering.
For voice AI, this means each audio frame is delivered as soon as it arrives: a lost packet is simply skipped rather than retransmitted, so a momentary glitch never stalls the rest of the stream.
The alternative -- WebSockets over TCP -- adds overhead from guaranteed delivery, head-of-line blocking, and the extra hop through your server. For real-time audio, that overhead is the difference between "instant" and "broken."
The OpenAI Realtime API uses a "Control Plane" architecture. Your server does not proxy audio. It authenticates the session and hands the client a short-lived token to connect directly.
+------------+  1. Request token       +--------------+
|            | ----------------------> |  Your Server |
|            | <---------------------- |  (API key    |
|  Browser   |  2. Ephemeral key       |   stored     |
|            |                         |   server-    |
|            |                         |   side)      |
|            |                         +--------------+
|            |
|            |  3. WebRTC connect      +--------------+
|            | ----------------------> |   OpenAI     |
|            | <======================>|   Realtime   |
|            |  4. Bidirectional       |   Media Edge |
+------------+     audio stream        +--------------+
The critical insight: your API key never touches the browser. The server mints an ephemeral key with limited scope and expiration, sends it to the client, and the client uses it to establish the WebRTC peer connection directly with OpenAI.
The server endpoint is lightweight. Its only job is authentication and token minting.
// app/api/realtime/token/route.ts
import { NextResponse } from 'next/server';

export async function POST(request: Request) {
  const { model, voice, instructions } = await request.json();

  // Mint an ephemeral session with the OpenAI REST API
  const response = await fetch('https://api.openai.com/v1/realtime/sessions', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: model || 'gpt-4o-realtime-preview',
      voice: voice || 'verse',
      instructions: instructions || 'You are a helpful assistant.',
      input_audio_transcription: {
        model: 'whisper-1',
      },
      tools: [
        {
          type: 'function',
          name: 'search_knowledge',
          description: 'Search the knowledge base',
          parameters: {
            type: 'object',
            properties: {
              query: { type: 'string' },
            },
            required: ['query'],
          },
        },
      ],
    }),
  });

  if (!response.ok) {
    // Never forward OpenAI error details to the client verbatim
    return NextResponse.json({ error: 'Session mint failed' }, { status: 500 });
  }

  const session = await response.json();

  // session.client_secret.value is the ephemeral key
  return NextResponse.json({
    ephemeralKey: session.client_secret.value,
  });
}
The ephemeral key expires after a short window. This is the security model -- even if it leaks, the blast radius is limited to a single session.
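If you cache the ephemeral key on the client at all, guard against reusing a stale one. The sessions response includes an expiry timestamp alongside the key (`client_secret.expires_at`, a Unix timestamp in seconds; verify the exact field name against the current API reference). A minimal sketch, with `isKeyFresh` being a hypothetical helper name:

```typescript
// Sketch: decide whether a cached ephemeral key is still usable.
// expiresAtSec is assumed to be a Unix timestamp in seconds, as returned
// in client_secret.expires_at; marginSec leaves room for connection setup.
function isKeyFresh(expiresAtSec: number, marginSec = 5): boolean {
  return expiresAtSec - marginSec > Date.now() / 1000;
}
```

If the key is stale, request a fresh one from your token endpoint before attempting the WebRTC handshake.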
The client creates a WebRTC peer connection and connects using the ephemeral key. This is the bare-metal integration:
// hooks/useRealtimeVoice.ts
'use client';

import { useRef, useState, useCallback } from 'react';

export function useRealtimeVoice() {
  const pcRef = useRef<RTCPeerConnection | null>(null);
  const streamRef = useRef<MediaStream | null>(null);
  const [isConnected, setIsConnected] = useState(false);
  const [transcript, setTranscript] = useState('');

  const connect = useCallback(async () => {
    // 1. Get ephemeral key from your server
    const tokenRes = await fetch('/api/realtime/token', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: 'gpt-4o-realtime-preview',
        voice: 'verse',
        instructions: 'You are a helpful voice assistant.',
      }),
    });
    const { ephemeralKey } = await tokenRes.json();

    // 2. Create WebRTC peer connection
    const pc = new RTCPeerConnection();
    pcRef.current = pc;

    // 3. Set up audio playback -- the remote track is the model's voice
    pc.ontrack = (event) => {
      const audio = new Audio();
      audio.srcObject = event.streams[0];
      audio.play();
    };

    // 4. Capture the user's microphone
    const stream = await navigator.mediaDevices.getUserMedia({
      audio: {
        echoCancellation: true,
        noiseSuppression: true,
        sampleRate: 24000,
      },
    });
    streamRef.current = stream;
    stream.getTracks().forEach((track) => {
      pc.addTrack(track, stream);
    });

    // 5. Create a data channel for events (transcripts, tool calls)
    const dc = pc.createDataChannel('oai-events');
    dc.onmessage = (event) => {
      const data = JSON.parse(event.data);
      if (data.type === 'response.audio_transcript.delta') {
        setTranscript((prev) => prev + data.delta);
      }
      if (data.type === 'conversation.item.input_audio_transcription.completed') {
        console.log('User said:', data.transcript);
      }
    };

    // 6. SDP offer/answer exchange
    const offer = await pc.createOffer();
    await pc.setLocalDescription(offer);

    const sdpResponse = await fetch(
      'https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview',
      {
        method: 'POST',
        headers: {
          'Authorization': `Bearer ${ephemeralKey}`,
          'Content-Type': 'application/sdp',
        },
        body: offer.sdp,
      }
    );
    const answerSdp = await sdpResponse.text();
    await pc.setRemoteDescription({
      type: 'answer',
      sdp: answerSdp,
    });

    setIsConnected(true);
  }, []);

  const disconnect = useCallback(() => {
    // Stop microphone tracks so the browser's capture indicator turns off
    streamRef.current?.getTracks().forEach((track) => track.stop());
    streamRef.current = null;
    pcRef.current?.close();
    pcRef.current = null;
    setIsConnected(false);
  }, []);

  return { connect, disconnect, isConnected, transcript };
}
In production, you will want to add: connection state monitoring (ICE gathering, connection failed), audio level visualization, graceful reconnection on network changes, and tool call handling via the data channel.
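For the reconnection piece, a bounded exponential backoff schedule is the usual pattern. A small sketch with a hypothetical helper name (`backoffDelays`) and illustrative defaults, not values prescribed by OpenAI:

```typescript
// Sketch: exponential backoff schedule for reconnection attempts.
// baseMs doubles each attempt and is capped at maxMs; both defaults
// are illustrative assumptions.
function backoffDelays(attempts: number, baseMs = 500, maxMs = 30_000): number[] {
  return Array.from({ length: attempts }, (_, i) =>
    Math.min(baseMs * 2 ** i, maxMs)
  );
}
// backoffDelays(4) -> [500, 1000, 2000, 4000]
```

Pair this with the browser's `online` event and `RTCPeerConnection.connectionState` so you only retry when the network is plausibly back.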
The WebRTC data channel carries structured events alongside the audio stream. This is how you receive transcripts, handle tool calls, and send configuration updates.
Key event types:
// Events you receive:
'response.audio_transcript.delta' // Partial model response text
'response.audio_transcript.done' // Complete model response text
'conversation.item.input_audio_transcription.completed' // User transcript
'response.function_call_arguments.done' // Tool call with arguments
// Events you send:
'conversation.item.create' // Inject context or tool results
'response.create' // Trigger a new response
'input_audio_buffer.clear' // Clear pending audio
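Since every message on the channel is a JSON object keyed by `type`, a small dispatcher keeps the `onmessage` handler from growing into a wall of `if` statements. This is a sketch, not an official SDK pattern; `makeDispatcher` is a hypothetical helper:

```typescript
// Sketch: route incoming data-channel messages by their "type" field.
type RealtimeEvent = { type: string } & Record<string, unknown>;

function makeDispatcher(
  handlers: Record<string, (event: RealtimeEvent) => void>
) {
  return (raw: string) => {
    const event = JSON.parse(raw) as RealtimeEvent;
    // Unhandled event types are silently ignored
    handlers[event.type]?.(event);
  };
}

// Usage: dc.onmessage = (e) => dispatch(e.data);
const dispatch = makeDispatcher({
  'response.audio_transcript.delta': (e) => {
    // append e.delta to the running transcript
  },
});
```

Adding support for a new event type then means adding one entry to the handler map rather than another branch in the handler.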
Handling tool calls requires a round-trip through the data channel:
dc.onmessage = (event) => {
  const data = JSON.parse(event.data);

  if (data.type === 'response.function_call_arguments.done') {
    const { call_id, name, arguments: args } = data;
    const parsedArgs = JSON.parse(args);

    // Execute the tool
    executeToolCall(name, parsedArgs).then((result) => {
      // Send the result back via the data channel
      dc.send(JSON.stringify({
        type: 'conversation.item.create',
        item: {
          type: 'function_call_output',
          call_id,
          output: JSON.stringify(result),
        },
      }));

      // Trigger the model to continue responding
      dc.send(JSON.stringify({ type: 'response.create' }));
    });
  }
};
Notice the conversation statefulness here. The call_id links the tool result back to the specific invocation. The response.create event tells the model to continue -- without it, the model waits indefinitely after the tool result. This is conversation state management at the protocol level.
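Because the two outgoing messages always travel together, it can help to build them in one place. A minimal sketch, with `toolResultMessages` as a hypothetical helper whose payload shapes mirror the handler above:

```typescript
// Sketch: build the two data-channel messages that complete a tool call.
// Returns serialized strings ready for dc.send().
function toolResultMessages(callId: string, result: unknown): string[] {
  return [
    JSON.stringify({
      type: 'conversation.item.create',
      item: {
        type: 'function_call_output',
        call_id: callId, // links the result to the original invocation
        output: JSON.stringify(result),
      },
    }),
    // Without this, the model waits indefinitely after the tool result
    JSON.stringify({ type: 'response.create' }),
  ];
}
```

Sending them as a pair makes it hard to forget the `response.create` that unblocks the model.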
| Factor | WebRTC | WebSocket |
|--------|--------|-----------|
| Client-side (browser) | Best choice | Acceptable |
| Server-side (phone/SIP) | Not applicable | Required |
| Latency | Lower (UDP, direct) | Higher (TCP, proxied) |
| Audio quality control | Built-in (SRTP, DTLS) | Manual |
| Network resilience | ICE, TURN fallback | Reconnect logic needed |
| Implementation complexity | Higher | Lower |
| Security model | Ephemeral keys | Standard API keys |
Use WebRTC when the user is in a browser or mobile app and latency matters. Use WebSocket when the agent runs server-side -- handling phone calls, SIP integrations, or batch processing.
The OpenAI Realtime API is a speech-to-speech model. Audio goes in, audio comes out, all from one model. This is fundamentally different from the pipeline approach (STT + LLM + TTS) covered in the next lesson.
| Dimension | Speech-to-Speech | Pipeline (STT + LLM + TTS) |
|-----------|-----------------|---------------------------|
| Latency | Lower (one model, one hop) | Higher (three hops, but overlappable) |
| Prosody | Better (model "hears" tone) | Depends on TTS quality |
| Architecture | Simpler | More moving parts |
| Model flexibility | OpenAI only | Mix any providers |
| Per-stage control | None | Full (swap STT, LLM, TTS independently) |
| Voice options | Limited | Extensive (dedicated TTS providers) |
| Cost | Higher per minute | Lower with careful provider selection |
For celestino.ai, I chose the pipeline approach (LiveKit + Gemini + ElevenLabs) because I wanted control over each stage. But if you want the fastest path to a working voice agent and are comfortable with OpenAI pricing, the Realtime API is genuinely impressive.
Build a minimal WebRTC voice agent:
1. Create an RTCPeerConnection, capture the microphone, and complete the SDP offer/answer exchange.
2. Handle response.audio_transcript.delta events and display the transcript in real time.

The Realtime API gives you a single model that does everything. But what if you want to choose your own STT, your own LLM, and your own TTS? Next, we cover LiveKit Voice Pipelines -- building modular voice agents where every stage is independently swappable, tunable, and measurable.