A team shipped a voice agent for a real estate company. The agent could answer questions about listings, schedule viewings, and describe neighborhoods. The team celebrated -- the technology worked. Three months later, usage had dropped 60%. The agent was generating responses, but nobody was booking viewings through it. Users asked one question, got an answer, and left.
The team had been tracking "conversations started" and "responses generated." Both numbers were fine. What they had not tracked was task completion: did the user actually schedule a viewing? Did they ask a follow-up question? Did they come back? When they finally instrumented these metrics, they found that 70% of conversations ended after one turn. The agent was answering questions but not advancing users toward their goal. Responses were too long for voice, the agent never proactively offered to schedule, and there was no follow-up prompt after delivering information.
The agent was not broken. It was unmeasured. This lesson teaches you what to measure and how to use those measurements to improve.
Conversational quality breaks down into four measurable dimensions: task completion, efficiency, latency, and trust.
Task completion. What it measures: Did the user accomplish their goal?
This is the north star metric. A user asked a question -- did they get an answer? A user wanted to book an appointment -- did the booking happen?
How to measure it:
```typescript
// Track explicit positive signals
trackEvent('StoryStarted', { source: 'chip' });
trackEvent('QuestionAsked', { source: 'input' });

// Track implicit success: user engages with suggested actions
const handleChipClick = async (text: string) => {
  trackEvent('QuestionAsked', { source: 'chip' });
  await sendMessage({ text });
};
```
Target: 70-85% task completion for well-scoped agents. Below 70% indicates a fundamental design problem -- revisit your conversation design from Lesson 2.
Efficiency. What it measures: How many turns does it take to reach resolution?
An agent that answers in 2 turns what a competitor needs 6 turns for is objectively better -- assuming the answers are equally correct. Efficiency is the ratio of successful outcomes to conversational effort.
How to measure it:
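Turns to resolution can be derived from the same events you already track. Here is a minimal sketch, assuming event shapes like the ones in this lesson; the type and function names are illustrative, not from any specific SDK:

```typescript
// Illustrative event shapes -- align these with whatever you actually track.
type TrackedEvent =
  | { type: 'MessageSent'; role: 'user' | 'assistant' }
  | { type: 'ConversationEnded'; completionSignal: 'explicit' | 'implicit' | 'abandoned' };

// Count user turns before resolution; null means there was no resolution
// (the conversation was abandoned or never ended).
function turnsToResolution(events: TrackedEvent[]): number | null {
  const end = events.find(
    (e): e is Extract<TrackedEvent, { type: 'ConversationEnded' }> =>
      e.type === 'ConversationEnded'
  );
  if (!end || end.completionSignal === 'abandoned') return null;
  return events.filter((e) => e.type === 'MessageSent' && e.role === 'user').length;
}
```

Averaging this per query type gives you the numbers to compare against the targets.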
Targets:
| Query Type | Target Turns | Red Flag |
|------------|-------------|----------|
| Simple lookup | 1-2 | More than 3 |
| Complex question with context | 3-5 | More than 7 |
| Guided workflow | Matches required steps | 2x required steps |
| Clarification rate | Under 15% | Over 25% |
Latency. What it measures: How fast does the agent respond?
Latency is not one number. It is a distribution across multiple stages, and the right metric depends on the modality.
Chat latency metrics:
| Metric | Target | What It Tells You |
|--------|--------|-------------------|
| Time to first token (TTFT) | Under 500ms | Server processing + model startup speed |
| Tokens per second | 30+ | Streaming feels fluid vs choppy |
| Total response time | Informational | Depends on response length, not actionable |
Voice latency metrics:
| Metric | Target | What It Tells You |
|--------|--------|-------------------|
| Mouth-to-ear latency | Under 1000ms | Overall responsiveness |
| STT latency | Under 300ms | Transcription speed |
| LLM TTFT | Under 400ms | Model inference speed |
| TTS TTFB | Under 150ms | Voice synthesis startup |
```typescript
// Measure each pipeline stage
session.on(voice.AgentSessionEventTypes.UserInputTranscribed, (ev) => {
  if (ev.isFinal) {
    trackLatency('stt_complete', performance.now() - sttStartTime);
  }
});

session.on(voice.AgentSessionEventTypes.SpeechCreated, (ev) => {
  trackLatency('speech_started', performance.now() - turnStartTime);
});
```
The p95 matters more than the average. If your average latency is 600ms but your p95 is 3 seconds, one in twenty users is having a terrible experience. Optimize for the tail, not the median.
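Computing those tail metrics takes only a few lines. A sketch using the nearest-rank method over raw latency samples, with no analytics library assumed:

```typescript
// Nearest-rank percentile: sort a copy, pick the sample at the p-th rank.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}
```

Call `percentile(latencies, 50)` for the median and `percentile(latencies, 95)` for the tail this section argues you should optimize.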
Trust. What it measures: Does the user believe and value the agent's responses?
Trust is subjective, but there are proxies: return rate (do users come back?), conversation depth (do they ask follow-ups?), and escalation rate (how often do they give up and demand a human?).
Research on conversational AI shows that effective fallback quality predicts 67% of customer satisfaction variance. How your agent handles failures matters more than how it handles successes. This is why Lesson 7 exists.
Track structured events at key conversation moments:
```typescript
// Field values are written as TypeScript union types to document the
// allowed values for each event property.
trackEvent('ConversationStarted', {
  source: 'direct' | 'chip' | 'voice',
  isAuthenticated: boolean,
});

trackEvent('MessageSent', {
  role: 'user' | 'assistant',
  turnNumber: number,
  latencyMs: number,
  toolsUsed: string[],
});

trackEvent('ConversationEnded', {
  turnCount: number,
  durationSeconds: number,
  completionSignal: 'explicit' | 'implicit' | 'abandoned',
});

trackEvent('ErrorOccurred', {
  errorType: 'provider' | 'transcription' | 'rateLimit' | 'connection',
  recovered: boolean,
  fallbackUsed: boolean,
});
```
Before optimizing, establish baselines for your current performance:
| Metric | Baseline | Target |
|--------|----------|--------|
| Task completion rate | Measure first | 75%+ |
| Average turns to resolution | Measure first | Under 4 |
| Chat TTFT (p50) | Measure first | Under 500ms |
| Voice mouth-to-ear (p50) | Measure first | Under 1000ms |
| Clarification rate | Measure first | Under 15% |
| Return rate (7-day) | Measure first | Over 30% |
| Error recovery rate | Measure first | Over 70% |
Baselines give you ground truth. Without them, you are optimizing against intuition.
Group metrics by the four dimensions:
```
+--------------------------------------------+
| TASK COMPLETION                            |
| * 78% completion rate (+3% vs last week)   |
| * 12% clarification rate                   |
| * 22% abandonment rate (target: <20%)      |
+--------------------------------------------+
| EFFICIENCY                                 |
| * 2.8 avg turns to resolution              |
| * 8% repetition rate                       |
+--------------------------------------------+
| LATENCY                                    |
| * Chat TTFT p50: 380ms, p95: 890ms         |
| * Voice p95: 1,400ms (target: <1,000ms)    |
+--------------------------------------------+
| TRUST                                      |
| * 34% 7-day return rate                    |
| * 3.2 avg conversation depth               |
| * 2% escalation rate                       |
+--------------------------------------------+
```
Use metrics to evaluate changes. Changed the system prompt? Measure task completion rate. Switched TTS providers? Measure voice latency p95. Added a new tool? Measure clarification rate (it should decrease if the tool is useful).
The feedback loop: change, measure, compare to baseline, ship or revert.
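That loop can be made mechanical. A sketch of the ship-or-revert decision for a rate metric like task completion; the minimum-lift threshold is an illustrative placeholder, not a recommendation:

```typescript
// Compare a candidate metric against its baseline with a minimum detectable change.
function shipDecision(
  baseline: number,
  candidate: number,
  minLift = 0.02 // 2 points of task completion -- an assumed threshold
): 'ship' | 'revert' | 'inconclusive' {
  const delta = candidate - baseline;
  if (delta >= minLift) return 'ship';
  if (delta <= -minLift) return 'revert';
  return 'inconclusive'; // within noise: keep measuring before deciding
}
```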
For scale, you need automated evaluation alongside user signals. LLM-as-judge is the current state of the art:
```typescript
import { generateObject } from 'ai';
import { google } from '@ai-sdk/google';
import { z } from 'zod';

const qualitySchema = z.object({
  relevance: z.number().min(1).max(5)
    .describe('How relevant is the response to the query'),
  groundedness: z.number().min(1).max(5)
    .describe('Is the response grounded in the retrieved context'),
  completeness: z.number().min(1).max(5)
    .describe('Does the response fully address the query'),
  hallucination: z.boolean()
    .describe('Does the response contain claims not in the context'),
});

type QualityScore = z.infer<typeof qualitySchema>;

async function evaluateResponse(
  userQuery: string,
  agentResponse: string,
  retrievedContext: string
): Promise<QualityScore> {
  const { object } = await generateObject({
    model: google('gemini-2.5-flash'),
    schema: qualitySchema,
    prompt: `Evaluate this agent response.
User query: ${userQuery}
Retrieved context: ${retrievedContext}
Agent response: ${agentResponse}`,
  });
  return object;
}
```
Run this on a sample of conversations (not all -- it costs money) to get ongoing quality scores. Flag conversations where hallucination is true or relevance is below 3 for human review.
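One way to sample without bias toward busy hours is to hash the conversation ID, so a given conversation is deterministically in or out of the evaluation set. The hash function and the 5% default rate here are assumptions for illustration:

```typescript
// Deterministic sampling: the same conversation ID always gets the same answer.
function inEvalSample(conversationId: string, rate = 0.05): boolean {
  let h = 0;
  for (const ch of conversationId) {
    h = (h * 31 + ch.charCodeAt(0)) >>> 0; // simple 32-bit rolling hash
  }
  return (h % 1000) / 1000 < rate;
}
```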
Set up a measurement framework for your agent:
1. Instrument the core events (ConversationStarted, MessageSent, ConversationEnded, ErrorOccurred, plus a custom task-specific event).
2. Implement the evaluateResponse function and run it on 10 conversations. Compare the automated scores with your manual assessment of quality.

A well-tuned conversational agent hits these benchmarks:
| Metric | Target | What Failing Means |
|--------|--------|--------------------|
| Task completion | 75-85% | Conversation design problem (Lesson 2) |
| Turns to resolution | 2-4 simple, 4-6 complex | Agent is not concise or is asking unnecessary clarifications |
| Chat TTFT p95 | Under 1 second | Server-side processing bottleneck (Lesson 3) |
| Voice mouth-to-ear p95 | Under 1.2 seconds | Pipeline stage too slow (Lesson 6) |
| Clarification rate | Under 15% | Grounding or tool use problem (Lessons 2, 4) |
| Error recovery rate | Over 70% | Degradation stack incomplete (Lesson 7) |
| 7-day return rate | Over 30% | Trust problem -- check hallucination rate |
| Hallucination rate | Under 5% | RAG grounding or prompt guardrails failing |
These are not theoretical. They are achievable with the techniques covered in this course.
This concludes Voice & Chat Agent Engineering. Over eight lessons, you have learned to choose the right modality, design conversations that build trust, build streaming chat with session management, add tools and structured outputs, implement WebRTC voice with the Realtime API, compose modular voice pipelines with LiveKit, handle every category of failure gracefully, and measure whether any of it is actually working.
The models will keep getting better. The latency will keep dropping. New providers will emerge. But the fundamentals do not change: conversations are state machines, latency is the user experience, and silence is the worst error message. Build for those truths and the specifics will take care of themselves.