A team shipped a voice agent for a real estate company. The agent could answer questions about listings, schedule viewings, and describe neighborhoods. The team celebrated -- the technology worked. Three months later, usage had dropped 60%. The agent was generating responses, but nobody was booking viewings through it. Users asked one question, got an answer, and left.
The team had been tracking "conversations started" and "responses generated." Both numbers were fine. What they had not tracked was task completion: did the user actually schedule a viewing? Did they ask a follow-up question? Did they come back? When they finally instrumented these metrics, they found that 70% of conversations ended after one turn. The agent was answering questions but not advancing users toward their goal. Responses were too long for voice, the agent never proactively offered to schedule, and there was no follow-up prompt after delivering information.
The agent was not broken. It was unmeasured. This lesson teaches you what to measure and how to use those measurements to improve.
Conversational quality breaks down into four measurable dimensions: task completion, efficiency, latency, and trust.
Task completion. What it measures: Did the user accomplish their goal?
This is the north star metric. A user asked a question -- did they get an answer? A user wanted to book an appointment -- did the booking happen?
How to measure it:
```typescript
// Track explicit positive signals
trackEvent('StoryStarted', { source: 'chip' });
trackEvent('QuestionAsked', { source: 'input' });

// Track implicit success: user engages with suggested actions
const handleChipClick = async (text: string) => {
  trackEvent('QuestionAsked', { source: 'chip' });
  await sendMessage({ text });
};
```
Target: 70-85% task completion for well-scoped agents. Below 70% indicates a fundamental design problem -- revisit your conversation design from Lesson 2.
Efficiency. What it measures: How many turns does it take to reach resolution?
An agent that answers in 2 turns what a competitor needs 6 turns for is objectively better -- assuming the answers are equally correct. Efficiency is the ratio of successful outcomes to conversational effort.
How to measure it:
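Turns to resolution can be derived from the same events you already track. Here is a minimal sketch, assuming event shapes like the ones in this lesson; the type and function names are illustrative, not from any specific SDK:

```typescript
// Illustrative event shapes -- align these with whatever you actually track.
type TrackedEvent =
  | { type: 'MessageSent'; role: 'user' | 'assistant' }
  | { type: 'ConversationEnded'; completionSignal: 'explicit' | 'implicit' | 'abandoned' };

// Count user turns before resolution; null means there was no resolution
// (the conversation was abandoned or never ended).
function turnsToResolution(events: TrackedEvent[]): number | null {
  const end = events.find(
    (e): e is Extract<TrackedEvent, { type: 'ConversationEnded' }> =>
      e.type === 'ConversationEnded'
  );
  if (!end || end.completionSignal === 'abandoned') return null;
  return events.filter((e) => e.type === 'MessageSent' && e.role === 'user').length;
}
```

Averaging this per query type gives you the numbers to compare against the targets.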
Targets:
| Query Type | Target Turns | Red Flag |
|------------|-------------|----------|
| Simple lookup | 1-2 | More than 3 |
| Complex question with context | 3-5 | More than 7 |
| Guided workflow | Matches required steps | 2x required steps |
| Clarification rate | Under 15% | Over 25% |
Latency. What it measures: How fast does the agent respond?
Latency is not one number. It is a distribution across multiple stages, and the right metric depends on the modality.
Chat latency metrics:
| Metric | Target | What It Tells You |
|--------|--------|-------------------|
| Time to first token (TTFT) | Under 500ms | Server processing + model startup speed |
| Tokens per second | 30+ | Streaming feels fluid vs choppy |
| Total response time | Informational | Depends on response length, not actionable |
Voice latency metrics:
| Metric | Target | What It Tells You |
|--------|--------|-------------------|
| Mouth-to-ear latency | Under 1000ms | Overall responsiveness |
| STT latency | Under 300ms | Transcription speed |
| LLM TTFT | Under 400ms | Model inference speed |
| TTS TTFB | Under 150ms | Voice synthesis startup |
```typescript
// Measure each pipeline stage
session.on(voice.AgentSessionEventTypes.UserInputTranscribed, (ev) => {
  if (ev.isFinal) {
    trackLatency('stt_complete', performance.now() - sttStartTime);
  }
});

session.on(voice.AgentSessionEventTypes.SpeechCreated, (ev) => {
  trackLatency('speech_started', performance.now() - turnStartTime);
});
```
The p95 matters more than the average. If your average latency is 600ms but your p95 is 3 seconds, one in twenty users is having a terrible experience. Optimize for the tail, not the median.
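Computing those tail metrics takes only a few lines. A sketch using the nearest-rank method over raw latency samples, with no analytics library assumed:

```typescript
// Nearest-rank percentile: sort a copy, pick the sample at the p-th rank.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}
```

Call `percentile(latencies, 50)` for the median and `percentile(latencies, 95)` for the tail this section argues you should optimize.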
Trust. What it measures: Does the user believe and value the agent's responses?
Trust is subjective, but there are proxies: return rate (do users come back?), conversation depth (do they ask follow-ups?), and escalation rate (how often do they give up and demand a human?).
Research on conversational AI shows that effective fallback quality predicts 67% of customer satisfaction variance. How your agent handles failures matters more than how it handles successes. This is why Lesson 7 exists.
Track structured events at key conversation moments:
```typescript
// Field values are written as TypeScript union types to document the
// allowed values for each event property.
trackEvent('ConversationStarted', {
  source: 'direct' | 'chip' | 'voice',
  isAuthenticated: boolean,
});

trackEvent('MessageSent', {
  role: 'user' | 'assistant',
  turnNumber: number,
  latencyMs: number,
  toolsUsed: string[],
});

trackEvent('ConversationEnded', {
  turnCount: number,
  durationSeconds: number,
  completionSignal: 'explicit' | 'implicit' | 'abandoned',
});

trackEvent('ErrorOccurred', {
  errorType: 'provider' | 'transcription' | 'rateLimit' | 'connection',
  recovered: boolean,
  fallbackUsed: boolean,
});
```
Before optimizing, establish baselines for your current performance:
| Metric | Baseline | Target |
|--------|----------|--------|
| Task completion rate | Measure first | 75%+ |
| Average turns to resolution | Measure first | Under 4 |
| Chat TTFT (p50) | Measure first | Under 500ms |
| Voice mouth-to-ear (p50) | Measure first | Under 1000ms |
| Clarification rate | Measure first | Under 15% |
| Return rate (7-day) | Measure first | Over 30% |
| Error recovery rate | Measure first | Over 70% |
Baselines give you ground truth. Without them, you are optimizing against intuition.
Group metrics by the four dimensions:
```
+--------------------------------------------+
| TASK COMPLETION                            |
| * 78% completion rate (+3% vs last week)   |
| * 12% clarification rate                   |
| * 22% abandonment rate (target: <20%)      |
+--------------------------------------------+
| EFFICIENCY                                 |
| * 2.8 avg turns to resolution              |
| * 8% repetition rate                       |
+--------------------------------------------+
| LATENCY                                    |
| * Chat TTFT p50: 380ms, p95: 890ms         |
| * Voice p95: 1,400ms (target: <1,000ms)    |
+--------------------------------------------+
| TRUST                                      |
| * 34% 7-day return rate                    |
| * 3.2 avg conversation depth               |
| * 2% escalation rate                       |
+--------------------------------------------+
```
Use metrics to evaluate changes. Changed the system prompt? Measure task completion rate. Switched TTS providers? Measure voice latency p95. Added a new tool? Measure clarification rate (it should decrease if the tool is useful).
The feedback loop: change, measure, compare to baseline, ship or revert.
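That loop can be made mechanical. A sketch of the ship-or-revert decision for a rate metric like task completion; the minimum-lift threshold is an illustrative placeholder, not a recommendation:

```typescript
// Compare a candidate metric against its baseline with a minimum detectable change.
function shipDecision(
  baseline: number,
  candidate: number,
  minLift = 0.02 // 2 points of task completion -- an assumed threshold
): 'ship' | 'revert' | 'inconclusive' {
  const delta = candidate - baseline;
  if (delta >= minLift) return 'ship';
  if (delta <= -minLift) return 'revert';
  return 'inconclusive'; // within noise: keep measuring before deciding
}
```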
For scale, you need automated evaluation alongside user signals. LLM-as-judge is the current state of the art:
```typescript
import { generateObject } from 'ai';
import { google } from '@ai-sdk/google';
import { z } from 'zod';

const qualitySchema = z.object({
  relevance: z.number().min(1).max(5)
    .describe('How relevant is the response to the query'),
  groundedness: z.number().min(1).max(5)
    .describe('Is the response grounded in the retrieved context'),
  completeness: z.number().min(1).max(5)
    .describe('Does the response fully address the query'),
  hallucination: z.boolean()
    .describe('Does the response contain claims not in the context'),
});

type QualityScore = z.infer<typeof qualitySchema>;

async function evaluateResponse(
  userQuery: string,
  agentResponse: string,
  retrievedContext: string
): Promise<QualityScore> {
  const { object } = await generateObject({
    model: google('gemini-2.5-flash'),
    schema: qualitySchema,
    prompt: `Evaluate this agent response.
User query: ${userQuery}
Retrieved context: ${retrievedContext}
Agent response: ${agentResponse}`,
  });
  return object;
}
```
Run this on a sample of conversations (not all -- it costs money) to get ongoing quality scores. Flag conversations where hallucination is true or relevance is below 3 for human review.
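One way to sample without bias toward busy hours is to hash the conversation ID, so a given conversation is deterministically in or out of the evaluation set. The hash function and the 5% default rate here are assumptions for illustration:

```typescript
// Deterministic sampling: the same conversation ID always gets the same answer.
function inEvalSample(conversationId: string, rate = 0.05): boolean {
  let h = 0;
  for (const ch of conversationId) {
    h = (h * 31 + ch.charCodeAt(0)) >>> 0; // simple 32-bit rolling hash
  }
  return (h % 1000) / 1000 < rate;
}
```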
Set up a measurement framework for your agent:
1. Instrument the core events (ConversationStarted, MessageSent, ConversationEnded, ErrorOccurred, plus a custom task-specific event).
2. Implement the evaluateResponse function and run it on 10 conversations. Compare the automated scores with your manual assessment of quality.

A well-tuned conversational agent hits these benchmarks:
| Metric | Target | What Failing Means |
|--------|--------|--------------------|
| Task completion | 75-85% | Conversation design problem (Lesson 2) |
| Turns to resolution | 2-4 simple, 4-6 complex | Agent is not concise or is asking unnecessary clarifications |
| Chat TTFT p95 | Under 1 second | Server-side processing bottleneck (Lesson 3) |
| Voice mouth-to-ear p95 | Under 1.2 seconds | Pipeline stage too slow (Lesson 6) |
| Clarification rate | Under 15% | Grounding or tool use problem (Lessons 2, 4) |
| Error recovery rate | Over 70% | Degradation stack incomplete (Lesson 7) |
| 7-day return rate | Over 30% | Trust problem -- check hallucination rate |
| Hallucination rate | Under 5% | RAG grounding or prompt guardrails failing |
These are not theoretical. They are achievable with the techniques covered in this course.
This concludes Voice & Chat Agent Engineering. Over eight lessons, you have learned to choose the right modality, design conversations that build trust, build streaming chat with session management, add tools and structured outputs, implement WebRTC voice with the Realtime API, compose modular voice pipelines with LiveKit, handle every category of failure gracefully, and measure whether any of it is actually working.
The models will keep getting better. The latency will keep dropping. New providers will emerge. But the fundamentals do not change: conversations are state machines, latency is the user experience, and silence is the worst error message. Build for those truths and the specifics will take care of themselves.