A customer support agent could explain the refund policy in beautiful detail. But when a user said "refund my last order," the agent responded with instructions to visit the refund page and fill out a form. The user was talking to an AI agent specifically to avoid filling out forms. The conversation felt like calling a company and being told to check the website.
The agent could talk about actions. It could not take them. This is the gap that tool use fills. When the model can call functions -- look up an order, process a refund, check inventory -- the conversation becomes genuinely useful. Without tools, your agent is a search bar with personality.
The AI SDK defines tools as functions the model can decide to call. You describe the tool's purpose and its parameters using a Zod schema. The model decides when to invoke it, and your code executes the function.
```typescript
import { streamText } from 'ai';
import { google } from '@ai-sdk/google';
import { z } from 'zod';

const result = streamText({
  model: google('gemini-2.5-flash'),
  system: systemPrompt,
  messages: modelMessages,
  tools: {
    searchKnowledge: {
      description: 'Search the knowledge base for information about projects, work, or expertise.',
      parameters: z.object({
        query: z.string().describe('The search query'),
      }),
      execute: async ({ query }) => {
        const docs = await searchDatabase(query);
        return JSON.stringify(docs);
      },
    },
    getCurrentWeather: {
      description: 'Get current weather for a location',
      parameters: z.object({
        location: z.string().describe('City name or coordinates'),
        unit: z.enum(['celsius', 'fahrenheit']).optional(),
      }),
      execute: async ({ location, unit }) => {
        const weather = await fetchWeather(location, unit);
        return JSON.stringify(weather);
      },
    },
  },
  maxSteps: 5, // Allow up to 5 tool calls per response
});
```
Three things matter here: the description, which the model reads to decide when the tool is relevant; the parameters schema, which defines and validates the arguments the model supplies; and the execute function, which runs your code and returns a result the model reads before continuing.
The maxSteps parameter controls the agentic loop. The model can call a tool, read the result, call another tool, and keep going until it has enough information to respond -- or until it hits the step limit.
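The loop that maxSteps bounds can be sketched in plain TypeScript. This is a standalone illustration with a stubbed-out model decision, not the SDK's internals; the names ModelTurn and runAgentLoop are invented for the sketch:

```typescript
// Sketch of the agentic loop: call a tool, feed the result back,
// repeat until the model is ready to answer or the step limit hits.
type ToolCall = { toolName: string; args: unknown } | null;

interface ModelTurn {
  decide(history: string[]): ToolCall; // null means: ready to answer
}

function runAgentLoop(
  model: ModelTurn,
  tools: Record<string, (args: unknown) => string>,
  maxSteps: number,
): string[] {
  const history: string[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const call = model.decide(history);
    if (call === null) break; // model has enough information
    const result = tools[call.toolName](call.args);
    history.push(`${call.toolName} -> ${result}`); // result fed back in
  }
  return history;
}

// A stub model that searches once, then answers.
const stubModel: ModelTurn = {
  decide: (h) => (h.length === 0 ? { toolName: 'search', args: 'refunds' } : null),
};
const trace = runAgentLoop(stubModel, { search: (q) => `docs for ${String(q)}` }, 5);
// trace: ['search -> docs for refunds']
```

The step limit is the safety valve: without it, a model that keeps asking for more tool calls would loop indefinitely.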
Tools work the same way conceptually in voice agents, but the UX is fundamentally different. When a chat agent calls a tool, you can show a loading indicator. When a voice agent calls a tool, there is silence.
Here is the same knowledge base search tool in a LiveKit voice agent:
```typescript
import { llm } from '@livekit/agents';
import { z } from 'zod';

const tools = {
  search: llm.tool({
    description: 'Search the knowledge base for information about projects, work, or expertise.',
    parameters: z.object({
      query: z.string().describe('The search query'),
    }),
    execute: async ({ query }) => {
      const docs = await retrieveContext(query);
      if (docs.length === 0) {
        return 'No specific information found for this query.';
      }
      return docs.map((d) => d.content).join('\n\n');
    },
  }),
};
```
The API surface is nearly identical. The difference is latency sensitivity:
| Context | Acceptable Tool Latency | User Experience During Wait |
|---------|------------------------|----------------------------|
| Chat | Up to 3 seconds | Loading spinner, "Searching..." indicator |
| Voice | Under 500ms | Silence -- feels like the agent froze |
Strategies for voice tool latency:

- Speak a short filler phrase ("Let me check that for you") before or while the tool runs, so the silence never stretches past a beat.
- Keep tool work fast: cache common lookups, precompute where possible, and run independent calls in parallel.
- If a tool is unavoidably slow, acknowledge the wait out loud rather than leaving dead air.
For complex workflows, you need fine-grained control over which tools are available at each step and when the loop should stop.
```typescript
import { streamText, stepCountIs } from 'ai';

const result = streamText({
  model: google('gemini-2.5-flash'),
  messages,
  tools: myTools,
  stopWhen: stepCountIs(3), // Stop after 3 steps
});
```

Note that in AI SDK 5, stopWhen with stepCountIs replaces maxSteps, so the two are not combined.
For more dynamic control, stopWhen accepts a function and prepareStep lets you change available tools per step:
```typescript
const result = streamText({
  model: google('gemini-2.5-flash'),
  messages,
  tools: myTools,
  stopWhen: ({ steps }) => {
    // Stop once a specific tool has been called
    const last = steps[steps.length - 1];
    return last?.toolCalls.some((c) => c.toolName === 'submitOrder') ?? false;
  },
  prepareStep: ({ stepNumber }) => {
    // After 3 steps, only allow the final submission tool
    if (stepNumber > 3) {
      return { activeTools: ['submitOrder'] };
    }
    return {};
  },
});
```
stopWhen halts the loop based on conditions -- useful for workflows where a specific tool call means "we are done." prepareStep changes the available tools at each step -- useful for guided flows where the agent should not skip ahead.
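The interaction of the two hooks can be shown in plain TypeScript. The names mirror the SDK's, but the loop below is a standalone illustration with an invented runGuidedLoop helper, not the SDK's implementation:

```typescript
// A "model" that wants to call tools in a fixed order; prepareStep gates
// which tools it may pick each step, and stopWhen ends the loop early.
type Step = { toolName: string };

function runGuidedLoop(
  plan: string[],                                // tools the model wants, in order
  stopWhen: (steps: Step[]) => boolean,
  prepareStep: (stepNumber: number) => string[], // tools allowed at this step
  maxSteps = 10,
): Step[] {
  const remaining = [...plan];
  const steps: Step[] = [];
  for (let i = 0; i < maxSteps && remaining.length > 0; i++) {
    const allowed = prepareStep(i);
    const next = remaining.find((t) => allowed.includes(t)); // pick from what's allowed
    if (!next) break;
    remaining.splice(remaining.indexOf(next), 1);
    steps.push({ toolName: next });
    if (stopWhen(steps)) break; // e.g. submitOrder means "we are done"
  }
  return steps;
}

const guided = runGuidedLoop(
  ['lookupUser', 'checkStock', 'submitOrder', 'sendMarketingEmail'],
  (s) => s[s.length - 1]?.toolName === 'submitOrder',      // stop on submission
  (n) => (n > 2 ? ['submitOrder'] : ['lookupUser', 'checkStock', 'submitOrder']),
);
// guided: lookupUser, checkStock, submitOrder -- sendMarketingEmail never runs
```

The stop condition fires the moment submitOrder lands, so later tools in the plan are never reached even though the step budget allows them.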
Structured outputs force the model to return data in a specific shape, validated against a schema. This is different from tool use -- here you are constraining the model's final response, not giving it functions to call.
```typescript
import { generateObject } from 'ai';
import { google } from '@ai-sdk/google';
import { z } from 'zod';

const schema = z.object({
  sentiment: z.enum(['positive', 'negative', 'neutral']),
  confidence: z.number().min(0).max(1),
  topics: z.array(z.string()).max(5),
  summary: z.string().max(200),
});

const { object } = await generateObject({
  model: google('gemini-2.5-flash'),
  schema,
  prompt: `Analyze this customer message: "${userMessage}"`,
});

// object is fully typed:
// { sentiment: 'positive', confidence: 0.87, topics: ['pricing'], summary: '...' }
```
The model's output is validated against the schema before it reaches your code -- if the response cannot be coerced into the declared shape, generateObject throws rather than handing you malformed data. No parsing, no regex, no "please format your response as JSON." The AI SDK handles the formatting instructions and validation for you.
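To make concrete what that validation buys you, here is the same contract written as a hand-rolled type guard. The isAnalysis function is invented for illustration; generateObject does this work for you via Zod:

```typescript
// The shape the sentiment schema above declares, enforced by hand.
type Analysis = {
  sentiment: 'positive' | 'negative' | 'neutral';
  confidence: number; // 0..1
  topics: string[];   // at most 5
  summary: string;    // at most 200 chars
};

function isAnalysis(value: unknown): value is Analysis {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.sentiment === 'string' &&
    ['positive', 'negative', 'neutral'].includes(v.sentiment) &&
    typeof v.confidence === 'number' && v.confidence >= 0 && v.confidence <= 1 &&
    Array.isArray(v.topics) && v.topics.length <= 5 &&
    v.topics.every((t) => typeof t === 'string') &&
    typeof v.summary === 'string' && v.summary.length <= 200
  );
}
```

Every branch of this guard corresponds to one constraint in the Zod schema; the point of generateObject is that you never have to write or maintain this code yourself.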
The real power comes from combining both: the agent calls tools to gather information, then returns a structured response.
```typescript
const result = streamText({
  model: google('gemini-2.5-flash'),
  messages,
  tools: {
    lookupUser: {
      description: 'Look up user information by email',
      parameters: z.object({ email: z.string().email() }),
      execute: async ({ email }) => {
        return JSON.stringify(await db.users.findByEmail(email));
      },
    },
    checkSubscription: {
      description: 'Check subscription status',
      parameters: z.object({ userId: z.string() }),
      execute: async ({ userId }) => {
        return JSON.stringify(await db.subscriptions.get(userId));
      },
    },
  },
  maxSteps: 3,
});
```
The model might first call lookupUser, then checkSubscription with the returned user ID, then synthesize both results into a human-readable response. This is the agentic pattern -- the model reasons about which tools to call and in what order.
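That chaining, written out by hand, looks like the sketch below. The model would decide this ordering itself at runtime; here it is fixed, and the data store and helper names are made up for the example:

```typescript
// In-memory stand-in for the real database behind the two tools.
const fakeDb = {
  users: new Map([['ada@example.com', { id: 'u1', name: 'Ada' }]]),
  subscriptions: new Map([['u1', { plan: 'pro', active: true }]]),
};

async function lookupUser(email: string) {
  return fakeDb.users.get(email) ?? null;
}
async function checkSubscription(userId: string) {
  return fakeDb.subscriptions.get(userId) ?? null;
}

async function answerSubscriptionQuestion(email: string): Promise<string> {
  const user = await lookupUser(email);            // step 1
  if (!user) return 'No account found for that email.';
  const sub = await checkSubscription(user.id);    // step 2: uses step 1's output
  if (!sub?.active) return `${user.name} has no active subscription.`;
  return `${user.name} is on the ${sub.plan} plan.`; // step 3: synthesize
}
```

The interesting part is step 2: the userId argument does not exist in the conversation at all -- it only becomes available once the first tool returns, which is why a single-step call could never answer this question.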
When writing schemas for structured output:

- Use .describe() on every field. The description helps the model understand what each field means.
- Use .max(), .min(), and .length() to prevent runaway outputs.
- Mark fields that may legitimately be absent as .optional().

```typescript
// Good: descriptive, constrained
z.object({
  priority: z.enum(['low', 'medium', 'high', 'critical'])
    .describe('How urgent this issue is'),
  estimatedMinutes: z.number().min(1).max(480)
    .describe('Estimated time to resolve in minutes'),
  category: z.string().max(50)
    .describe('The support category this falls under'),
});
```
Add tool use to the streaming chat you built in Lesson 3:
1. Add one or two tools to your streamText call with maxSteps: 3.
2. Render tool activity in the UI by checking part.type === 'tool-invocation' and part.state === 'call'.
3. Try stopWhen: stepCountIs(3) and observe how it affects multi-step reasoning.

Key takeaways: maxSteps controls the agentic loop, and stopWhen and prepareStep give you fine-grained control over it. Constrain structured-output schemas with .describe(), enums, and size limits.

You have a streaming chat agent that can call functions and return structured data. Now we cross the modality boundary. Next, we cover WebRTC and the OpenAI Realtime API -- how to build voice agents that process audio end-to-end with sub-second latency, delivered over peer-to-peer connections.