Building AI Agents That Actually Work
I shipped my first AI agent in late 2024. It answered
support tickets, looked up order data, and issued refunds.
The demo was flawless. Three test cases, three perfect
runs. Leadership signed off. We deployed.
Within 72 hours, the agent had entered an infinite loop
on a malformed order ID, burned through $400 in API
calls, and attempted to refund an order that did not
exist. We pulled the plug on a Friday night.
That failure taught me more about agent design than any
tutorial ever did. The problem was not the LLM. It was
the orchestration -- or rather, the complete lack of it.
Why Most Agent Demos Fail in Production
The typical agent tutorial looks like this: give the
model a list of tools, wrap it in a while loop, and let
it decide what to do. This is the ReAct pattern, and it
works beautifully on toy problems.
Production is not a toy problem.
Unbounded loops. Without a step limit, the agent
will happily call the same tool 47 times trying to
parse a response it does not understand. I have seen
this happen. The logs were 12,000 lines long.
Tool abuse. Give an agent access to a database
query tool and a delete tool, and eventually it will
decide that deleting a record is the fastest way to
"resolve" a discrepancy. The model is optimizing for
task completion, not for your business rules.
Cost spiraling. Each agent step is an LLM call.
A 5-step task at $0.03 per call costs $0.15. A
15-step retry loop costs $0.45. Multiply by 10,000
daily requests and you are burning $4,500/day on a
feature that was supposed to save money.
No observability. When the agent makes a bad
decision at step 7 of 12, you need to know why. Most
agent frameworks give you a final answer and nothing
else. Good luck debugging that in production.
The Architecture That Works
After rebuilding that first agent (and two more after
it), I settled on a four-stage pipeline. Every
production agent I have shipped since follows this
pattern.
Router → Planner → Executor → Validator
Router. Classifies the incoming request and
decides which agent workflow handles it. This is a
cheap, fast LLM call with structured output. No tools,
no loops. Just classification. If the request does not
match any known workflow, it routes to a human.
Planner. Takes the classified request and produces
a step-by-step plan. The plan is a typed array of
actions -- not free-form text. Each action specifies
which tool to call, what arguments to pass, and what
the expected output shape looks like. The planner
never executes anything. It only plans.
Executor. Walks through the plan one step at a
time. Calls tools, captures results, handles errors.
If a step fails, the executor does not improvise. It
either retries with exponential backoff or escalates
to the validator.
Validator. Reviews the executor's output against
the original request. Did we actually answer the
question? Does the result make sense? If not, the
validator can send it back to the planner for a
revised plan -- but only once. No infinite
replanning loops.
This separation matters because it gives you control
surfaces. You can swap the planner model without
touching execution. You can add a human checkpoint
between planning and execution. You can cache plans
for identical requests.
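To make "typed array of actions" concrete, here is a minimal sketch of what a plan entry might look like -- the field names are illustrative, not a fixed schema:

```typescript
// Illustrative shape only -- the real type depends on your tool registry.
interface ActionStep {
  tool: string                      // name of a registered tool
  args: Record<string, unknown>     // checked later against the tool's input schema
  expectedOutput: string            // what the planner thinks comes back
  maxRetries?: number               // per-step executor override
}

// A plan is an ordered array of typed actions -- never free-form text.
const refundPlan: ActionStep[] = [
  {
    tool: 'lookupOrder',
    args: { orderId: 'ORD-00001234' },
    expectedOutput: 'order record with status and total'
  },
  {
    tool: 'issueRefund',
    args: { orderId: 'ORD-00001234' },
    expectedOutput: 'refund confirmation id'
  }
]
```

Because the plan is data, not prose, you can diff it, cache it, and show it to a human before anything executes.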
Tool Calling Done Right
Tools are where agents become dangerous. Every tool
is an action in the real world -- an API call, a
database write, a notification sent. Treat them like
API endpoints, not like function arguments.
Schema Validation
Every tool gets a Zod schema. Inputs are validated
before execution. Outputs are validated after.
import { z } from 'zod'
const lookupOrderSchema = {
  input: z.object({
    orderId: z.string()
      .regex(/^ORD-\d{8}$/, 'Invalid order ID format'),
    fields: z.array(
      z.enum(['status', 'total', 'items', 'shipping'])
    ).optional()
  }),
  output: z.object({
    orderId: z.string(),
    status: z.enum([
      'pending', 'shipped', 'delivered', 'cancelled'
    ]),
    total: z.number(),
    items: z.array(z.object({
      name: z.string(),
      quantity: z.number()
    }))
  })
}
When the agent tries to call lookupOrder with
orderId: "just check all orders", the schema
rejects it before it hits your database. No
ambiguity. No creative interpretation.
Least Privilege
Not every agent workflow needs every tool. The
refund agent gets lookupOrder and issueRefund.
It does not get deleteOrder or modifyInventory.
I define tool sets per workflow, not per agent.
const workflowTools: Record<string, Tool[]> = {
  'refund': [lookupOrder, issueRefund, notifyCustomer],
  'status-check': [lookupOrder, getShipmentTracking],
  'escalation': [lookupOrder, createTicket, notifyAgent]
}
This is the same principle as IAM roles. You would
not give a read-only service account write access to
your database. Do not give a status-check workflow
access to mutation tools.
Error Boundaries
Tools fail. APIs time out. Databases return
unexpected nulls. Every tool call gets wrapped in an
error boundary that captures the failure, classifies
it, and returns a structured error to the agent.
async function executeToolSafe(
  tool: Tool,
  input: unknown
): Promise<ToolResult> {
  const parsed = tool.schema.input.safeParse(input)
  if (!parsed.success) {
    return {
      status: 'validation_error',
      error: parsed.error.format(),
      retryable: false
    }
  }
  try {
    const result = await Promise.race([
      tool.execute(parsed.data),
      timeout(tool.timeoutMs ?? 5000)
    ])
    const output = tool.schema.output.safeParse(result)
    if (!output.success) {
      return {
        status: 'output_validation_error',
        error: output.error.format(),
        retryable: true
      }
    }
    return { status: 'success', data: output.data }
  } catch (err) {
    return {
      status: 'execution_error',
      error: String(err),
      retryable: isRetryable(err)
    }
  }
}
The agent never sees a raw exception. It sees a
structured result with a status code, an error
description, and a retryable flag. This keeps the
LLM from hallucinating recovery strategies.
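executeToolSafe leans on a few small helpers that are not shown above (and the executor later uses a sleep for backoff). Sketches of what they might look like -- the retryable heuristic in particular is an assumption you should tune per API:

```typescript
// Rejects after `ms` milliseconds; raced against the tool call.
function timeout(ms: number): Promise<never> {
  return new Promise((_, reject) =>
    setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms)
  )
}

// Promisified setTimeout for the executor's exponential backoff.
function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms))
}

// Heuristic: transient failures (timeouts, resets, 5xx) are worth a
// retry; anything that looks like bad input or a 4xx is not.
function isRetryable(err: unknown): boolean {
  const msg = String(err).toLowerCase()
  return ['timed out', 'timeout', 'econnreset', '502', '503', '504']
    .some(marker => msg.includes(marker))
}
```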
LangGraph Patterns for Multi-Step Workflows
I use LangGraph for agent orchestration because it
makes the state machine explicit. You can see the
graph. You can test individual nodes. You can replay
from any checkpoint.
Here is the router-planner-executor pattern as a
LangGraph workflow:
import { StateGraph, Annotation } from '@langchain/langgraph'

const AgentState = Annotation.Root({
  messages: Annotation<BaseMessage[]>({
    reducer: (a, b) => [...a, ...b],
    default: () => []
  }),
  plan: Annotation<ActionStep[]>({
    default: () => []
  }),
  currentStep: Annotation<number>({
    default: () => 0
  }),
  results: Annotation<ToolResult[]>({
    reducer: (a, b) => [...a, ...b],
    default: () => []
  }),
  totalTokens: Annotation<number>({
    default: () => 0
  }),
  status: Annotation<
    'routing' | 'planning' | 'executing' |
    'validating' | 'done' | 'failed'
  >({
    default: () => 'routing' as const
  })
})

const graph = new StateGraph(AgentState)
  .addNode('router', routerNode)
  .addNode('planner', plannerNode)
  .addNode('executor', executorNode)
  .addNode('validator', validatorNode)
  .addEdge('__start__', 'router')
  .addConditionalEdges('router', routeDecision)
  .addEdge('planner', 'executor')
  .addConditionalEdges('executor', executionDecision)
  .addConditionalEdges('validator', validationDecision)
  .compile()
The conditional edges are where the logic lives.
executionDecision checks: did the current step
succeed? Is there a next step? Have we hit the step
limit? validationDecision checks: did the output
pass validation? Have we already replanned once?
Each node is a pure function that takes state and
returns a partial state update. Testing is
straightforward -- pass in a state, assert on the
output. No mocking LLM calls for unit tests on your
orchestration logic.
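A sketch of what those two decision functions could look like, assuming the state shape above plus a step limit and a replan counter (the replanCount field and the node-name return values are assumptions -- match them to your own graph):

```typescript
const MAX_STEPS = 10
const MAX_REPLANS = 1

// Hypothetical slice of graph state -- just the fields the edges need.
interface EdgeState {
  status: string
  currentStep: number
  planLength: number
  replanCount: number   // assumed counter; not in the Annotation above
}

// After each executor tick: plan finished -> validate; limits hit -> stop.
function executionDecision(s: EdgeState): 'executor' | 'validator' | '__end__' {
  if (s.status === 'failed') return '__end__'
  if (s.currentStep >= MAX_STEPS) return '__end__'
  return s.currentStep >= s.planLength ? 'validator' : 'executor'
}

// After validation: pass -> done; fail -> replan exactly once, then stop.
function validationDecision(s: EdgeState): 'planner' | '__end__' {
  if (s.status === 'done') return '__end__'
  return s.replanCount < MAX_REPLANS ? 'planner' : '__end__'
}
```

Both are pure functions of state, which is what makes them trivially unit-testable.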
Handling Retries
I implement retries at the executor level, not the
graph level. The executor tracks attempt counts per
step and applies exponential backoff.
async function executorNode(
  state: typeof AgentState.State
) {
  const step = state.plan[state.currentStep]
  const maxRetries = step.maxRetries ?? 3
  let attempt = 0
  while (attempt < maxRetries) {
    const result = await executeToolSafe(
      step.tool,
      step.args
    )
    if (result.status === 'success') {
      return {
        results: [result],
        currentStep: state.currentStep + 1,
        status: state.currentStep + 1 >= state.plan.length
          ? 'validating' : 'executing'
      }
    }
    if (!result.retryable) break
    attempt++
    await sleep(Math.pow(2, attempt) * 1000)
  }
  return {
    status: 'failed',
    results: [{
      status: 'max_retries_exceeded',
      step: state.currentStep
    }]
  }
}
Cost Controls
Without cost controls, agents are a blank check. I
learned this the hard way with that $400 Friday night
incident. Every production agent needs three
safeguards.
Token Budgets
Set a maximum token budget per agent run. Track
cumulative usage across all LLM calls in the run.
Kill the run when it hits the ceiling.
const COST_LIMITS = {
  maxTokensPerRun: 50_000,
  maxStepsPerRun: 10,
  maxCostPerRun: 0.50 // USD
}

function checkBudget(
  state: typeof AgentState.State
): 'continue' | 'budget_exceeded' {
  if (state.totalTokens > COST_LIMITS.maxTokensPerRun)
    return 'budget_exceeded'
  if (state.currentStep > COST_LIMITS.maxStepsPerRun)
    return 'budget_exceeded'
  return 'continue'
}
In production, my agents average 3-5 steps and
8,000-15,000 tokens per run. That works out to about
$0.04-0.12 per agent run with Claude Sonnet. The
ceiling at 50,000 tokens catches runaway loops
without cutting off legitimate complex tasks.
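If you want to sanity-check those numbers yourself, the conversion from tokens to dollars is one small function. The per-million-token rates below are assumptions for illustration -- check your provider's current pricing:

```typescript
// Assumed illustrative rates in USD per million tokens -- not official pricing.
const RATES = { inputPerM: 3.0, outputPerM: 15.0 }

function estimateCostUsd(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1_000_000) * RATES.inputPerM +
         (outputTokens / 1_000_000) * RATES.outputPerM
}

// e.g. a run with 10,000 input and 1,500 output tokens:
// 10,000/1M * 3.0 + 1,500/1M * 15.0 = 0.03 + 0.0225 = $0.0525
```

Tracking this per run is what lets checkBudget enforce a dollar ceiling, not just a token ceiling.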
Step Limits
Token budgets catch cost overruns. Step limits catch
logic bugs. If your agent needs more than 10 steps
for a task that should take 4, something is wrong. Do
not let it keep trying. Fail fast, log the state, and
route to a human.
Circuit Breakers
Monitor error rates across all agent runs. If more
than 20% of runs fail in a 5-minute window, trip the
circuit breaker and route all requests to a fallback
(usually a human queue or a simpler deterministic
workflow).
class CircuitBreaker {
  private failures = 0
  private total = 0
  private lastReset = Date.now()
  record(success: boolean) {
    // Reset the window before recording, so the current
    // result counts toward the fresh window.
    if (Date.now() - this.lastReset > 5 * 60_000) {
      this.failures = 0
      this.total = 0
      this.lastReset = Date.now()
    }
    this.total++
    if (!success) this.failures++
  }
  isOpen(): boolean {
    if (this.total < 10) return false
    return this.failures / this.total > 0.2
  }
}
This saved us during a third-party API outage. The
agent started failing on every order lookup, the
circuit breaker tripped after 10 runs, and requests
routed to a human queue within 90 seconds. Without
it, we would have burned through 10,000 failed
agent runs at $0.08 each -- $800 in wasted API calls
before anyone noticed.
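Wiring the breaker in is a small gate at the entry point. A self-contained sketch -- the breaker here is a condensed copy of the class above so the snippet runs standalone, and the handler names are hypothetical:

```typescript
// Condensed breaker so this sketch runs standalone; use the full
// windowed version from above in production.
class Breaker {
  private failures = 0
  private total = 0
  record(ok: boolean) { this.total++; if (!ok) this.failures++ }
  isOpen(): boolean { return this.total >= 10 && this.failures / this.total > 0.2 }
}

const breaker = new Breaker()

// Hypothetical entry point: skip the agent entirely while tripped.
async function handleRequest(runAgent: () => Promise<boolean>): Promise<string> {
  if (breaker.isOpen()) return 'human_queue'   // fail over, spend nothing
  const ok = await runAgent()
  breaker.record(ok)
  return ok ? 'completed' : 'failed'
}
```

The key property: once the breaker opens, requests stop costing agent money immediately, with zero LLM calls spent deciding to fail over.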
Human-in-the-Loop
Not every decision should be automated. I draw the
line based on two factors: reversibility and cost.
Irreversible actions get human approval. Refunds
over $100, account deletions, data exports. The
agent prepares the action, presents it to a human
operator with context, and waits for approval.
High-uncertainty decisions get human review. When
the agent's confidence score (extracted from the
validator step) drops below a threshold, it routes
to a human rather than guessing.
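Both factors fit in one small gate. A sketch, assuming the validator emits a confidence in [0, 1] -- the $100 cutoff comes from the rules above, while the 0.7 confidence floor is an assumed value to tune per workflow:

```typescript
// Two-factor gate: reversibility/cost, then validator confidence.
interface ProposedAction {
  irreversible: boolean      // deletions, exports, etc.
  amountUsd?: number         // monetary impact, if any
  confidence: number         // validator score, assumed in [0, 1]
}

const APPROVAL_AMOUNT_USD = 100   // refunds over $100 need a human
const CONFIDENCE_FLOOR = 0.7      // assumed threshold; tune per workflow

function needsHuman(a: ProposedAction): boolean {
  if (a.irreversible) return true
  if ((a.amountUsd ?? 0) > APPROVAL_AMOUNT_USD) return true
  return a.confidence < CONFIDENCE_FLOOR
}
```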
In LangGraph, human-in-the-loop is a checkpoint. The
graph pauses at a specific node and resumes when the
human responds.
const graph = new StateGraph(AgentState)
  .addNode('planner', plannerNode)
  .addNode('human_review', humanReviewNode)
  .addNode('executor', executorNode)
  .addEdge('__start__', 'planner')
  .addConditionalEdges('planner', (state) => {
    const needsApproval = state.plan.some(
      step => step.tool.requiresApproval
    )
    return needsApproval ? 'human_review' : 'executor'
  })
  .addEdge('human_review', 'executor')
  .compile({ checkpointer })
In practice, about 15% of our agent runs hit a
human checkpoint. That sounds like a lot, but those
are the 15% most likely to cause damage if they go
wrong. The other 85% run autonomously at an average
latency of 2.3 seconds.
Measuring What Matters
You cannot improve what you do not measure. Here are
the metrics I track on every production agent.
Task completion rate. The percentage of agent runs
that produce a valid, verified result. Our target is
92%. We are currently at 89% and climbing. The
remaining 11% route to humans, which is fine -- that
is the system working as designed.
Average steps per task. Tells you if the agent is
efficient or flailing. We target 3-5 steps. If the
average creeps above 6, we review the planner prompts
and tool schemas. Usually a vague tool description
is causing the agent to try multiple tools before
finding the right one.
Cost per task. We track this broken down by LLM
calls, tool API calls, and infrastructure. Current
numbers: $0.08 average, $0.42 P99. The P99 is high
because some tasks legitimately require more steps.
Latency (P50 and P99). P50 is 2.3 seconds, P99
is 8.1 seconds. Users tolerate up to 10 seconds for
complex tasks if you show progress indicators. Beyond
that, you need to go async.
Fallback rate. How often the agent punts to a
human. We target under 20%. If it goes higher, the
agent is not pulling its weight. If it drops below
5%, we are probably auto-approving things we should
not be.
interface AgentMetrics {
  runId: string
  workflow: string
  status: 'completed' | 'failed' | 'escalated'
  steps: number
  totalTokens: number
  costUsd: number
  latencyMs: number
  humanReviewRequired: boolean
  toolErrors: number
}
I send these to our observability stack (Datadog)
after every run. We have dashboards, alerts on cost
spikes, and weekly reviews of the worst-performing
runs.
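A sketch of assembling that record when a run finishes. The state fields mirror the graph state from earlier, the interface is re-declared so the snippet stands alone, and costUsd is left to your own rate table:

```typescript
// Re-declared from above so this sketch compiles standalone.
interface AgentMetrics {
  runId: string
  workflow: string
  status: 'completed' | 'failed' | 'escalated'
  steps: number
  totalTokens: number
  costUsd: number
  latencyMs: number
  humanReviewRequired: boolean
  toolErrors: number
}

// Hypothetical final-state slice; adapt to your graph's Annotation.
interface FinalState {
  status: 'done' | 'failed' | 'escalated'
  currentStep: number
  totalTokens: number
  results: { status: string }[]
}

function toMetrics(
  runId: string,
  workflow: string,
  state: FinalState,
  startedAtMs: number,
  humanReviewRequired: boolean,
  costUsd: number              // computed from your provider's rate table
): AgentMetrics {
  return {
    runId,
    workflow,
    status: state.status === 'done' ? 'completed' : state.status,
    steps: state.currentStep,
    totalTokens: state.totalTokens,
    costUsd,
    latencyMs: Date.now() - startedAtMs,
    humanReviewRequired,
    toolErrors: state.results.filter(r => r.status !== 'success').length
  }
}
```

Because it is a pure function of final state, the same code feeds the dashboard and the unit tests.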
The Playbook, Summarized
- Separate concerns. Router, planner, executor,
validator. Each stage has one job.
- Validate everything. Tool inputs, tool outputs,
final results. Zod schemas are your friend.
- Limit blast radius. Least-privilege tool sets,
step limits, token budgets, circuit breakers.
- Pause when uncertain. Human-in-the-loop for
irreversible actions and low-confidence decisions.
- Measure relentlessly. Completion rate, cost per
task, latency, fallback rate. Every run.
The agent that failed on that Friday night was
version 1. We are on version 4 now. It handles 8,000
requests per day at $0.08 average cost with a 91%
autonomous completion rate. The difference is not a
better model. It is better engineering around the
model.
Production agents are not about making the LLM
smarter. They are about making the system around the
LLM predictable, observable, and safe. The model is
the engine. The orchestration is the car.
Build the car.