Building AI Agents That Actually Work
I shipped my first AI agent in late 2024. It answered
support tickets, looked up order data, and issued refunds.
The demo was flawless. Three test cases, three perfect
runs. Leadership signed off. We deployed.
Within 72 hours, the agent had entered an infinite loop
on a malformed order ID, burned through $400 in API
calls, and attempted to refund an order that did not
exist. We pulled the plug on a Friday night.
That failure taught me more about agent design than any
tutorial ever did. The problem was not the LLM. It was
the orchestration -- or rather, the complete lack of it.
Why Most Agent Demos Fail in Production
The typical agent tutorial looks like this: give the
model a list of tools, wrap it in a while loop, and let
it decide what to do. This is the ReAct pattern, and it
works beautifully on toy problems.
Production is not a toy problem.
Unbounded loops. Without a step limit, the agent
will happily call the same tool 47 times trying to
parse a response it does not understand. I have seen
this happen. The logs were 12,000 lines long.
Tool abuse. Give an agent access to a database
query tool and a delete tool, and eventually it will
decide that deleting a record is the fastest way to
"resolve" a discrepancy. The model is optimizing for
task completion, not for your business rules.
Cost spiraling. Each agent step is an LLM call.
A 5-step task at $0.03 per call costs $0.15. A
15-step retry loop costs $0.45. Multiply by 10,000
daily requests and you are burning $4,500/day on a
feature that was supposed to save money.
No observability. When the agent makes a bad
decision at step 7 of 12, you need to know why. Most
agent frameworks give you a final answer and nothing
else. Good luck debugging that in production.
The Architecture That Works
After rebuilding that first agent (and two more after
it), I settled on a four-stage pipeline. Every
production agent I have shipped since follows this
pattern.
Router → Planner → Executor → Validator
Router. Classifies the incoming request and
decides which agent workflow handles it. This is a
cheap, fast LLM call with structured output. No tools,
no loops. Just classification. If the request does not
match any known workflow, it routes to a human.
Planner. Takes the classified request and produces
a step-by-step plan. The plan is a typed array of
actions -- not free-form text. Each action specifies
which tool to call, what arguments to pass, and what
the expected output shape looks like. The planner
never executes anything. It only plans.
Executor. Walks through the plan one step at a
time. Calls tools, captures results, handles errors.
If a step fails, the executor does not improvise. It
either retries with exponential backoff or escalates
to the validator.
Validator. Reviews the executor's output against
the original request. Did we actually answer the
question? Does the result make sense? If not, the
validator can send it back to the planner for a
revised plan -- but only once. No infinite
replanning loops.
This separation matters because it gives you control
surfaces. You can swap the planner model without
touching execution. You can add a human checkpoint
between planning and execution. You can cache plans
for identical requests.
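To make "typed array of actions" concrete, here is a minimal sketch of what a plan entry might look like -- the field names are illustrative, not a fixed schema:

```typescript
// Illustrative shape only -- the real type depends on your tool registry.
interface ActionStep {
  tool: string                      // name of a registered tool
  args: Record<string, unknown>     // checked later against the tool's input schema
  expectedOutput: string            // what the planner thinks comes back
  maxRetries?: number               // per-step executor override
}

// A plan is an ordered array of typed actions -- never free-form text.
const refundPlan: ActionStep[] = [
  {
    tool: 'lookupOrder',
    args: { orderId: 'ORD-00001234' },
    expectedOutput: 'order record with status and total'
  },
  {
    tool: 'issueRefund',
    args: { orderId: 'ORD-00001234' },
    expectedOutput: 'refund confirmation id'
  }
]
```

Because the plan is data, not prose, you can diff it, cache it, and show it to a human before anything executes.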
Tool Calling Done Right
Tools are where agents become dangerous. Every tool
is an action in the real world -- an API call, a
database write, a notification sent. Treat them like
API endpoints, not like function arguments.
Schema Validation
Every tool gets a Zod schema. Inputs are validated
before execution. Outputs are validated after.
import { z } from 'zod'
const lookupOrderSchema = {
  input: z.object({
    orderId: z.string()
      .regex(/^ORD-\d{8}$/, 'Invalid order ID format'),
    fields: z.array(
      z.enum(['status', 'total', 'items', 'shipping'])
    ).optional()
  }),
  output: z.object({
    orderId: z.string(),
    status: z.enum([
      'pending', 'shipped', 'delivered', 'cancelled'
    ]),
    total: z.number(),
    items: z.array(z.object({
      name: z.string(),
      quantity: z.number()
    }))
  })
}
When the agent tries to call lookupOrder with
orderId: "just check all orders", the schema
rejects it before it hits your database. No
ambiguity. No creative interpretation.
Least Privilege
Not every agent workflow needs every tool. The
refund agent gets lookupOrder and issueRefund.
It does not get deleteOrder or modifyInventory.
I define tool sets per workflow, not per agent.
const workflowTools: Record<string, Tool[]> = {
  'refund': [lookupOrder, issueRefund, notifyCustomer],
  'status-check': [lookupOrder, getShipmentTracking],
  'escalation': [lookupOrder, createTicket, notifyAgent]
}
This is the same principle as IAM roles. You would
not give a read-only service account write access to
your database. Do not give a status-check workflow
access to mutation tools.
Error Boundaries
Tools fail. APIs time out. Databases return
unexpected nulls. Every tool call gets wrapped in an
error boundary that captures the failure, classifies
it, and returns a structured error to the agent.
async function executeToolSafe(
  tool: Tool,
  input: unknown
): Promise<ToolResult> {
  const parsed = tool.schema.input.safeParse(input)
  if (!parsed.success) {
    return {
      status: 'validation_error',
      error: parsed.error.format(),
      retryable: false
    }
  }
  try {
    const result = await Promise.race([
      tool.execute(parsed.data),
      timeout(tool.timeoutMs ?? 5000)
    ])
    const output = tool.schema.output.safeParse(result)
    if (!output.success) {
      return {
        status: 'output_validation_error',
        error: output.error.format(),
        retryable: true
      }
    }
    return { status: 'success', data: output.data }
  } catch (err) {
    return {
      status: 'execution_error',
      error: String(err),
      retryable: isRetryable(err)
    }
  }
}
The agent never sees a raw exception. It sees a
structured result with a status code, an error
description, and a retryable flag. This keeps the
LLM from hallucinating recovery strategies.
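executeToolSafe leans on a few small helpers that are not shown above (and the executor later uses a sleep for backoff). Sketches of what they might look like -- the retryable heuristic in particular is an assumption you should tune per API:

```typescript
// Rejects after `ms` milliseconds; raced against the tool call.
function timeout(ms: number): Promise<never> {
  return new Promise((_, reject) =>
    setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms)
  )
}

// Promisified setTimeout for the executor's exponential backoff.
function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms))
}

// Heuristic: transient failures (timeouts, resets, 5xx) are worth a
// retry; anything that looks like bad input or a 4xx is not.
function isRetryable(err: unknown): boolean {
  const msg = String(err).toLowerCase()
  return ['timed out', 'timeout', 'econnreset', '502', '503', '504']
    .some(marker => msg.includes(marker))
}
```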
LangGraph Patterns for Multi-Step Workflows
I use LangGraph for agent orchestration because it
makes the state machine explicit. You can see the
graph. You can test individual nodes. You can replay
from any checkpoint.
Here is the router-planner-executor pattern as a
LangGraph workflow:
import { StateGraph, Annotation } from '@langchain/langgraph'

const AgentState = Annotation.Root({
  messages: Annotation<BaseMessage[]>({
    reducer: (a, b) => [...a, ...b],
    default: () => []
  }),
  plan: Annotation<ActionStep[]>({
    default: () => []
  }),
  currentStep: Annotation<number>({
    default: () => 0
  }),
  results: Annotation<ToolResult[]>({
    reducer: (a, b) => [...a, ...b],
    default: () => []
  }),
  totalTokens: Annotation<number>({
    default: () => 0
  }),
  status: Annotation<
    'routing' | 'planning' | 'executing' |
    'validating' | 'done' | 'failed'
  >({
    default: () => 'routing' as const
  })
})

const graph = new StateGraph(AgentState)
  .addNode('router', routerNode)
  .addNode('planner', plannerNode)
  .addNode('executor', executorNode)
  .addNode('validator', validatorNode)
  .addEdge('__start__', 'router')
  .addConditionalEdges('router', routeDecision)
  .addEdge('planner', 'executor')
  .addConditionalEdges('executor', executionDecision)
  .addConditionalEdges('validator', validationDecision)
  .compile()
The conditional edges are where the logic lives.
executionDecision checks: did the current step
succeed? Is there a next step? Have we hit the step
limit? validationDecision checks: did the output
pass validation? Have we already replanned once?
Each node is a pure function that takes state and
returns a partial state update. Testing is
straightforward -- pass in a state, assert on the
output. No mocking LLM calls for unit tests on your
orchestration logic.
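A sketch of what those two decision functions could look like, assuming the state shape above plus a step limit and a replan counter (the replanCount field and the node-name return values are assumptions -- match them to your own graph):

```typescript
const MAX_STEPS = 10
const MAX_REPLANS = 1

// Hypothetical slice of graph state -- just the fields the edges need.
interface EdgeState {
  status: string
  currentStep: number
  planLength: number
  replanCount: number   // assumed counter; not in the Annotation above
}

// After each executor tick: plan finished -> validate; limits hit -> stop.
function executionDecision(s: EdgeState): 'executor' | 'validator' | '__end__' {
  if (s.status === 'failed') return '__end__'
  if (s.currentStep >= MAX_STEPS) return '__end__'
  return s.currentStep >= s.planLength ? 'validator' : 'executor'
}

// After validation: pass -> done; fail -> replan exactly once, then stop.
function validationDecision(s: EdgeState): 'planner' | '__end__' {
  if (s.status === 'done') return '__end__'
  return s.replanCount < MAX_REPLANS ? 'planner' : '__end__'
}
```

Both are pure functions of state, which is what makes them trivially unit-testable.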
Handling Retries
I implement retries at the executor level, not the
graph level. The executor tracks attempt counts per
step and applies exponential backoff.
async function executorNode(
  state: typeof AgentState.State
) {
  const step = state.plan[state.currentStep]
  const maxRetries = step.maxRetries ?? 3
  let attempt = 0
  while (attempt < maxRetries) {
    const result = await executeToolSafe(
      step.tool,
      step.args
    )
    if (result.status === 'success') {
      return {
        results: [result],
        currentStep: state.currentStep + 1,
        status: state.currentStep + 1 >= state.plan.length
          ? 'validating' : 'executing'
      }
    }
    if (!result.retryable) break
    attempt++
    await sleep(Math.pow(2, attempt) * 1000)
  }
  return {
    status: 'failed',
    results: [{
      status: 'max_retries_exceeded',
      step: state.currentStep
    }]
  }
}
Cost Controls
Without cost controls, agents are a blank check. I
learned this the hard way with that $400 Friday night
incident. Every production agent needs three
safeguards.
Token Budgets
Set a maximum token budget per agent run. Track
cumulative usage across all LLM calls in the run.
Kill the run when it hits the ceiling.
const COST_LIMITS = {
  maxTokensPerRun: 50_000,
  maxStepsPerRun: 10,
  maxCostPerRun: 0.50 // USD
}

function checkBudget(
  state: typeof AgentState.State
): 'continue' | 'budget_exceeded' {
  if (state.totalTokens > COST_LIMITS.maxTokensPerRun)
    return 'budget_exceeded'
  if (state.currentStep > COST_LIMITS.maxStepsPerRun)
    return 'budget_exceeded'
  return 'continue'
}
In production, my agents average 3-5 steps and
8,000-15,000 tokens per run. That works out to about
$0.04-0.12 per agent run with Claude Sonnet. The
ceiling at 50,000 tokens catches runaway loops
without cutting off legitimate complex tasks.
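If you want to sanity-check those numbers yourself, the conversion from tokens to dollars is one small function. The per-million-token rates below are assumptions for illustration -- check your provider's current pricing:

```typescript
// Assumed illustrative rates in USD per million tokens -- not official pricing.
const RATES = { inputPerM: 3.0, outputPerM: 15.0 }

function estimateCostUsd(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1_000_000) * RATES.inputPerM +
         (outputTokens / 1_000_000) * RATES.outputPerM
}

// e.g. a run with 10,000 input and 1,500 output tokens:
// 10,000/1M * 3.0 + 1,500/1M * 15.0 = 0.03 + 0.0225 = $0.0525
```

Tracking this per run is what lets checkBudget enforce a dollar ceiling, not just a token ceiling.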
Step Limits
Token budgets catch cost overruns. Step limits catch
logic bugs. If your agent needs more than 10 steps
for a task that should take 4, something is wrong. Do
not let it keep trying. Fail fast, log the state, and
route to a human.
Circuit Breakers
Monitor error rates across all agent runs. If more
than 20% of runs fail in a 5-minute window, trip the
circuit breaker and route all requests to a fallback
(usually a human queue or a simpler deterministic
workflow).
class CircuitBreaker {
  private failures = 0
  private total = 0
  private lastReset = Date.now()
  record(success: boolean) {
    // Reset the window before recording, so the current
    // result counts toward the fresh window.
    if (Date.now() - this.lastReset > 5 * 60_000) {
      this.failures = 0
      this.total = 0
      this.lastReset = Date.now()
    }
    this.total++
    if (!success) this.failures++
  }
  isOpen(): boolean {
    if (this.total < 10) return false
    return this.failures / this.total > 0.2
  }
}
This saved us during a third-party API outage. The
agent started failing on every order lookup, the
circuit breaker tripped after 10 runs, and requests
routed to a human queue within 90 seconds. Without
it, we would have burned through 10,000 failed
agent runs at $0.08 each -- $800 in wasted API calls
before anyone noticed.
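Wiring the breaker in is a small gate at the entry point. A self-contained sketch -- the breaker here is a condensed copy of the class above so the snippet runs standalone, and the handler names are hypothetical:

```typescript
// Condensed breaker so this sketch runs standalone; use the full
// windowed version from above in production.
class Breaker {
  private failures = 0
  private total = 0
  record(ok: boolean) { this.total++; if (!ok) this.failures++ }
  isOpen(): boolean { return this.total >= 10 && this.failures / this.total > 0.2 }
}

const breaker = new Breaker()

// Hypothetical entry point: skip the agent entirely while tripped.
async function handleRequest(runAgent: () => Promise<boolean>): Promise<string> {
  if (breaker.isOpen()) return 'human_queue'   // fail over, spend nothing
  const ok = await runAgent()
  breaker.record(ok)
  return ok ? 'completed' : 'failed'
}
```

The key property: once the breaker opens, requests stop costing agent money immediately, with zero LLM calls spent deciding to fail over.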
Human-in-the-Loop
Not every decision should be automated. I draw the
line based on two factors: reversibility and cost.
Irreversible actions get human approval. Refunds
over $100, account deletions, data exports. The
agent prepares the action, presents it to a human
operator with context, and waits for approval.
High-uncertainty decisions get human review. When
the agent's confidence score (extracted from the
validator step) drops below a threshold, it routes
to a human rather than guessing.
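Both factors fit in one small gate. A sketch, assuming the validator emits a confidence in [0, 1] -- the $100 cutoff comes from the rules above, while the 0.7 confidence floor is an assumed value to tune per workflow:

```typescript
// Two-factor gate: reversibility/cost, then validator confidence.
interface ProposedAction {
  irreversible: boolean      // deletions, exports, etc.
  amountUsd?: number         // monetary impact, if any
  confidence: number         // validator score, assumed in [0, 1]
}

const APPROVAL_AMOUNT_USD = 100   // refunds over $100 need a human
const CONFIDENCE_FLOOR = 0.7      // assumed threshold; tune per workflow

function needsHuman(a: ProposedAction): boolean {
  if (a.irreversible) return true
  if ((a.amountUsd ?? 0) > APPROVAL_AMOUNT_USD) return true
  return a.confidence < CONFIDENCE_FLOOR
}
```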
In LangGraph, human-in-the-loop is a checkpoint. The
graph pauses at a specific node and resumes when the
human responds.
const graph = new StateGraph(AgentState)
  .addNode('planner', plannerNode)
  .addNode('human_review', humanReviewNode)
  .addNode('executor', executorNode)
  .addEdge('__start__', 'planner')
  .addConditionalEdges('planner', (state) => {
    const needsApproval = state.plan.some(
      step => step.tool.requiresApproval
    )
    return needsApproval ? 'human_review' : 'executor'
  })
  .addEdge('human_review', 'executor')
  .compile({ checkpointer })
In practice, about 15% of our agent runs hit a
human checkpoint. That sounds like a lot, but those
are the 15% most likely to cause damage if they go
wrong. The other 85% run autonomously at an average
latency of 2.3 seconds.
Measuring What Matters
You cannot improve what you do not measure. Here are
the metrics I track on every production agent.
Task completion rate. The percentage of agent runs
that produce a valid, verified result. Our target is
92%. We are currently at 89% and climbing. The
remaining 11% route to humans, which is fine -- that
is the system working as designed.
Average steps per task. Tells you if the agent is
efficient or flailing. We target 3-5 steps. If the
average creeps above 6, we review the planner prompts
and tool schemas. Usually a vague tool description
is causing the agent to try multiple tools before
finding the right one.
Cost per task. We track this broken down by LLM
calls, tool API calls, and infrastructure. Current
numbers: $0.08 average, $0.42 P99. The P99 is high
because some tasks legitimately require more steps.
Latency (P50 and P99). P50 is 2.3 seconds, P99
is 8.1 seconds. Users tolerate up to 10 seconds for
complex tasks if you show progress indicators. Beyond
that, you need to go async.
Fallback rate. How often the agent punts to a
human. We target under 20%. If it goes higher, the
agent is not pulling its weight. If it drops below
5%, we are probably auto-approving things we should
not be.
interface AgentMetrics {
  runId: string
  workflow: string
  status: 'completed' | 'failed' | 'escalated'
  steps: number
  totalTokens: number
  costUsd: number
  latencyMs: number
  humanReviewRequired: boolean
  toolErrors: number
}
I send these to our observability stack (Datadog)
after every run. We have dashboards, alerts on cost
spikes, and weekly reviews of the worst-performing
runs.
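A sketch of assembling that record when a run finishes. The state fields mirror the graph state from earlier, the interface is re-declared so the snippet stands alone, and costUsd is left to your own rate table:

```typescript
// Re-declared from above so this sketch compiles standalone.
interface AgentMetrics {
  runId: string
  workflow: string
  status: 'completed' | 'failed' | 'escalated'
  steps: number
  totalTokens: number
  costUsd: number
  latencyMs: number
  humanReviewRequired: boolean
  toolErrors: number
}

// Hypothetical final-state slice; adapt to your graph's Annotation.
interface FinalState {
  status: 'done' | 'failed' | 'escalated'
  currentStep: number
  totalTokens: number
  results: { status: string }[]
}

function toMetrics(
  runId: string,
  workflow: string,
  state: FinalState,
  startedAtMs: number,
  humanReviewRequired: boolean,
  costUsd: number              // computed from your provider's rate table
): AgentMetrics {
  return {
    runId,
    workflow,
    status: state.status === 'done' ? 'completed' : state.status,
    steps: state.currentStep,
    totalTokens: state.totalTokens,
    costUsd,
    latencyMs: Date.now() - startedAtMs,
    humanReviewRequired,
    toolErrors: state.results.filter(r => r.status !== 'success').length
  }
}
```

Because it is a pure function of final state, the same code feeds the dashboard and the unit tests.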
The Playbook, Summarized
- Separate concerns. Router, planner, executor,
validator. Each stage has one job.
- Validate everything. Tool inputs, tool outputs,
final results. Zod schemas are your friend.
- Limit blast radius. Least-privilege tool sets,
step limits, token budgets, circuit breakers.
- Pause when uncertain. Human-in-the-loop for
irreversible actions and low-confidence decisions.
- Measure relentlessly. Completion rate, cost per
task, latency, fallback rate. Every run.
The agent that failed on that Friday night was
version 1. We are on version 4 now. It handles 8,000
requests per day at $0.08 average cost with a 91%
autonomous completion rate. The difference is not a
better model. It is better engineering around the
model.
Production agents are not about making the LLM
smarter. They are about making the system around the
LLM predictable, observable, and safe. The model is
the engine. The orchestration is the car.
Build the car.