Here is the scenario that inspired this course's tagline. It is Friday at 3 PM. You have a fix for a production issue. The question is: do you deploy?
In most organizations running AI systems, the answer is "wait until Monday." The team lacks confidence that a deployment will not introduce a regression, that they will detect it if it does, or that they can roll back quickly.
This is an operational maturity failure. The infrastructure from previous lessons -- guardrails, circuit breakers, observability, vendor off-ramps -- is necessary but not sufficient. What turns infrastructure into confidence is documentation: runbooks, decision records, and release checklists that make deployments boring.
In production engineering, the most reliable systems are not the ones with the best hardware. They are the ones with the best documentation. The engineer who maintains the system at 3 AM is not the one who designed it. They need documents that assume no prior context.
A runbook is a step-by-step procedure for handling a specific operational scenario. It is written for the engineer who has been woken up at 3 AM, is operating on limited sleep, and needs to resolve an issue without breaking something else.
Every runbook in my systems follows this template:
# RUNBOOK: [Scenario Name]
Last updated: 2026-02-25
Owner: @celestino
## Symptoms
What does this look like? What alerts fire?
What do users report?
## Impact
What is affected? What is the blast radius?
What is the severity? (P1/P2/P3/P4)
## Diagnosis Steps
1. Check [specific dashboard URL]
2. Run [specific command]
3. Look for [specific pattern in logs]
## Resolution Steps
### Option A: [Most common fix]
1. Step-by-step instructions
2. With exact commands
3. And expected outputs
### Option B: [Alternative fix]
1. If Option A did not resolve the issue
2. Different approach
## Rollback Procedure
1. How to undo the resolution
2. If it made things worse
## Escalation
- If unresolved after 30 minutes: page @team-lead
- If customer-impacting for 1+ hour: notify @support-lead
- If cost impact > $X: notify @engineering-manager
## Post-Incident
- Create incident report
- Update this runbook if steps were unclear
Every production AI system I build ships with at least three runbooks:
1. Primary LLM Provider Outage. Symptoms: circuit breaker OPEN alert, error rate spike. Diagnosis: check provider status page, verify fallback is receiving traffic. Resolution: circuit breaker handles automatic failover; if fallback is also degraded, enable cached response mode via config flag and monitor until provider restores.
2. Cost Spike Alert. Symptoms: hourly cost exceeds 2x threshold. Diagnosis: identify which model and use case is spiking, check for retry storms, traffic spikes, or prompt injection inflating tokens. Resolution: tighten circuit breaker sensitivity for retry storms, enable aggressive model routing for traffic spikes, enable strict input validation for injection attacks.
3. Quality Degradation Detected. Symptoms: weekly eval scores declined, user feedback shifted negative. Diagnosis: sample 20 recent low-scoring traces, check if model version, system prompt, or RAG corpus changed. Resolution: pin to previous model version or revert prompt changes, run full eval suite, document findings in an ADR.
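The detection step in the cost-spike runbook above can be sketched in a few lines. This is a minimal illustration, not production monitoring code; the function and field names (`detect_cost_spike`, `CostAlert`) are hypothetical, and the 2x multiplier mirrors the threshold named in the runbook.

```python
from dataclasses import dataclass

@dataclass
class CostAlert:
    model: str
    hourly_cost: float
    baseline: float

def detect_cost_spike(hourly_costs: dict[str, float],
                      baselines: dict[str, float],
                      multiplier: float = 2.0) -> list[CostAlert]:
    """Flag any model whose hourly spend exceeds multiplier x its baseline.

    hourly_costs: observed spend per model over the last hour.
    baselines:    expected per-hour spend per model.
    """
    alerts = []
    for model, cost in hourly_costs.items():
        baseline = baselines.get(model, 0.0)
        if baseline > 0 and cost > multiplier * baseline:
            alerts.append(CostAlert(model, cost, baseline))
    return alerts
```

A check like this, run hourly per model rather than on the aggregate bill, is what lets the diagnosis step point at "which model and use case is spiking" immediately.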
An Architecture Decision Record (ADR) captures a significant technical decision, its context, the alternatives considered, and the consequences. It is the document that prevents the new engineer from asking "why did we do it this way?" and getting the answer "nobody remembers."
For AI systems, ADRs are especially critical because the landscape shifts rapidly. A decision that made sense six months ago may need revisiting, and the ADR tells you whether the original constraints still apply.
I use a modified version of the Michael Nygard format, extended with AI-specific fields:
# ADR-[NUMBER]: [Decision Title]
Date: 2026-02-25
Status: Accepted | Superseded by ADR-XX | Deprecated
## Context
What forces and constraints are at play?
## Decision
What is the decision? Be specific.
## Alternatives Considered
For each: pros, cons, estimated cost.
## Consequences
Positive, negative, and risks.
What triggers a revisit of this decision?
## AI-Specific Fields
- Models affected: [list]
- Cost impact: [estimate]
- Quality impact: [eval baseline reference]
- Vendor dependency change: [yes/no]
- Review date: [when to revisit]
These are the decisions that every production AI system must document:
ADR-001: Primary Model Selection. Why this model over alternatives. Cost comparison, quality benchmarks, and the conditions that would trigger a switch.
ADR-002: Vendor Off-Ramp Strategy. The gateway architecture, fallback chain, and tested provider alternatives.
ADR-003: Guardrail Configuration. What guardrails are active, their thresholds, and the incidents that informed each one.
ADR-004: Cost Architecture. Model routing tiers, caching strategy, budget limits, and the unit economics model.
ADR-005: Observability Stack. Tool selection, metric definitions, alert thresholds, and the evaluation cadence.
Each of these ADRs has a review date. I revisit them quarterly, because the AI landscape changes faster than most decision assumptions.
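The quarterly review cadence is easy to automate. Here is a minimal sketch, assuming you track a last-review date per ADR; the function name `adrs_due_for_review` and the 90-day default are illustrative, not a prescribed tool.

```python
from datetime import date, timedelta

def adrs_due_for_review(adrs: dict[str, date],
                        today: date,
                        interval_days: int = 90) -> list[str]:
    """Return ADR ids whose last review is older than the interval.

    adrs: maps an ADR id (e.g. "ADR-001") to its last review date.
    The 90-day default approximates the quarterly cadence.
    """
    return sorted(adr_id for adr_id, last_review in adrs.items()
                  if (today - last_review) > timedelta(days=interval_days))
```

Wiring this into a weekly CI job or chat reminder keeps review dates from silently slipping.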
The goal of a release checklist is to make deployments routine. Not exciting, not nerve-wracking -- boring. Boring deployments are safe deployments.
Here is the checklist I use for AI system releases:
## AI System Release Checklist
### Pre-Deploy
- [ ] All eval suites pass (quality scores >= baseline)
- [ ] Cost estimate reviewed (no unexpected token increase)
- [ ] Prompt changes tested against all provider adapters
- [ ] Guardrail test suite passes (including adversarial tests)
- [ ] Rollback procedure documented and tested
- [ ] On-call engineer identified and briefed
### Deploy
- [ ] Deploy to staging environment
- [ ] Run smoke tests (5 representative queries)
- [ ] Check observability dashboard (no anomalies)
- [ ] Deploy to production (canary: 5% traffic)
- [ ] Monitor for 15 minutes:
- Error rate stable
- Latency within bounds
- Cost per interaction within bounds
- No guardrail spike
- [ ] Promote to 100% traffic
- [ ] Monitor for 30 minutes at full traffic
### Post-Deploy
- [ ] Verify all dashboard metrics normal
- [ ] Run automated eval on production traffic sample
- [ ] Update ADR if this deploy changes architecture decisions
- [ ] Update runbooks if this deploy changes operational procedures
- [ ] Notify team of successful deployment
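The first pre-deploy item (eval scores >= baseline) is the easiest to enforce mechanically as a deploy gate. A minimal sketch, with hypothetical names (`predeploy_gate`); real pipelines would pull scores and baselines from your eval tooling rather than plain dicts:

```python
def predeploy_gate(eval_scores: dict[str, float],
                   baselines: dict[str, float]) -> tuple[bool, list[str]]:
    """Pass only if every eval suite meets or beats its baseline.

    Returns (ok, failing_suites). A suite with no recorded score
    counts as failing, so a broken eval run blocks the deploy.
    """
    failures = [suite for suite, baseline in baselines.items()
                if eval_scores.get(suite, 0.0) < baseline]
    return (not failures, failures)
```

Treating a missing score as a failure is deliberate: a deploy should never pass because the eval suite silently did not run.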
The canary deployment (5% traffic) is non-negotiable for AI systems. Unlike traditional software where a bug produces an error, an AI regression produces subtly wrong outputs that only become visible at scale. The canary gives you a detection window.
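One common way to implement the 5% canary split is deterministic hashing of a request or user id, so the same caller consistently lands in the same cohort. This is a sketch of that approach, not the only option (load balancers and feature-flag services do this too); `in_canary` is a hypothetical name.

```python
import hashlib

def in_canary(request_id: str, percent: int = 5) -> bool:
    """Deterministically route ~percent% of traffic to the canary.

    Hashing the id (rather than random sampling) keeps the split
    stable across retries, so one user sees one model version.
    """
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return bucket < percent
```

Promotion to 100% is then a single config change (`percent = 100`), which keeps the rollback path equally simple.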
Every production AI system I architect ships with an operations manual containing six sections.
This manual is a living document. Every incident updates the relevant runbook. Every architectural change produces an ADR. Every deployment follows the checklist.
Here is how you know your operational maturity is sufficient: can you deploy on Friday afternoon and go home without anxiety?
If yes, your infrastructure and documentation have earned that confidence. This is not recklessness; it is systems engineering discipline. It is the difference between a prototype and a product.
Before declaring your AI system operationally mature, verify that the runbooks, ADRs, and release checklist above exist, are current, and have been exercised in practice.
This is where Hardened AI lives -- not in the model selection, not in the prompt engineering, but in the operational discipline that makes everything sustainable. Systems thinking, all the way down.
Over eight lessons, you have built a complete architectural playbook for production AI systems.
The thread connecting every lesson is the same: architecture is about the decisions you can reverse and the ones you cannot. The patterns in this course make more decisions reversible and protect you when they are not.
The teams that apply this discipline deploy on Fridays, sleep through the night, and iterate faster than teams that skip it -- because confidence is a force multiplier. That is the promise of Hardened AI, and now you have the playbook to deliver it.