Here is the scenario that inspired this course's tagline. It is Friday at 3 PM. You have a fix for a production issue. The question is: do you deploy?
In most organizations running AI systems, the answer is "wait until Monday." The team lacks confidence that a deployment will not introduce a regression, that they will detect it if it does, or that they can roll back quickly.
This is an operational maturity failure. The infrastructure from previous lessons -- guardrails, circuit breakers, observability, vendor off-ramps -- is necessary but not sufficient. What turns infrastructure into confidence is documentation: runbooks, decision records, and release checklists that make deployments boring.
In production engineering, the most reliable systems are not the ones with the best hardware. They are the ones with the best documentation. The engineer who maintains the system at 3 AM is not the one who designed it. They need documents that assume no prior context.
A runbook is a step-by-step procedure for handling a specific operational scenario. It is written for the engineer who has been woken up at 3 AM, is operating on limited sleep, and needs to resolve an issue without breaking something else.
Every runbook in my systems follows this template:
# RUNBOOK: [Scenario Name]
Last updated: 2026-02-25
Owner: @celestino
## Symptoms
What does this look like? What alerts fire?
What do users report?
## Impact
What is affected? What is the blast radius?
What is the severity? (P1/P2/P3/P4)
## Diagnosis Steps
1. Check [specific dashboard URL]
2. Run [specific command]
3. Look for [specific pattern in logs]
## Resolution Steps
### Option A: [Most common fix]
1. Step-by-step instructions
2. With exact commands
3. And expected outputs
### Option B: [Alternative fix]
1. If Option A did not resolve the issue
2. Different approach
## Rollback Procedure
1. How to undo the resolution
2. If it made things worse
## Escalation
- If unresolved after 30 minutes: page @team-lead
- If customer-impacting for 1+ hour: notify @support-lead
- If cost impact > $X: notify @engineering-manager
## Post-Incident
- Create incident report
- Update this runbook if steps were unclear
Every production AI system I build ships with at least three runbooks:
1. Primary LLM Provider Outage. Symptoms: circuit breaker OPEN alert, error rate spike. Diagnosis: check provider status page, verify fallback is receiving traffic. Resolution: circuit breaker handles automatic failover; if fallback is also degraded, enable cached response mode via config flag and monitor until provider restores.
2. Cost Spike Alert. Symptoms: hourly cost exceeds 2x threshold. Diagnosis: identify which model and use case is spiking, check for retry storms, traffic spikes, or prompt injection inflating tokens. Resolution: tighten circuit breaker sensitivity for retry storms, enable aggressive model routing for traffic spikes, enable strict input validation for injection attacks.
3. Quality Degradation Detected. Symptoms: weekly eval scores declined, user feedback shifted negative. Diagnosis: sample 20 recent low-scoring traces, check if model version, system prompt, or RAG corpus changed. Resolution: pin to previous model version or revert prompt changes, run full eval suite, document findings in an ADR.
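The detection step in the cost-spike runbook above can be sketched in a few lines. This is a minimal illustration, not production monitoring code; the function and field names (`detect_cost_spike`, `CostAlert`) are hypothetical, and the 2x multiplier mirrors the threshold named in the runbook.

```python
from dataclasses import dataclass

@dataclass
class CostAlert:
    model: str
    hourly_cost: float
    baseline: float

def detect_cost_spike(hourly_costs: dict[str, float],
                      baselines: dict[str, float],
                      multiplier: float = 2.0) -> list[CostAlert]:
    """Flag any model whose hourly spend exceeds multiplier x its baseline.

    hourly_costs: observed spend per model over the last hour.
    baselines:    expected per-hour spend per model.
    """
    alerts = []
    for model, cost in hourly_costs.items():
        baseline = baselines.get(model, 0.0)
        if baseline > 0 and cost > multiplier * baseline:
            alerts.append(CostAlert(model, cost, baseline))
    return alerts
```

A check like this, run hourly per model rather than on the aggregate bill, is what lets the diagnosis step point at "which model and use case is spiking" immediately.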
An Architecture Decision Record (ADR) captures a significant technical decision, its context, the alternatives considered, and the consequences. It is the document that prevents the new engineer from asking "why did we do it this way?" and getting the answer "nobody remembers."
For AI systems, ADRs are especially critical because the landscape shifts rapidly. A decision that made sense six months ago may need revisiting, and the ADR tells you whether the original constraints still apply.
I use a modified version of the Michael Nygard format, extended with AI-specific fields:
# ADR-[NUMBER]: [Decision Title]
Date: 2026-02-25
Status: Accepted | Superseded by ADR-XX | Deprecated
## Context
What forces and constraints are at play?
## Decision
What is the decision? Be specific.
## Alternatives Considered
For each: pros, cons, estimated cost.
## Consequences
Positive, negative, and risks.
What triggers a revisit of this decision?
## AI-Specific Fields
- Models affected: [list]
- Cost impact: [estimate]
- Quality impact: [eval baseline reference]
- Vendor dependency change: [yes/no]
- Review date: [when to revisit]
These are the decisions that every production AI system must document:
ADR-001: Primary Model Selection. Why this model over alternatives. Cost comparison, quality benchmarks, and the conditions that would trigger a switch.
ADR-002: Vendor Off-Ramp Strategy. The gateway architecture, fallback chain, and tested provider alternatives.
ADR-003: Guardrail Configuration. What guardrails are active, their thresholds, and the incidents that informed each one.
ADR-004: Cost Architecture. Model routing tiers, caching strategy, budget limits, and the unit economics model.
ADR-005: Observability Stack. Tool selection, metric definitions, alert thresholds, and the evaluation cadence.
Each of these ADRs has a review date. I revisit them quarterly, because the AI landscape changes faster than most decision assumptions.
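The quarterly review cadence is easy to automate. Here is a minimal sketch, assuming you track a last-review date per ADR; the function name `adrs_due_for_review` and the 90-day default are illustrative, not a prescribed tool.

```python
from datetime import date, timedelta

def adrs_due_for_review(adrs: dict[str, date],
                        today: date,
                        interval_days: int = 90) -> list[str]:
    """Return ADR ids whose last review is older than the interval.

    adrs: maps an ADR id (e.g. "ADR-001") to its last review date.
    The 90-day default approximates the quarterly cadence.
    """
    return sorted(adr_id for adr_id, last_review in adrs.items()
                  if (today - last_review) > timedelta(days=interval_days))
```

Wiring this into a weekly CI job or chat reminder keeps review dates from silently slipping.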
The goal of a release checklist is to make deployments routine. Not exciting, not nerve-wracking -- boring. Boring deployments are safe deployments.
Here is the checklist I use for AI system releases:
## AI System Release Checklist
### Pre-Deploy
- [ ] All eval suites pass (quality scores >= baseline)
- [ ] Cost estimate reviewed (no unexpected token increase)
- [ ] Prompt changes tested against all provider adapters
- [ ] Guardrail test suite passes (including adversarial tests)
- [ ] Rollback procedure documented and tested
- [ ] On-call engineer identified and briefed
### Deploy
- [ ] Deploy to staging environment
- [ ] Run smoke tests (5 representative queries)
- [ ] Check observability dashboard (no anomalies)
- [ ] Deploy to production (canary: 5% traffic)
- [ ] Monitor for 15 minutes:
- Error rate stable
- Latency within bounds
- Cost per interaction within bounds
- No guardrail spike
- [ ] Promote to 100% traffic
- [ ] Monitor for 30 minutes at full traffic
### Post-Deploy
- [ ] Verify all dashboard metrics normal
- [ ] Run automated eval on production traffic sample
- [ ] Update ADR if this deploy changes architecture decisions
- [ ] Update runbooks if this deploy changes operational procedures
- [ ] Notify team of successful deployment
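The first pre-deploy item (eval scores >= baseline) is the easiest to enforce mechanically as a deploy gate. A minimal sketch, with hypothetical names (`predeploy_gate`); real pipelines would pull scores and baselines from your eval tooling rather than plain dicts:

```python
def predeploy_gate(eval_scores: dict[str, float],
                   baselines: dict[str, float]) -> tuple[bool, list[str]]:
    """Pass only if every eval suite meets or beats its baseline.

    Returns (ok, failing_suites). A suite with no recorded score
    counts as failing, so a broken eval run blocks the deploy.
    """
    failures = [suite for suite, baseline in baselines.items()
                if eval_scores.get(suite, 0.0) < baseline]
    return (not failures, failures)
```

Treating a missing score as a failure is deliberate: a deploy should never pass because the eval suite silently did not run.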
The canary deployment (5% traffic) is non-negotiable for AI systems. Unlike traditional software where a bug produces an error, an AI regression produces subtly wrong outputs that only become visible at scale. The canary gives you a detection window.
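One common way to implement the 5% canary split is deterministic hashing of a request or user id, so the same caller consistently lands in the same cohort. This is a sketch of that approach, not the only option (load balancers and feature-flag services do this too); `in_canary` is a hypothetical name.

```python
import hashlib

def in_canary(request_id: str, percent: int = 5) -> bool:
    """Deterministically route ~percent% of traffic to the canary.

    Hashing the id (rather than random sampling) keeps the split
    stable across retries, so one user sees one model version.
    """
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return bucket < percent
```

Promotion to 100% is then a single config change (`percent = 100`), which keeps the rollback path equally simple.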
Every production AI system I architect ships with an operations manual containing six sections.
This manual is a living document. Every incident updates the relevant runbook. Every architectural change produces an ADR. Every deployment follows the checklist.
Here is how you know your operational maturity is sufficient: can you deploy on Friday afternoon and go home without anxiety?
If yes, your infrastructure and documentation have earned that confidence. This is not recklessness; it is systems engineering discipline. It is the difference between a prototype and a product.
Before declaring your AI system operationally mature, verify that the runbooks, ADRs, and release checklist above exist, are current, and have been exercised in practice.
This is where Hardened AI lives -- not in the model selection, not in the prompt engineering, but in the operational discipline that makes everything sustainable. Systems thinking, all the way down.
Over eight lessons, you have built a complete architectural playbook for production AI systems.
The thread connecting every lesson is the same: architecture is about the decisions you can reverse and the ones you cannot. The patterns in this course make more decisions reversible and protect you when they are not.
The teams that apply this discipline deploy on Fridays, sleep through the night, and iterate faster than teams that skip it -- because confidence is a force multiplier. That is the promise of Hardened AI, and now you have the playbook to deliver it.