Every external API you depend on will fail. This is not pessimism -- it is operational reality. Anthropic, OpenAI, Google, and every other LLM provider have experienced multi-hour outages. Rate limits will be hit during traffic spikes. Network partitions will sever connections. Models will be deprecated with insufficient migration time.
The question is not whether your AI system will face a failure. The question is whether your users will notice.
In hardware engineering, systems are designed for graceful degradation as a core requirement. A well-designed power system does not go from "fully operational" to "completely dark." It sheds non-critical loads, switches to backup power, dims non-essential lighting, and maintains life-safety systems. Each degradation step is designed, tested, and documented.
I architect AI systems with the same philosophy. Every failure scenario has a pre-planned response that maintains the most valuable functionality while shedding the least critical features.
I design every AI feature with a four-tier degradation hierarchy. The system moves down tiers automatically as failures accumulate, and recovers upward as services restore:
```
TIER 1: FULL CAPABILITY
├── Primary model available
├── All features active
├── Real-time responses
└── Full personalization
        ↓ (primary model timeout or error)
TIER 2: REDUCED CAPABILITY
├── Fallback model active
├── Core features only
├── Slightly higher latency
└── Standard (non-personalized) responses
        ↓ (all model providers unavailable)
TIER 3: CACHED/STATIC RESPONSES
├── Pre-computed answers for common queries
├── Template-based responses
├── No generative capability
└── "We're experiencing high demand" messaging
        ↓ (cache unavailable or query has no cached answer)
TIER 4: HUMAN ESCALATION
├── Queue to human agent
├── Self-service documentation links
├── Estimated wait time
└── Contact form fallback
```
The critical design principle: each tier is a complete, usable experience. Tier 3 is not an error page -- it is a deliberately designed experience that handles the most common user needs without any AI model availability.
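The tier-walking logic itself can be sketched as a small router that tries each tier in order. This is a minimal Python sketch, not my production code; the three handler callables are hypothetical stand-ins for your real primary client, fallback client, and degraded-mode cache:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class TierResult:
    tier: int       # Which tier produced the answer (1-4)
    response: str

def route_with_degradation(
    query: str,
    call_primary: Callable[[str], Optional[str]],
    call_fallback: Callable[[str], Optional[str]],
    cache_lookup: Callable[[str], Optional[str]],
) -> TierResult:
    """Walk the tiers in order; the first usable response wins."""
    for tier, handler in ((1, call_primary), (2, call_fallback), (3, cache_lookup)):
        try:
            response = handler(query)
        except Exception:
            response = None  # Any provider error means "this tier is unavailable"
        if response is not None:
            return TierResult(tier, response)
    # Tier 4: nothing automated can answer -- hand off to a human queue
    return TierResult(4, "Queued for a human agent.")
```

The important property is that every branch returns something a user can act on; there is no path that ends in a raw exception.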
A circuit breaker monitors the health of a dependency and automatically "trips" when failure rates exceed a threshold. It prevents the system from repeatedly calling a service that is down, which would waste time, accumulate costs, and create a poor user experience.
```
CIRCUIT BREAKER STATE MACHINE
═════════════════════════════

┌─────────┐  failure threshold  ┌─────────┐
│ CLOSED  │ ───────────────────►│  OPEN   │
│ (normal)│                     │(tripped)│
└────┬────┘                     └────┬────┘
     ▲                               │
     │                               │ cooldown timer expires
     │                               ▼
     │          ┌─────────┐          │
     │          │HALF-OPEN│◄─────────┘
     │          │ (probe) │
     │          └────┬────┘
     │               │
     └─── success ───┘

CLOSED:    All requests pass through normally.
           Failures are counted.

OPEN:      All requests are routed immediately to fallback.
           No calls are made to the failing service.
           The cooldown timer starts.

HALF-OPEN: After cooldown, one probe request is sent.
           If it succeeds, return to CLOSED.
           If it fails, return to OPEN.
```
Here is my production implementation:
```typescript
interface CircuitBreakerConfig {
  providerId: string
  failureThreshold?: number
  cooldownMs?: number
  monitorWindowMs?: number
}

// emit() publishes structured events to the observability pipeline
declare function emit(event: string, payload: Record<string, unknown>): void

class CircuitBreaker {
  private state: 'CLOSED' | 'OPEN' | 'HALF_OPEN' = 'CLOSED'
  private failureCount = 0
  private lastFailureTime = 0
  private readonly providerId: string
  private readonly failureThreshold: number
  private readonly cooldownMs: number
  private readonly monitorWindowMs: number

  constructor(config: CircuitBreakerConfig) {
    this.providerId = config.providerId
    this.failureThreshold = config.failureThreshold ?? 5
    this.cooldownMs = config.cooldownMs ?? 30_000
    this.monitorWindowMs = config.monitorWindowMs ?? 60_000
  }

  async execute<T>(
    primaryFn: () => Promise<T>,
    fallbackFn: () => Promise<T>
  ): Promise<T> {
    if (this.state === 'OPEN') {
      if (this.shouldProbe()) {
        this.state = 'HALF_OPEN'
        // Fall through to try the primary
      } else {
        return fallbackFn()
      }
    }

    try {
      const result = await primaryFn()
      this.onSuccess()
      return result
    } catch (error) {
      this.onFailure()
      return fallbackFn()
    }
  }

  private onSuccess(): void {
    this.failureCount = 0
    this.state = 'CLOSED'
  }

  private onFailure(): void {
    const now = Date.now()
    // Failures older than the monitor window do not accumulate
    if (now - this.lastFailureTime > this.monitorWindowMs) {
      this.failureCount = 0
    }
    this.failureCount++
    this.lastFailureTime = now

    // A failed probe reopens the breaker immediately
    if (this.failureCount >= this.failureThreshold || this.state === 'HALF_OPEN') {
      this.state = 'OPEN'
      emit('circuit_breaker.opened', {
        provider: this.providerId,
        failures: this.failureCount
      })
    }
  }

  private shouldProbe(): boolean {
    return Date.now() - this.lastFailureTime > this.cooldownMs
  }
}
```
These parameters are not one-size-fits-all. I tune them based on the provider and use case:
| Parameter | Low-latency UI | Batch processing | Background tasks |
|-----------|----------------|------------------|------------------|
| Failure threshold | 3 | 10 | 20 |
| Cooldown period | 15 seconds | 60 seconds | 5 minutes |
| Monitor window | 30 seconds | 5 minutes | 15 minutes |
| Timeout per call | 5 seconds | 30 seconds | 120 seconds |
A user-facing chatbot needs to fail fast (3 failures, 15-second cooldown). A batch processing pipeline can tolerate more failures before switching because each failure does not impact a waiting user.
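One way to keep these profiles consistent across services is to define them once as data. This is an illustrative sketch; the profile names and config keys mirror the table above, not any particular library's schema:

```python
# Tuned circuit breaker profiles, one per use case (illustrative names/keys)
BREAKER_PROFILES = {
    "low_latency_ui":   {"failure_threshold": 3,  "cooldown_s": 15,  "monitor_window_s": 30,  "call_timeout_s": 5},
    "batch_processing": {"failure_threshold": 10, "cooldown_s": 60,  "monitor_window_s": 300, "call_timeout_s": 30},
    "background_tasks": {"failure_threshold": 20, "cooldown_s": 300, "monitor_window_s": 900, "call_timeout_s": 120},
}

def profile_for(use_case: str, **overrides) -> dict:
    """Return a copy of a profile so call sites can tweak single knobs."""
    return {**BREAKER_PROFILES[use_case], **overrides}
```

Centralizing the numbers this way also makes a tuning change a one-line diff instead of a hunt across call sites.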
Not all failures warrant a retry. I categorize failures into three buckets:
Retryable failures: Network timeouts, 429 (rate limit), 503 (service unavailable). These are transient and likely to resolve. Retry with exponential backoff.
Non-retryable failures: 400 (bad request), 401 (auth failure), 404 (model not found). These will not resolve with a retry. Fail fast and escalate.
Ambiguous failures: 500 (internal server error), connection reset. Retry once, then failover to fallback provider.
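These three buckets can be sketched as a small helper. This is a minimal sketch of the `classify_error` idea the retry code below relies on; the `status_code` attribute is an assumption about how your HTTP client surfaces errors, so adapt the extraction to your stack:

```python
RETRYABLE = {429, 503, "timeout"}
NON_RETRYABLE = {400, 401, 404}

def classify_error(exc: Exception):
    """Map an exception to a status code or a symbolic label."""
    if isinstance(exc, TimeoutError):
        return "timeout"
    if isinstance(exc, ConnectionResetError):
        return "connection_reset"
    return getattr(exc, "status_code", "unknown")

def retry_policy(exc: Exception) -> str:
    """Pick a bucket: retry, fail fast, or retry once then fail over."""
    code = classify_error(exc)
    if code in RETRYABLE:
        return "retry_with_backoff"
    if code in NON_RETRYABLE:
        return "fail_fast"
    return "retry_once_then_failover"  # 500s, connection resets, unknowns
```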
```python
import asyncio
import logging
import random

log = logging.getLogger(__name__)

async def retry_with_backoff(
    fn,
    max_retries=3,
    base_delay=1.0,
    max_delay=30.0,
    retryable_errors=(429, 503, "timeout"),
):
    for attempt in range(max_retries + 1):
        try:
            return await fn()
        except Exception as e:
            error_type = classify_error(e)  # Maps exception -> status code or label
            if error_type not in retryable_errors:
                raise  # Non-retryable, fail immediately
            if attempt == max_retries:
                raise  # Exhausted retries
            delay = min(
                base_delay * (2 ** attempt) + random.uniform(0, 1),
                max_delay,
            )
            log.warning(
                f"Retry {attempt + 1}/{max_retries} "
                f"after {delay:.1f}s: {error_type}"
            )
            await asyncio.sleep(delay)
```
The jitter (random.uniform(0, 1)) is essential. Without it, multiple clients that fail simultaneously will all retry at the same time, creating a thundering herd that overwhelms the recovering service.
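The interaction of exponential growth, jitter, and the cap is easier to inspect in isolation. This hypothetical helper applies the same delay formula as `retry_with_backoff`:

```python
import random

def backoff_delays(max_retries=3, base_delay=1.0, max_delay=30.0, seed=None):
    """Compute the delay schedule only -- same formula as retry_with_backoff."""
    rng = random.Random(seed)
    return [
        min(base_delay * (2 ** attempt) + rng.uniform(0, 1), max_delay)
        for attempt in range(max_retries)
    ]
```

Without jitter the schedule would be exactly [1, 2, 4] seconds; with jitter each client lands somewhere in [1, 2), [2, 3), and [4, 5), which de-synchronizes their retries against the recovering service.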
Tier 3 degradation relies on having a cache of pre-computed responses for common queries. I build this cache proactively, not reactively:
```python
import hashlib
from typing import Optional

def query_hash(query: str) -> str:
    """Stable key for exact-match lookups."""
    return hashlib.sha256(query.strip().lower().encode()).hexdigest()

class DegradedModeCache:
    """Pre-computed responses for when all LLM providers
    are unavailable. Updated weekly from production traffic
    analysis."""

    def __init__(self):
        self.exact_cache = {}         # Exact query matches
        self.semantic_cache = None    # Embedding-based similarity index
        self.template_responses = {}  # Category-based templates

    def get_response(self, query: str) -> Optional[str]:
        # Try exact match first (fastest)
        key = query_hash(query)
        if key in self.exact_cache:
            return self.exact_cache[key]

        # Try semantic similarity (most accurate)
        if self.semantic_cache:
            match = self.semantic_cache.find_similar(
                query, threshold=0.92
            )
            if match:
                return match.response

        # Fall back to a category template
        category = self.classify_query(query)  # Lightweight local classifier
        if category in self.template_responses:
            return self.template_responses[category]

        return None  # Cannot serve -- escalate to Tier 4
```
I populate this cache by analyzing the top 500 most common queries from production traffic weekly. For most B2B applications, this covers 60-80% of incoming queries. Your users get an answer, and they likely will not even notice the AI is running in degraded mode.
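The weekly refresh can be sketched as a small batch job. Here `query_logs_top_n` and `generate_canonical_answer` are hypothetical stand-ins for your analytics store and your (still-healthy) primary model client:

```python
def normalize(query: str) -> str:
    # Cheap normalization so trivially different phrasings share one entry
    return " ".join(query.lower().split())

def build_degraded_mode_cache(query_logs_top_n, generate_canonical_answer, n=500):
    """Precompute answers for the top-n production queries while providers
    are healthy, so Tier 3 has something to serve later."""
    cache = {}
    for query in query_logs_top_n(n):
        answer = generate_canonical_answer(query)
        if answer:  # Skip queries the model could not answer cleanly
            cache[normalize(query)] = answer
    return cache
```

The key point is that this job runs while everything is healthy; a cache you try to build during an outage is no cache at all.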
You cannot trust a degradation path you have never tested. I run monthly "failure drills" that deliberately trigger each degradation tier:

- Block the primary model's endpoint and confirm the fallback model takes over (Tier 2)
- Block all model providers and confirm cached and template responses are served (Tier 3)
- Disable the cache as well and confirm queries escalate cleanly to the human queue (Tier 4)
- Restore each dependency and confirm the system recovers upward automatically

Document the results. Fix the gaps. Run it again next month.
The four-tier hierarchy is designed for customer-facing, revenue-critical AI systems. Not every deployment justifies the full investment:
| System type | Recommended tiers | Why |
|-------------|-------------------|-----|
| Customer-facing product (always-on expectation) | All four tiers | Users expect availability. Downtime is revenue loss and trust erosion. |
| Internal tool (business hours, tolerant users) | Tier 1 + Tier 2 + clear error messaging | A friendly "service unavailable, try again in 10 minutes" is often sufficient. |
| Batch processing pipeline | Tier 1 + retry queue + alerting | Failed items can be reprocessed. Build a dead-letter queue, not a real-time fallback. |
| Prototype or experiment | None -- just log the errors | Invest in degradation architecture when the system earns production status. |
The irreversible decision here is choosing not to build Tier 3 (cached responses) for a customer-facing product. If your provider has a multi-hour outage and you have no cached fallback, your product is down for hours. That trust cost is not recoverable by deploying the cache afterward.
Before considering your degradation architecture production-ready, verify:

- Every external dependency sits behind a circuit breaker with thresholds tuned to the use case
- Failures are classified as retryable, non-retryable, or ambiguous, with exponential backoff and jitter on retries
- The degraded-mode cache is populated from real production traffic and refreshed on a schedule
- Each tier has been exercised in a failure drill, including recovery back to full capability
The best AI systems are not the ones that never fail. They are the ones where failure is invisible to the user.
Your system now handles failures gracefully. But how do you know when something is degrading before it fails? The next lesson builds the observability stack -- traces, metrics, and evaluations -- that gives you visibility into the health, cost, and quality of every AI interaction in real time.