Every external API you depend on will fail. This is not pessimism -- it is operational reality. Anthropic, OpenAI, Google, and every other LLM provider have experienced multi-hour outages. Rate limits will be hit during traffic spikes. Network partitions will sever connections. Models will be deprecated with insufficient migration time.
The question is not whether your AI system will face a failure. The question is whether your users will notice.
In hardware engineering, systems are designed for graceful degradation as a core requirement. A well-designed power system does not go from "fully operational" to "completely dark." It sheds non-critical loads, switches to backup power, dims non-essential lighting, and maintains life-safety systems. Each degradation step is designed, tested, and documented.
I architect AI systems with the same philosophy. Every failure scenario has a pre-planned response that maintains the most valuable functionality while shedding the least critical features.
I design every AI feature with a four-tier degradation hierarchy. The system moves down tiers automatically as failures accumulate, and recovers upward as services restore:
```
TIER 1: FULL CAPABILITY
├── Primary model available
├── All features active
├── Real-time responses
└── Full personalization
        ↓ (primary model timeout or error)
TIER 2: REDUCED CAPABILITY
├── Fallback model active
├── Core features only
├── Slightly higher latency
└── Standard (non-personalized) responses
        ↓ (all model providers unavailable)
TIER 3: CACHED/STATIC RESPONSES
├── Pre-computed answers for common queries
├── Template-based responses
├── No generative capability
└── "We're experiencing high demand" messaging
        ↓ (cache unavailable or query has no cached answer)
TIER 4: HUMAN ESCALATION
├── Queue to human agent
├── Self-service documentation links
├── Estimated wait time
└── Contact form fallback
```
The critical design principle: each tier is a complete, usable experience. Tier 3 is not an error page -- it is a deliberately designed experience that handles the most common user needs without any AI model availability.
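The tier-walking logic itself can be sketched as a small router that tries each tier in order. This is a minimal Python sketch, not my production code; the three handler callables are hypothetical stand-ins for your real primary client, fallback client, and degraded-mode cache:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class TierResult:
    tier: int       # Which tier produced the answer (1-4)
    response: str

def route_with_degradation(
    query: str,
    call_primary: Callable[[str], Optional[str]],
    call_fallback: Callable[[str], Optional[str]],
    cache_lookup: Callable[[str], Optional[str]],
) -> TierResult:
    """Walk the tiers in order; the first usable response wins."""
    for tier, handler in ((1, call_primary), (2, call_fallback), (3, cache_lookup)):
        try:
            response = handler(query)
        except Exception:
            response = None  # Any provider error means "this tier is unavailable"
        if response is not None:
            return TierResult(tier, response)
    # Tier 4: nothing automated can answer -- hand off to a human queue
    return TierResult(4, "Queued for a human agent.")
```

The important property is that every branch returns something a user can act on; there is no path that ends in a raw exception.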
A circuit breaker monitors the health of a dependency and automatically "trips" when failure rates exceed a threshold. It prevents the system from repeatedly calling a service that is down, which would waste time, accumulate costs, and create a poor user experience.
```
CIRCUIT BREAKER STATE MACHINE
═════════════════════════════

┌─────────┐  failure threshold  ┌─────────┐
│ CLOSED  │ ───────────────────►│  OPEN   │
│ (normal)│                     │(tripped)│
└────┬────┘                     └────┬────┘
     ▲                               │
     │                               │ cooldown timer expires
     │                               ▼
     │          ┌─────────┐          │
     │          │HALF-OPEN│◄─────────┘
     │          │ (probe) │
     │          └────┬────┘
     │               │
     └─── success ───┘

CLOSED:    All requests pass through normally.
           Failures are counted.

OPEN:      All requests are routed immediately to fallback.
           No calls are made to the failing service.
           The cooldown timer starts.

HALF-OPEN: After cooldown, one probe request is sent.
           If it succeeds, return to CLOSED.
           If it fails, return to OPEN.
```
Here is my production implementation:
```typescript
interface CircuitBreakerConfig {
  providerId: string
  failureThreshold?: number
  cooldownMs?: number
  monitorWindowMs?: number
}

// emit() publishes structured events to the observability pipeline
declare function emit(event: string, payload: Record<string, unknown>): void

class CircuitBreaker {
  private state: 'CLOSED' | 'OPEN' | 'HALF_OPEN' = 'CLOSED'
  private failureCount = 0
  private lastFailureTime = 0
  private readonly providerId: string
  private readonly failureThreshold: number
  private readonly cooldownMs: number
  private readonly monitorWindowMs: number

  constructor(config: CircuitBreakerConfig) {
    this.providerId = config.providerId
    this.failureThreshold = config.failureThreshold ?? 5
    this.cooldownMs = config.cooldownMs ?? 30_000
    this.monitorWindowMs = config.monitorWindowMs ?? 60_000
  }

  async execute<T>(
    primaryFn: () => Promise<T>,
    fallbackFn: () => Promise<T>
  ): Promise<T> {
    if (this.state === 'OPEN') {
      if (this.shouldProbe()) {
        this.state = 'HALF_OPEN'
        // Fall through to try the primary
      } else {
        return fallbackFn()
      }
    }

    try {
      const result = await primaryFn()
      this.onSuccess()
      return result
    } catch (error) {
      this.onFailure()
      return fallbackFn()
    }
  }

  private onSuccess(): void {
    this.failureCount = 0
    this.state = 'CLOSED'
  }

  private onFailure(): void {
    const now = Date.now()
    // Failures older than the monitor window do not accumulate
    if (now - this.lastFailureTime > this.monitorWindowMs) {
      this.failureCount = 0
    }
    this.failureCount++
    this.lastFailureTime = now

    // A failed probe reopens the breaker immediately
    if (this.failureCount >= this.failureThreshold || this.state === 'HALF_OPEN') {
      this.state = 'OPEN'
      emit('circuit_breaker.opened', {
        provider: this.providerId,
        failures: this.failureCount
      })
    }
  }

  private shouldProbe(): boolean {
    return Date.now() - this.lastFailureTime > this.cooldownMs
  }
}
```
These parameters are not one-size-fits-all. I tune them based on the provider and use case:
| Parameter | Low-latency UI | Batch processing | Background tasks |
|-----------|----------------|------------------|------------------|
| Failure threshold | 3 | 10 | 20 |
| Cooldown period | 15 seconds | 60 seconds | 5 minutes |
| Monitor window | 30 seconds | 5 minutes | 15 minutes |
| Timeout per call | 5 seconds | 30 seconds | 120 seconds |
A user-facing chatbot needs to fail fast (3 failures, 15-second cooldown). A batch processing pipeline can tolerate more failures before switching because each failure does not impact a waiting user.
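One way to keep these profiles consistent across services is to define them once as data. This is an illustrative sketch; the profile names and config keys mirror the table above, not any particular library's schema:

```python
# Tuned circuit breaker profiles, one per use case (illustrative names/keys)
BREAKER_PROFILES = {
    "low_latency_ui":   {"failure_threshold": 3,  "cooldown_s": 15,  "monitor_window_s": 30,  "call_timeout_s": 5},
    "batch_processing": {"failure_threshold": 10, "cooldown_s": 60,  "monitor_window_s": 300, "call_timeout_s": 30},
    "background_tasks": {"failure_threshold": 20, "cooldown_s": 300, "monitor_window_s": 900, "call_timeout_s": 120},
}

def profile_for(use_case: str, **overrides) -> dict:
    """Return a copy of a profile so call sites can tweak single knobs."""
    return {**BREAKER_PROFILES[use_case], **overrides}
```

Centralizing the numbers this way also makes a tuning change a one-line diff instead of a hunt across call sites.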
Not all failures warrant a retry. I categorize failures into three buckets:
Retryable failures: Network timeouts, 429 (rate limit), 503 (service unavailable). These are transient and likely to resolve. Retry with exponential backoff.
Non-retryable failures: 400 (bad request), 401 (auth failure), 404 (model not found). These will not resolve with a retry. Fail fast and escalate.
Ambiguous failures: 500 (internal server error), connection reset. Retry once, then failover to fallback provider.
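These three buckets can be sketched as a small helper. This is a minimal sketch of the `classify_error` idea the retry code below relies on; the `status_code` attribute is an assumption about how your HTTP client surfaces errors, so adapt the extraction to your stack:

```python
RETRYABLE = {429, 503, "timeout"}
NON_RETRYABLE = {400, 401, 404}

def classify_error(exc: Exception):
    """Map an exception to a status code or a symbolic label."""
    if isinstance(exc, TimeoutError):
        return "timeout"
    if isinstance(exc, ConnectionResetError):
        return "connection_reset"
    return getattr(exc, "status_code", "unknown")

def retry_policy(exc: Exception) -> str:
    """Pick a bucket: retry, fail fast, or retry once then fail over."""
    code = classify_error(exc)
    if code in RETRYABLE:
        return "retry_with_backoff"
    if code in NON_RETRYABLE:
        return "fail_fast"
    return "retry_once_then_failover"  # 500s, connection resets, unknowns
```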
```python
import asyncio
import logging
import random

log = logging.getLogger(__name__)

async def retry_with_backoff(
    fn,
    max_retries=3,
    base_delay=1.0,
    max_delay=30.0,
    retryable_errors=(429, 503, "timeout"),
):
    for attempt in range(max_retries + 1):
        try:
            return await fn()
        except Exception as e:
            error_type = classify_error(e)  # Maps exception -> status code or label
            if error_type not in retryable_errors:
                raise  # Non-retryable, fail immediately
            if attempt == max_retries:
                raise  # Exhausted retries
            delay = min(
                base_delay * (2 ** attempt) + random.uniform(0, 1),
                max_delay,
            )
            log.warning(
                f"Retry {attempt + 1}/{max_retries} "
                f"after {delay:.1f}s: {error_type}"
            )
            await asyncio.sleep(delay)
```
The jitter (random.uniform(0, 1)) is essential. Without it, multiple clients that fail simultaneously will all retry at the same time, creating a thundering herd that overwhelms the recovering service.
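The interaction of exponential growth, jitter, and the cap is easier to inspect in isolation. This hypothetical helper applies the same delay formula as `retry_with_backoff`:

```python
import random

def backoff_delays(max_retries=3, base_delay=1.0, max_delay=30.0, seed=None):
    """Compute the delay schedule only -- same formula as retry_with_backoff."""
    rng = random.Random(seed)
    return [
        min(base_delay * (2 ** attempt) + rng.uniform(0, 1), max_delay)
        for attempt in range(max_retries)
    ]
```

Without jitter the schedule would be exactly [1, 2, 4] seconds; with jitter each client lands somewhere in [1, 2), [2, 3), and [4, 5), which de-synchronizes their retries against the recovering service.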
Tier 3 degradation relies on having a cache of pre-computed responses for common queries. I build this cache proactively, not reactively:
```python
import hashlib
from typing import Optional

def query_hash(query: str) -> str:
    """Stable key for exact-match lookups."""
    return hashlib.sha256(query.strip().lower().encode()).hexdigest()

class DegradedModeCache:
    """Pre-computed responses for when all LLM providers
    are unavailable. Updated weekly from production traffic
    analysis."""

    def __init__(self):
        self.exact_cache = {}         # Exact query matches
        self.semantic_cache = None    # Embedding-based similarity index
        self.template_responses = {}  # Category-based templates

    def get_response(self, query: str) -> Optional[str]:
        # Try exact match first (fastest)
        key = query_hash(query)
        if key in self.exact_cache:
            return self.exact_cache[key]

        # Try semantic similarity (most accurate)
        if self.semantic_cache:
            match = self.semantic_cache.find_similar(
                query, threshold=0.92
            )
            if match:
                return match.response

        # Fall back to a category template
        category = self.classify_query(query)  # Lightweight local classifier
        if category in self.template_responses:
            return self.template_responses[category]

        return None  # Cannot serve -- escalate to Tier 4
```
I populate this cache by analyzing the top 500 most common queries from production traffic weekly. For most B2B applications, this covers 60-80% of incoming queries. Your users get an answer, and they likely will not even notice the AI is running in degraded mode.
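The weekly refresh can be sketched as a small batch job. Here `query_logs_top_n` and `generate_canonical_answer` are hypothetical stand-ins for your analytics store and your (still-healthy) primary model client:

```python
def normalize(query: str) -> str:
    # Cheap normalization so trivially different phrasings share one entry
    return " ".join(query.lower().split())

def build_degraded_mode_cache(query_logs_top_n, generate_canonical_answer, n=500):
    """Precompute answers for the top-n production queries while providers
    are healthy, so Tier 3 has something to serve later."""
    cache = {}
    for query in query_logs_top_n(n):
        answer = generate_canonical_answer(query)
        if answer:  # Skip queries the model could not answer cleanly
            cache[normalize(query)] = answer
    return cache
```

The key point is that this job runs while everything is healthy; a cache you try to build during an outage is no cache at all.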
You cannot trust a degradation path you have never tested. I run monthly "failure drills" that deliberately trigger each degradation tier:

- Block the primary model's endpoint and confirm the fallback model takes over (Tier 2)
- Block all model providers and confirm cached and template responses are served (Tier 3)
- Disable the cache as well and confirm queries escalate cleanly to the human queue (Tier 4)
- Restore each dependency and confirm the system recovers upward automatically

Document the results. Fix the gaps. Run it again next month.
The four-tier hierarchy is designed for customer-facing, revenue-critical AI systems. Not every deployment justifies the full investment:
| System type | Recommended tiers | Why |
|-------------|-------------------|-----|
| Customer-facing product (always-on expectation) | All four tiers | Users expect availability. Downtime is revenue loss and trust erosion. |
| Internal tool (business hours, tolerant users) | Tier 1 + Tier 2 + clear error messaging | A friendly "service unavailable, try again in 10 minutes" is often sufficient. |
| Batch processing pipeline | Tier 1 + retry queue + alerting | Failed items can be reprocessed. Build a dead-letter queue, not a real-time fallback. |
| Prototype or experiment | None -- just log the errors | Invest in degradation architecture when the system earns production status. |
The irreversible decision here is choosing not to build Tier 3 (cached responses) for a customer-facing product. If your provider has a multi-hour outage and you have no cached fallback, your product is down for hours. That trust cost is not recoverable by deploying the cache afterward.
Before considering your degradation architecture production-ready, verify:

- Every external dependency sits behind a circuit breaker with thresholds tuned to the use case
- Failures are classified as retryable, non-retryable, or ambiguous, with exponential backoff and jitter on retries
- The degraded-mode cache is populated from real production traffic and refreshed on a schedule
- Each tier has been exercised in a failure drill, including recovery back to full capability
The best AI systems are not the ones that never fail. They are the ones where failure is invisible to the user.
Your system now handles failures gracefully. But how do you know when something is degrading before it fails? The next lesson builds the observability stack -- traces, metrics, and evaluations -- that gives you visibility into the health, cost, and quality of every AI interaction in real time.