Systems Thinking for AI Engineers
Why the biggest risk in AI isn't the model — it's everything around it. How a production engineering mindset turns fragile demos into hardened production systems.
Software is fragile. Systems are robust.
I keep coming back to that line. It captures something that the AI industry still hasn't internalized, even as we race to ship agents, copilots, and retrieval pipelines into production.
Here is the pattern I see repeated: an engineer builds a prototype. The LLM is impressive. The demo lands. Stakeholders nod. The team pushes to production. And then — the API times out on a Thursday night. The model hallucinates a legal citation. The monthly bill arrives at three times the forecast. The system doesn't fail gracefully. It just fails.
The problem was never the model. The problem was that nobody designed the system.
I have spent years thinking about why this keeps happening, and I believe the answer is deceptively simple: most AI engineers think in features, not in systems. They optimize the prompt but ignore the fallback. They benchmark the model but never test what happens when the model is unavailable. They celebrate the happy path and never map the failure modes.
This essay is about a different way of thinking. One that I learned from 8+ years of shipping software at scale — where uptime is non-negotiable and "it works on my machine" is not a deployment strategy.
The Hardware Engineering Lens
The best analogy I have found for AI systems comes from hardware engineering — a world where components overheat, signals degrade, and power supplies fluctuate. Hardware engineering teaches that every component in a system is trying to fail. Your job as an engineer is to design the system so that when individual parts fail — and they will — the whole thing keeps working.
That mindset has shaped everything about how I approach AI systems. Here are three analogies borrowed from hardware that I apply every day:
Voltage Regulators are Guardrails. A voltage regulator takes an unpredictable input voltage — noisy, fluctuating, sometimes spiking — and clamps it to a stable output range. Without one, your downstream components fry. LLM guardrails do exactly the same thing. They take the unpredictable output of a language model — sometimes brilliant, sometimes confabulated, occasionally toxic — and constrain it to an acceptable range. Both accept variable input, both produce bounded output, and both dissipate the excess. A voltage regulator sheds extra energy as heat. A guardrail sheds hallucinated content as rejected tokens. And critically, both have a design limit. Push past it, and the protection fails. Knowing that limit is what separates engineering from guesswork.
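To make the analogy concrete, here is a minimal guardrail sketch. Everything in it is illustrative (the length limit, the banned markers, the fallback string) rather than any particular guardrail library; the point is the shape: variable input in, bounded output out, excess shed.

```python
# A minimal guardrail sketch: like a voltage regulator, it accepts
# unpredictable input and clamps it to a bounded, safe output.
# The limits and markers below are illustrative, not from a real library.

MAX_LENGTH = 500                              # the guardrail's "design limit"
BANNED_MARKERS = ("As an AI", "I cannot verify")

def regulate(raw_output: str, fallback: str = "Sorry, no reliable answer.") -> str:
    """Clamp an LLM response into an acceptable range, or shed it."""
    text = raw_output.strip()
    if not text or any(marker in text for marker in BANNED_MARKERS):
        return fallback                       # shed the excess, like heat
    return text[:MAX_LENGTH]                  # hard bound on output size
```

Note the design limit is explicit and testable: anything past `MAX_LENGTH` is cut, and knowing that number is what separates engineering from guesswork.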
Signal-to-Noise Ratio is Hallucination Rate. In signal processing, SNR measures how much useful signal exists relative to background noise. A high SNR means the information is clean and reliable. A low SNR means you are hearing static more than substance. Every AI system has its own SNR. The "signal" is factually grounded, contextually relevant output. The "noise" is hallucinations, irrelevant tangents, and confabulated details. Better retrieval improves the signal. Better prompts filter the noise. But here is the part most people miss: you can also reduce noise at the source by constraining the input. In hardware, you would use a bandpass filter to eliminate frequencies outside your range of interest. In AI, you constrain the context window to only the most relevant documents. Same principle. Different medium.
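The bandpass-filter idea can be sketched in a few lines: admit only documents whose relevance score clears a threshold, then cap how many enter the context window. The scores, threshold, and cap here are illustrative values, not tuned recommendations.

```python
# A bandpass-filter analogue for context construction: pass only
# documents inside the "frequency band" of relevance, best-first.
# The threshold and cap are illustrative, not tuned values.

def filter_context(scored_docs: list[tuple[str, float]],
                   min_score: float = 0.75,
                   max_docs: int = 3) -> list[str]:
    """Keep only high-relevance documents, sorted best-first."""
    in_band = [(doc, score) for doc, score in scored_docs if score >= min_score]
    in_band.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in in_band[:max_docs]]
```

Everything the filter rejects is noise that never reaches the model, which means it can never be confabulated into the answer.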
Circuit Breakers are Fallback Patterns. A physical circuit breaker trips when current exceeds a safe threshold. It sacrifices availability of a single circuit to protect the building from fire. Software circuit breakers do the same: when an API's error rate crosses a threshold, the breaker trips, the system stops calling the failing service, and a fallback takes over. This prevents a single failing component from cascading through the entire system. The principle is simple but easy to forget: one unprotected failure can cascade and take out everything downstream. Every external dependency in my AI systems gets a circuit breaker now. Every single one.
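A software circuit breaker is small enough to show in full. This is a simplified sketch (thresholds, reset window, and the half-open behavior are all illustrative choices), but it captures the essential state machine: count failures, trip open, short-circuit to the fallback, and probe the service again after a cool-down.

```python
import time

class CircuitBreaker:
    """Trips open after `threshold` consecutive failures; while open,
    calls go straight to the fallback instead of the failing service.
    A simplified sketch of the pattern, not a production library."""

    def __init__(self, threshold: int = 5, reset_after: float = 30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None          # None means the breaker is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()      # open: don't even touch the service
            self.opened_at = None      # half-open: allow one real attempt
            self.failures = 0
        try:
            result = fn()
            self.failures = 0          # success resets the failure count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            return fallback()
```

The crucial property is that while the breaker is open, the failing dependency receives zero traffic, which is what stops a backed-up queue of doomed requests from cascading downstream.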
The Five Properties of a Hardened System
Through building and breaking enough systems, I have arrived at five properties that separate fragile software from hardened infrastructure. These are not theoretical. They are the properties I evaluate in every production system I touch.
1. Redundancy. No single point of failure. If your entire AI feature depends on one API from one provider, you don't have a system — you have a bet. Redundancy means multiple LLM providers with automatic failover. It means cached embeddings for your most common queries so that when the embedding service goes dark, 60% of traffic is still served. It means your retrieval layer can fall back from semantic search to keyword search without the user ever seeing an error page.
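Provider failover plus a cache-of-last-resort can be expressed in a few lines. The provider callables and the cache here are stand-ins, not any real SDK; the shape is what matters: try each provider in priority order, and only fail once every layer of redundancy is exhausted.

```python
# A sketch of redundancy: try each configured provider in priority
# order, then serve from a cache of common answers as a last resort.
# Providers and the cache are illustrative stand-ins, not a real SDK.

def complete_with_failover(prompt: str, providers: list, cache: dict) -> str:
    """Try providers in order; fall back to cached answers if all fail."""
    for provider in providers:
        try:
            return provider(prompt)
        except Exception:
            continue                 # this provider is down; try the next
    if prompt in cache:
        return cache[prompt]         # stale but serviceable
    raise RuntimeError("all providers and the cache failed")
```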
2. Defined Failure States. Every component must have a known, tested failure mode. Not "it might crash" — but "when this component returns a 503, the system will respond with X." I document failure states the way datasheets document operating limits. If you cannot tell me exactly what happens when your LLM provider returns a 429, your system is not ready for production.
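One way to make failure states first-class is to write them down as code rather than tribal knowledge. The status codes and action names below are illustrative placeholders; the point is that every failure mode maps to an explicit, testable response, including the unknown ones.

```python
# Failure states as a datasheet in code: each status an upstream LLM
# API can return maps to a documented, testable response.
# The action names are illustrative placeholders.

FAILURE_STATES = {
    429: "backoff_and_retry",           # rate limited: exponential backoff
    503: "trip_breaker_use_fallback",   # provider outage: serve degraded results
    400: "log_and_alert",               # our request is malformed: it's a bug
}

def response_for(status_code: int) -> str:
    """Every failure mode has a known answer; 'unknown' is itself a state."""
    return FAILURE_STATES.get(status_code, "fail_closed_with_static_response")
```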
3. Observability. You cannot fix what you cannot see. You cannot degrade gracefully if you cannot detect failure. This means logging latency, token usage, and error rates per request. It means alerts for cost anomalies, not just error spikes. It means being able to replay a failed request from your logs to understand exactly where the pipeline broke. Observability is not a feature you add later. It is the foundation you build on first.
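A sketch of what per-request observability can look like: wrap the model call so every request, success or failure, leaves behind a structured record with latency and outcome. The token count here is a crude word-count stand-in for a real tokenizer, and the log is just a list; both are illustrative.

```python
import json
import time

def observe(fn, prompt: str, log: list) -> str:
    """Wrap a model call so every request logs latency, a token estimate,
    and its outcome -- enough to replay and diagnose failures later.
    Word count stands in for real tokenization; the log is a plain list."""
    record = {"prompt": prompt, "start": time.time()}
    try:
        result = fn(prompt)
        record.update(status="ok", tokens=len(result.split()))
        return result
    except Exception as exc:
        record.update(status="error", error=str(exc))
        raise                                    # observe, don't swallow
    finally:
        record["latency_s"] = round(time.time() - record.pop("start"), 3)
        log.append(json.dumps(record))           # structured, replayable line
```

Because the record is emitted in `finally`, the failed requests, which are the ones you most need to replay, are logged too.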
4. Graceful Degradation. When something breaks, the system gets worse — not broken. This is the difference between "search results are slightly less relevant right now" and "500 Internal Server Error." Graceful degradation requires that you have already thought through the reduced-capability modes. What does the feature look like without the AI component? If the answer is "it doesn't exist," then you have a fragility problem. Every AI feature I build has a non-AI fallback, even if it is just a static response or a redirect to a human.
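The degradation path can be structured as a ladder of tiers, each worse but still functional. The tier names below mirror the essay's example (semantic search, keyword search, static response); the functions themselves are illustrative stubs, and the contact address is a placeholder.

```python
# A degradation ladder: walk the tiers best-first, and drop one rung
# per failure instead of crashing. Tier functions are illustrative stubs.

def answer(query: str, tiers: list) -> tuple[str, str]:
    """Return (tier_name, result) from the best tier that still works."""
    for name, fn in tiers:
        try:
            return name, fn(query)
        except Exception:
            continue                 # degrade one rung, don't go dark
    # The bottom rung is static and cannot fail. Placeholder contact address.
    return "static", "Our assistant is unavailable. Contact support@example.com."
```

Returning the tier name alongside the result is deliberate: it lets the UI say "results may be less precise right now," a defined failure state communicated honestly.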
5. Cost Awareness. This is the property most engineers ignore, and it is the one that kills the most projects. Unit economics are a system property, not a business concern. If your cost-per-request doubles at scale, your system has a design flaw. I track cost the same way I track latency: per request, with alerts on anomalies, with clear budgets per feature. I have seen teams build impressive AI features that were quietly burning through five figures a month because nobody put a cost ceiling on the token consumption. A system without cost awareness is a system waiting to be shut down by finance.
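A cost ceiling is only a few lines of code once you decide to build it. The per-token price and budget below are illustrative numbers, not real provider rates; the shape is what matters: meter every call, and make the kill switch a hard gate rather than a dashboard you check monthly.

```python
class CostGuard:
    """Track spend per feature and refuse calls past a hard budget.
    Price and budget are illustrative numbers, not real provider rates."""

    def __init__(self, budget_usd: float, price_per_1k_tokens: float = 0.002):
        self.budget = budget_usd
        self.price = price_per_1k_tokens
        self.spent = 0.0

    def charge(self, tokens: int) -> None:
        """Meter actual consumption after every model call."""
        self.spent += tokens / 1000 * self.price

    def allow(self) -> bool:
        """The kill switch: once the budget is gone, calls stop."""
        return self.spent < self.budget
```

Checked before every request, `allow()` turns "finance shut us down" into "the feature degraded to its fallback when it hit budget," which is a defined failure state like any other.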
Applying Systems Thinking to AI
Here is the mental shift that matters: your LLM is not your system. It is one component within your system.
This sounds obvious. It is not. Most AI engineering today treats the model as the center of gravity. The entire architecture revolves around getting the best output from a single model call. Everything else — retrieval, caching, fallbacks, monitoring — is an afterthought.
Systems thinking inverts this. The model is a component with known failure modes, just like a transistor in a circuit. And just like a transistor, it needs supporting infrastructure to function reliably.
Consider the failure modes of a typical LLM-powered feature:
- API Timeouts. Your provider has an outage or throttles your requests. This is not an edge case. It is a certainty on a long enough timeline.
- Hallucinations. The model generates plausible but incorrect information. This is not a bug — it is a fundamental property of how language models work.
- Cost Spikes. A prompt change doubles your average token consumption. A new user pattern triggers unexpectedly long outputs. Your monthly bill triples.
- Model Deprecation. Your provider sunsets the model version you depend on. Your carefully tuned prompts no longer produce the same results.
Each of these is a known failure mode. None of them should surprise you. And none of them should take your system down.
The systems thinker designs for these failures upfront. Not because they are pessimistic — because they understand that the probability of at least one of these happening in production approaches 100% over time. The question is not "will it fail?" but "have I designed for the failure?"
Case Study: The Thursday Night the Embedding API Went Down
I want to share a specific story because I think it illustrates the difference between feature thinking and systems thinking better than any abstraction.
I was running a RAG-powered support system. Real users, real traffic, real expectations. On a Thursday evening around 8 PM, our embedding provider started returning intermittent 503 errors. Response times climbed from 200 milliseconds to two seconds, then to full timeouts.
Here is what did not happen: the system did not go down. Users did not see error pages. Nobody got paged.
Here is what did happen, in sequence:
Our observability layer flagged the latency increase within 90 seconds. The dashboard lit up, and alerts fired to the on-call channel. But by the time I saw the alert, the automated response was already underway.
The circuit breaker on the embedding API tripped after five consecutive failures. The system stopped attempting to call the failing service, which prevented a queue of backed-up requests from overwhelming everything downstream.
Graceful degradation activated. The retrieval layer fell back from semantic search to a pre-computed keyword index. Was it as good? No. Keyword search misses nuance. But it was functional. Users got relevant-enough results instead of a blank screen.
Meanwhile, our redundancy layer kicked in. We had cached embeddings for the 500 most frequently asked queries in a local database. For roughly 60% of incoming traffic, the experience was completely unchanged.
The user-facing message shifted from nothing to a subtle: "Results may be less precise right now." A defined failure state, communicated honestly.
Within 30 minutes, we had switched to our secondary embedding provider — a relationship we had negotiated specifically for this scenario. Full semantic search was restored.
Total downtime: zero. Degraded service window: 30 minutes. Customer complaints: none.
None of this was heroic engineering. It was boring, methodical systems thinking, applied months before the incident ever occurred. Every component had a fallback. Every failure mode had a response. The system worked precisely because we had designed it to work when things broke.
The Mental Model Shift
The industry talks a lot about "prompt engineering." I think that framing is limiting — maybe even harmful.
Prompt engineering frames the LLM as the system. Get the prompt right, and everything works. But a perfect prompt means nothing if the context window is stuffed with irrelevant documents. It means nothing if the API is down. It means nothing if the response costs ten cents per query and your margin is two cents.
The shift I am advocating for is from prompt engineering to systems engineering. The most important skill for an AI engineer is not writing a better prompt. It is designing a better system around that prompt. It is understanding that the prompt is one layer in a stack that includes retrieval, caching, guardrails, observability, fallbacks, and cost controls.
When I evaluate an AI system, I do not start by reading the prompts. I start by asking: "What happens when the model is unavailable?" The answer to that question tells me more about the system's maturity than any benchmark ever could.
This is what 8+ years of shipping software at scale taught me. Not specific knowledge about any one tool or framework — but a way of seeing. A discipline that assumes components will fail and designs the system to absorb those failures. A respect for the boring infrastructure that makes the exciting components viable.
The Checklist: Systems Thinking Before You Ship
I will leave you with the questions I ask myself — and my teams — before any AI feature goes to production. Print this out. Tape it to your monitor. Argue about it in your next architecture review.
Redundancy
- What happens if your primary LLM provider is down for four hours?
- Do you have a cached or static fallback for your most critical user paths?
- Can you switch providers without redeploying?
Defined Failure States
- Can you name every failure mode of every external dependency?
- Does each failure mode have a documented, tested response?
- Have you actually tested those failure responses, or just theorized about them?
Observability
- Are you logging latency, token count, and error rate per request?
- Do you have alerts for cost anomalies, not just errors?
- Can you replay a failed request from your logs to diagnose the root cause?
Graceful Degradation
- If the AI component fails, does the user still get value from the feature?
- Is your degradation path tested in CI, or does it only exist in a design doc?
Cost Awareness
- What is your cost-per-request at 10x your current traffic?
- Do you have a kill switch if costs spike beyond your budget?
- Have you modeled what happens to your unit economics when the provider raises prices by 20%?
Systems are not built by optimists. They are built by engineers who respect the ways things break — and design accordingly.
If you are building AI in production and want to talk about hardening your systems, I am always up for the conversation. Talk to my AI and see these principles in action — or explore my work for more on how I approach reliability engineering.
Software is fragile. Systems are robust. Build the system.
