Evals Are the Unit Tests of AI
We don't deploy code without tests. Why are we deploying AI without evals? A practical guide to building evaluation harnesses that make AI systems reliable enough to trust.
Every backend engineer I know would refuse to merge a PR without test coverage. We've internalized this as a profession. You write the feature, you write the test, you watch it pass in CI, you ship. It's not glamorous. It's the floor. Nobody applauds you for having unit tests; they question your judgment if you don't.
And yet, across the industry, teams are shipping LLM-powered features to production with nothing but a gut feeling. Someone opens the playground, types a few prompts, scans the output, and says "looks good." That's the entire quality assurance process. The feature goes live, and the team crosses its fingers.
I've spent the last two years replacing finger-crossing with engineering. What I've found is straightforward: the same discipline that made traditional software reliable — automated testing with clear pass/fail criteria — works for AI systems too. The tools are different. The mental model is the same.
This post is the playbook I wish I'd had when I started. I'll walk through why "vibe checks" fail, what to measure, how to build your first eval harness, and how to wire it into CI/CD so it actually gets used.
The Vibe Check Anti-Pattern
Let me describe a pattern I've seen on every AI team that later ran into production problems.
The team builds a RAG pipeline or a chat feature. They test it manually by typing in a handful of prompts — usually the same three or four they've been using since the prototype. The output reads well. Someone senior says "ship it." Two weeks later, support tickets start rolling in. The model is hallucinating policy details. It's citing documents that don't exist. It confidently gives wrong answers to questions that were never in the test set.
I call this the Vibe Check Anti-Pattern: evaluating a non-deterministic system with a deterministic mindset. You checked five inputs and they looked fine, so you assumed all inputs would look fine. That's the equivalent of testing your API with one GET request and declaring the whole service production-ready.
Here's why vibe checks fail structurally:
- LLMs are non-deterministic. The same prompt can produce different outputs across runs. A single manual check tells you almost nothing about the distribution of possible responses.
- Prompt changes cascade unpredictably. You tweak the system prompt to fix one edge case, and three other cases regress. Without automated coverage, you won't know until a user reports it.
- Edge cases surface at scale. Your five test prompts represent your imagination. Production represents thousands of users with thousands of ways to phrase things. The gap between those two sets is where failures live.
- Human review doesn't scale. Even if you're disciplined enough to check twenty examples before every deploy, that's still a tiny fraction of the input space. And human attention degrades — by example fifteen, you're skimming.
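The coverage problem can be made concrete with a little arithmetic. Here's a quick sketch — the 5% failure rate and traffic volume are illustrative numbers, not measurements from any real system:

```python
# Why a small spot check says little: if a system fails on 5% of inputs,
# a five-prompt vibe check will look clean most of the time -- while
# production traffic still serves failures constantly.

def p_spot_check_all_pass(failure_rate: float, n_checks: int) -> float:
    """Probability that every prompt in a manual spot check looks fine."""
    return (1 - failure_rate) ** n_checks

def expected_failures(failure_rate: float, daily_requests: int) -> int:
    """Expected number of bad responses served per day at that rate."""
    return round(failure_rate * daily_requests)

if __name__ == "__main__":
    rate = 0.05  # hypothetical: model fails on 1 in 20 inputs
    print(f"P(5-prompt vibe check passes): {p_spot_check_all_pass(rate, 5):.0%}")
    print(f"Bad responses/day at 10k requests: {expected_failures(rate, 10_000)}")
```

Even with a 1-in-20 failure rate, a five-prompt check passes about 77% of the time — while the same system ships roughly 500 bad responses a day at 10k requests.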
The vibe check feels safe because it's familiar. It's what we did before we had testing frameworks for traditional code, too. But we moved past that era for good reason.
What Evals Actually Measure
If evals are the unit tests of AI, what are the assertions? In traditional testing, you assert that a function returns the right value, handles edge cases, and doesn't throw unexpected errors. AI evals are analogous, but adapted for probabilistic outputs.
I organize evals across five dimensions:
Faithfulness (The Core Assertion)
Does the output stay true to the provided context? This is the AI equivalent of "does the function return the correct value." If your RAG system retrieves a document saying refunds are available within 30 days, and the model tells the user 60 days, that's a faithfulness failure. It doesn't matter how fluent or helpful the response sounds — it's wrong.
Faithfulness is non-negotiable. It's your assertEqual.
Relevance (The Integration Test)
Does the output actually address the user's question? A response can be perfectly faithful to the context but completely miss the point. The user asks about pricing, and the model gives a faithful summary of the company's founding story. Technically correct, practically useless.
Relevance evals check that the system's components — retrieval, prompt construction, and generation — are working together correctly. That's your integration test.
Completeness (The Coverage Check)
Did the output include all the important information? Partial answers erode trust quickly. If the refund policy has three conditions and the model only mentions one, that's an incomplete response even if it's faithful and relevant.
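A minimal completeness check, in the same spirit as the substring assertions later in this post, scores an answer by the fraction of required key points it mentions. The key-point phrases here are hypothetical examples:

```python
def completeness_score(answer: str, key_points: list[str]) -> float:
    """Fraction of required key points present in the answer (case-insensitive)."""
    if not key_points:
        return 1.0
    answer_lower = answer.lower()
    hits = sum(1 for point in key_points if point.lower() in answer_lower)
    return hits / len(key_points)

# The refund policy has three conditions; an answer mentioning only one
# scores 1/3 even though everything it says is faithful and relevant.
policy_points = ["30 days", "original receipt", "digital goods"]
partial = "Refunds are available within 30 days of purchase."
print(completeness_score(partial, policy_points))  # 0.333...
```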
Latency (The Performance Test)
How long did the full pipeline take? Users have expectations. A chatbot that takes twelve seconds to respond has already lost the conversation. I track p50, p95, and p99 latency across the entire pipeline — retrieval, reranking, generation — not just the LLM call.
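Computing those percentiles from recorded per-request timings is a few lines with the standard library. The latency samples below are made up; in practice you'd record wall-clock time around the full retrieve → rerank → generate pipeline:

```python
import statistics

def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """Compute p50/p95/p99 from a list of per-request latencies (ms)."""
    # quantiles(n=100) returns 99 cut points; index i-1 is the i-th percentile
    qs = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

# Hypothetical timings -- note the long tail the average would hide.
samples = [800, 950, 1100, 1200, 1300, 1500, 2100, 2400, 3800, 9200]
print(latency_percentiles(samples))
```

The point of tracking p95/p99 rather than the mean: one 9-second outlier barely moves the average, but it's exactly what your slowest users experience.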
Cost (The Unit Economics Test)
What did that response cost to produce? This is the one most teams skip, and it's the one that kills products. If your average response costs $0.12 in API calls, and your margin per user interaction is $0.08, you have a profitable-sounding feature that is actually losing money on every request. I track cost-per-response as a first-class eval metric because reliability without viable unit economics is a path to a product that works but can't survive.
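The unit-economics check is back-of-the-envelope arithmetic from token counts. The per-token prices below are placeholders, not any provider's actual rates — check current pricing:

```python
PRICE_PER_1M_INPUT = 2.50    # USD per 1M input tokens (assumed)
PRICE_PER_1M_OUTPUT = 10.00  # USD per 1M output tokens (assumed)

def cost_per_response(input_tokens: int, output_tokens: int) -> float:
    """API cost in USD for a single response."""
    return (input_tokens * PRICE_PER_1M_INPUT
            + output_tokens * PRICE_PER_1M_OUTPUT) / 1_000_000

# A RAG prompt with several retrieved chunks easily reaches thousands
# of input tokens per request.
cost = cost_per_response(input_tokens=6_000, output_tokens=400)
print(f"${cost:.4f} per response")
margin_per_interaction = 0.08  # hypothetical margin, as in the text above
print("viable" if cost < margin_per_interaction else "losing money")
```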
Building Your First Eval Harness
Enough theory. Here's how I build these in practice. I'll walk through a Python eval harness that starts simple and escalates to LLM-as-judge scoring.
Step 1: Define Your Test Cases
Think of these like pytest fixtures — structured inputs with expected properties:
```python
# eval_cases.py
EVAL_CASES = [
    {
        "input": "What is the refund policy?",
        "context": "Refunds are available within 30 days of purchase. "
                   "Original receipt required. No refunds on digital goods.",
        "expected_substrings": ["30 days", "receipt"],
        "expected_not_present": ["60 days", "no refund policy"],
        "tags": ["policy", "factuality"],
    },
    {
        "input": "How do I contact support?",
        "context": "Support is available via email at help@example.com "
                   "or by phone at 1-800-555-0199, Mon-Fri 9am-5pm EST.",
        "expected_substrings": ["help@example.com", "1-800-555-0199"],
        "expected_not_present": ["24/7"],
        "tags": ["support", "factuality"],
    },
]
```
Step 2: Build a Simple Assertion-Based Runner
This is the most basic eval — deterministic checks against LLM output. It won't catch everything, but it catches the obvious regressions:
```python
# eval_runner.py
from openai import OpenAI

client = OpenAI()

def run_llm(prompt: str, context: str) -> str:
    """Call the LLM with a constrained system prompt."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the user's question using ONLY the "
                    "provided context. If the context doesn't "
                    "contain the answer, say so explicitly."
                    f"\n\nContext:\n{context}"
                ),
            },
            {"role": "user", "content": prompt},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

def eval_deterministic(cases):
    """Run substring assertions — fast, cheap, catches regressions."""
    results = []
    for case in cases:
        output = run_llm(case["input"], case["context"])
        passed = all(
            s.lower() in output.lower()
            for s in case["expected_substrings"]
        )
        no_hallucination = all(
            s.lower() not in output.lower()
            for s in case["expected_not_present"]
        )
        results.append({
            "input": case["input"],
            "output": output,
            "passed": passed and no_hallucination,
            "tags": case["tags"],
        })
    return results
```
This catches about 60% of what you need. It's fast, cheap to run, and requires no additional LLM calls. Start here.
Step 3: Add LLM-as-Judge for Nuanced Scoring
Substring matching doesn't capture tone, completeness, or whether the answer is actually helpful. For that, I use a second LLM as a judge — the same pattern that evaluation frameworks like DeepEval and RAGAS use under the hood:
```python
import json

def judge_faithfulness(
    question: str, context: str, answer: str
) -> dict:
    """Score faithfulness using a separate LLM as judge."""
    rubric = (
        "You are an evaluation judge. Score the ANSWER's "
        "faithfulness to the CONTEXT on a scale of 0.0 to 1.0.\n\n"
        "Rules:\n"
        "- 1.0 = every claim in the answer is supported by context\n"
        "- 0.5 = some claims supported, some unsupported\n"
        "- 0.0 = answer contradicts or fabricates beyond context\n\n"
        f"QUESTION: {question}\n"
        f"CONTEXT: {context}\n"
        f"ANSWER: {answer}\n\n"
        'Respond with ONLY valid JSON: '
        '{"score": <float>, "reason": "<one sentence>"}'
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": rubric}],
        temperature=0,
        # Force strict JSON output so json.loads can't choke on prose
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```
Step 4: Wire It Into a Pass/Fail Gate
Now combine both approaches into a single suite that exits non-zero on failure — so your CI pipeline treats it exactly like a failing test:
```python
def run_eval_suite(cases, threshold=0.85):
    """Run full eval suite. Exit non-zero if below threshold."""
    results = []
    for case in cases:
        output = run_llm(case["input"], case["context"])
        judgment = judge_faithfulness(
            case["input"], case["context"], output
        )
        results.append({
            "input": case["input"],
            "score": judgment["score"],
            "reason": judgment["reason"],
            "passed": judgment["score"] >= threshold,
        })

    pass_rate = sum(1 for r in results if r["passed"]) / len(results)
    print(f"\nEval Results: {pass_rate:.0%} pass rate")
    print(f"Threshold: {threshold:.0%}")
    for r in results:
        status = "PASS" if r["passed"] else "FAIL"
        print(f"  [{status}] {r['input']}")
        print(f"    Score: {r['score']}, {r['reason']}")

    if pass_rate < threshold:
        raise SystemExit(
            f"Eval FAILED: {pass_rate:.0%} < {threshold:.0%}"
        )
    return results

if __name__ == "__main__":
    from eval_cases import EVAL_CASES
    run_eval_suite(EVAL_CASES)
```
Run it with `python eval_runner.py`. If the suite fails, your deploy stops. That's the whole point.
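To make the gate real, run the script on every PR. Here's a sketch assuming GitHub Actions — the trigger paths, Python version, and secret name are assumptions to adapt to your repo:

```yaml
# .github/workflows/evals.yml
name: Eval Gate
on:
  pull_request:
    paths:
      - "prompts/**"
      - "eval_cases.py"
      - "eval_runner.py"
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install openai
      - run: python eval_runner.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

Because `run_eval_suite` raises `SystemExit` with a message on failure, the job exits non-zero and the PR is blocked — no special CI integration needed.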
The Reliability Flywheel
Here's where this stops being about testing and starts being about growth.
When I introduced hard evaluation harnesses to replace manual "vibe checks" on a content retrieval system, the immediate effect was predictable: we caught regressions before users did. Hallucinations dropped. Responses got more accurate.
But the second-order effect was the one that changed the business: user trust increased, and impressions lifted by 482%.
That's not a typo. When people trust the output, they use the system more. When they use it more, they share it. When they share it, impressions compound. Reliability created a flywheel that no amount of feature work could have produced.
This is the argument I make to every stakeholder who asks why we're "wasting time" on evals instead of building features: reliability is the feature. Users don't adopt AI products because of capabilities — they adopt them because they trust the output. And trust is measurable. That's what evals give you.
The flywheel works like this:
- Measure faithfulness, relevance, and completeness with automated evals.
- Enforce thresholds — no deploy if the score drops below the baseline.
- Observe the improvement in user engagement metrics as trust builds.
- Collect the new edge cases that production traffic reveals.
- Add those cases to your eval suite and repeat.
Each cycle makes the system more reliable, which makes users more trusting, which drives more usage, which reveals more edge cases, which makes the next cycle even more valuable. This is Systems Thinking applied to AI quality — the feedback loop compounds.
Frameworks and Tools
You don't have to build everything from scratch. The evaluation ecosystem has matured significantly, and there are strong options depending on your needs.
RAGAS
Best for: RAG-specific pipelines. RAGAS (Retrieval-Augmented Generation Assessment) provides metrics purpose-built for retrieval systems — faithfulness, answer relevancy, context precision, and context recall. It's Python-native, lightweight, and plugs directly into LangChain, LlamaIndex, or Haystack pipelines. If your primary concern is "did the retriever pull the right documents and did the generator use them faithfully," RAGAS is where I'd start.
DeepEval
Best for: Teams that want a pytest-like experience. DeepEval offers 60+ metrics and is designed to feel like writing backend tests. You define test cases, run them with deepeval test run, and get pass/fail results in your terminal. It's fully CI/CD compatible and self-explaining — each metric tells you why the score is what it is. If your team already lives in pytest, DeepEval has the lowest adoption friction.
promptfoo
Best for: Prompt iteration and red-teaming. promptfoo takes a declarative YAML approach that's ideal for comparing prompt variants, running A/B tests, and catching security issues. Here's what a basic config looks like:
```yaml
# promptfooconfig.yaml
description: "Support bot faithfulness eval"

providers:
  - openai:gpt-4o

prompts:
  - "Answer using ONLY this context: {{context}}\n\nQ: {{question}}"

tests:
  - vars:
      question: "What is the refund policy?"
      context: "Refunds available within 30 days."
    assert:
      - type: contains
        value: "30 days"
      - type: llm-rubric
        value: "Answer is faithful to the provided context"
      - type: latency
        threshold: 3000
```
Run it with `npx promptfoo eval` and you get a comparison table in your terminal. No Python required. I use promptfoo for rapid prompt iteration during development and the Python harness for CI gates.
Custom Harnesses
Sometimes the off-the-shelf metrics don't capture what matters for your domain. A legal AI needs different faithfulness criteria than a customer support bot. When that happens, build a custom judge (like the one above) with domain-specific rubrics. The frameworks give you scaffolding; your domain expertise gives you the assertions that actually matter.
The Cultural Shift
The hardest part of AI evaluation isn't technical — it's cultural.
Most AI teams treat evals as a one-time validation exercise. You build the eval suite during the initial development phase, run it to prove the system works, show the results to stakeholders, and then never touch it again. The eval suite becomes stale. New prompts ship without eval coverage. Regressions creep in quietly.
This is the same anti-pattern traditional software went through before CI/CD became standard practice. We solved it there by making tests a gate, not a report. The same principle applies here.
Here's what I enforce on teams I work with:
- Every PR that touches a prompt includes eval results. No exceptions. If you changed the system prompt, show me the before/after scores. This is the same as requiring test coverage for new code paths.
- Model migrations are gated on eval pass rates. Upgrading from GPT-4o to GPT-4.5? Run the full eval suite first. I've seen model "upgrades" cause 15% faithfulness regressions because the new model was more verbose and less precise. The eval caught it before users did.
- Eval cases grow from production incidents. Every time a user reports a bad response, that becomes a new eval case. Your test suite should be a living record of every failure mode you've encountered. Over time, it becomes your most valuable artifact — more valuable than the prompt itself.
- Dashboards, not spreadsheets. Track eval scores over time so you can spot trends. A slow drift downward in faithfulness is easier to catch on a chart than in a weekly manual review.
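The "dashboards, not spreadsheets" point can be partially automated: compare the recent average faithfulness score against the longer-run baseline and flag slow downward drift. A minimal sketch — the window size and 5% tolerance are arbitrary choices, not recommendations:

```python
def detect_drift(scores: list[float], window: int = 5,
                 tolerance: float = 0.05) -> bool:
    """True if the recent window average fell below baseline by > tolerance."""
    if len(scores) < 2 * window:
        return False  # not enough history to compare
    baseline = sum(scores[:-window]) / len(scores[:-window])
    recent = sum(scores[-window:]) / window
    return (baseline - recent) > tolerance

# Ten nightly eval runs: faithfulness slips from ~0.93 to ~0.84 -- the
# kind of slow decline a weekly manual review tends to miss.
history = [0.93, 0.94, 0.92, 0.93, 0.94, 0.89, 0.87, 0.86, 0.85, 0.84]
print(detect_drift(history))  # True
```

Wire this to an alert and the chart checks itself.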
The teams that treat evals as infrastructure — always running, always growing, always gating deploys — are the teams that ship AI features with confidence. They deploy on Fridays. They swap models without fear. They iterate on prompts knowing they have a safety net.
That's what Hardened AI means in practice. Not bulletproof models — those don't exist. But systems with guardrails that catch failures before users do, feedback loops that compound quality over time, and engineering discipline that treats reliability as the foundation rather than an afterthought.
We stopped deploying code without tests a long time ago. It's time we held AI to the same standard.
