Evals Are the Unit Tests of AI
We don't deploy code without tests. Why are we deploying
AI without evals?
Every backend engineer I know would refuse to merge a PR
without test coverage. You write the feature, you write the
test, you watch it pass in CI, you ship. Nobody applauds you
for having unit tests; they question your judgment if you
don't.
And yet, across the industry, teams are shipping LLM-powered
features to production with nothing but a gut feeling. Someone
opens the playground, types a few prompts, scans the output,
and says "looks good." The feature goes live, and the team
crosses its fingers.
I spent two years replacing finger-crossing with engineering.
The same discipline that made traditional software reliable --
automated testing with clear pass/fail criteria -- works for
AI systems too. The tools are different. The mental model is
the same.
This is the playbook: why "vibe checks" fail, what to
measure, how to build your first eval harness, and how to
wire it into CI/CD so it actually gets used.
The Vibe Check Anti-Pattern
Here is a pattern I have seen on every AI team that later ran
into production problems.
The team builds a RAG pipeline or a chat feature. They test it
manually -- the same three or four prompts they have been
using since the prototype. The output reads well. Someone
senior says "ship it." Two weeks later, support tickets roll
in. The model is hallucinating policy details. It cites
documents that don't exist. It confidently gives wrong answers
to questions that were never in the test set.
I call this the Vibe Check Anti-Pattern: evaluating a
non-deterministic system with a deterministic mindset. You
checked five inputs and they looked fine, so you assumed all
inputs would look fine. That is the equivalent of testing
your API with one GET request and declaring the whole service
production-ready.
Vibe checks fail for four structural reasons:
- LLMs are non-deterministic. The same prompt produces
different outputs across runs. A single manual check tells
you almost nothing about the distribution of possible
responses.
- Prompt changes cascade unpredictably. You tweak the
system prompt to fix one edge case, and three other cases
regress. Without automated coverage, you won't know until
a user reports it.
- Edge cases surface at scale. Your five test prompts
represent your imagination. Production represents thousands
of users with thousands of phrasings. The gap between
those two sets is where failures live.
- Human review doesn't scale. Even if you check twenty
examples before every deploy, that is a tiny fraction of
the input space. And human attention degrades. By example
fifteen, you are skimming.
The vibe check feels safe because it is familiar. We did the
same thing before testing frameworks existed for traditional
code. We moved past that era for good reason.
What Evals Actually Measure
If evals are the unit tests of AI, what are the assertions?
In traditional testing, you assert that a function returns the
right value, handles edge cases, and doesn't throw unexpected
errors. AI evals are analogous, but adapted for probabilistic
outputs.
I organize evals across five dimensions:
Faithfulness (Your assertEqual)
Does the output stay true to the provided context? If your
RAG system retrieves a document saying refunds are available
within 30 days, and the model tells the user 60 days, that is
a faithfulness failure. It does not matter how fluent or
helpful the response sounds. It is wrong.
Faithfulness is non-negotiable. It is your assertEqual.
Relevance (Your Integration Test)
Does the output actually address the user's question? A
response can be perfectly faithful to the context but
completely miss the point. The user asks about pricing, and
the model gives a faithful summary of the company's founding
story. Technically correct, practically useless.
Relevance evals check that retrieval, prompt construction,
and generation are working together. That is your integration
test.
Completeness (Your Coverage Check)
Did the output include all the important information? Partial
answers erode trust fast. If the refund policy has three
conditions and the model only mentions one, that is an
incomplete response even if it is faithful and relevant.
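A substring coverage check gets you a long way here. A
minimal sketch -- the required-fact list is illustrative, not
from a real suite:

def completeness_score(output: str, required_facts: list[str]) -> float:
    """Fraction of the required facts the output actually mentions."""
    mentioned = sum(
        1 for fact in required_facts
        if fact.lower() in output.lower()
    )
    return mentioned / len(required_facts)

# A refund policy with three conditions; an answer that
# mentions only the window scores 1/3.
score = completeness_score(
    "Refunds are available within 30 days.",
    ["30 days", "receipt", "digital goods"],
)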
Latency (Your Performance Test)
How long did the full pipeline take? A chatbot that takes
twelve seconds to respond has already lost the conversation.
I track p50, p95, and p99 latency across the entire
pipeline -- retrieval, reranking, generation -- not just the
LLM call.
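Nothing fancy is needed for the tracking; the standard
library covers it. A minimal sketch, where pipeline is a
stand-in for your own end-to-end function:

import time
import statistics

def latency_percentiles(pipeline, queries) -> dict:
    """End-to-end latency percentiles across a batch of queries."""
    samples = []
    for q in queries:
        start = time.perf_counter()
        pipeline(q)  # retrieval + reranking + generation
        samples.append(time.perf_counter() - start)
    # quantiles(n=100) returns the 99 percentile cut points
    qs = statistics.quantiles(samples, n=100)
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}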
Cost (Your Unit Economics Test)
What did that response cost to produce? This is the one most
teams skip, and it is the one that kills products. If your
average response costs $0.12 in API calls and your margin per
interaction is $0.08, you have a feature that loses money on
every request. I track cost-per-response as a first-class
eval metric because reliability without viable unit economics
gives you a product that works but cannot survive.
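The accounting itself is one multiplication per token type. A
sketch -- the prices below are placeholders, not current
rates, so substitute your provider's rate card:

# Placeholder USD prices per million tokens -- check your
# provider's current rate card; these numbers go stale quickly.
PRICE_PER_M_INPUT = 2.50
PRICE_PER_M_OUTPUT = 10.00

def cost_per_response(usage) -> float:
    """Dollar cost of one completion, from the API's usage object."""
    return (
        usage.prompt_tokens / 1e6 * PRICE_PER_M_INPUT
        + usage.completion_tokens / 1e6 * PRICE_PER_M_OUTPUT
    )

# The OpenAI client returns this on every call: response.usage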
Building Your First Eval Harness
Enough theory. Here is how I build these in practice: a
Python eval harness that starts simple and escalates to
LLM-as-judge scoring.
Step 1: Define Your Test Cases
Think of these like pytest fixtures -- structured inputs
with expected properties:
# eval_cases.py
EVAL_CASES = [
    {
        "input": "What is the refund policy?",
        "context": "Refunds are available within 30 days of purchase. "
                   "Original receipt required. No refunds on digital goods.",
        "expected_substrings": ["30 days", "receipt"],
        "expected_not_present": ["60 days", "no refund policy"],
        "tags": ["policy", "factuality"],
    },
    {
        "input": "How do I contact support?",
        "context": "Support is available via email at help@example.com "
                   "or by phone at 1-800-555-0199, Mon-Fri 9am-5pm EST.",
        "expected_substrings": ["help@example.com", "1-800-555-0199"],
        "expected_not_present": ["24/7"],
        "tags": ["support", "factuality"],
    },
]
Step 2: Build a Simple Assertion-Based Runner
This is the most basic eval: deterministic checks against LLM output. It won't catch everything, but it catches the obvious regressions:
# eval_runner.py
from openai import OpenAI

client = OpenAI()

def run_llm(prompt: str, context: str) -> str:
    """Call the LLM with a constrained system prompt."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the user's question using ONLY the "
                    "provided context. If the context doesn't "
                    "contain the answer, say so explicitly."
                    f"\n\nContext:\n{context}"
                ),
            },
            {"role": "user", "content": prompt},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

def eval_deterministic(cases):
    """Run substring assertions: fast, cheap, catches regressions."""
    results = []
    for case in cases:
        output = run_llm(case["input"], case["context"])
        passed = all(
            s.lower() in output.lower()
            for s in case["expected_substrings"]
        )
        no_hallucination = all(
            s.lower() not in output.lower()
            for s in case["expected_not_present"]
        )
        results.append({
            "input": case["input"],
            "output": output,
            "passed": passed and no_hallucination,
            "tags": case["tags"],
        })
    return results
This catches about 60% of what you need. It is fast, cheap to
run, and requires no additional LLM calls. Start here.
Step 3: Add LLM-as-Judge for Nuanced Scoring
Substring matching does not capture tone, completeness, or
whether the answer is actually helpful. For that, I use a
second LLM as a judge -- the same pattern that evaluation
frameworks like DeepEval and RAGAS use under the hood:
import json

def judge_faithfulness(
    question: str, context: str, answer: str
) -> dict:
    """Score faithfulness using a separate LLM as judge."""
    rubric = (
        "You are an evaluation judge. Score the ANSWER's "
        "faithfulness to the CONTEXT on a scale of 0.0 to 1.0.\n\n"
        "Rules:\n"
        "- 1.0 = every claim in the answer is supported by context\n"
        "- 0.5 = some claims supported, some unsupported\n"
        "- 0.0 = answer contradicts or fabricates beyond context\n\n"
        f"QUESTION: {question}\n"
        f"CONTEXT: {context}\n"
        f"ANSWER: {answer}\n\n"
        'Respond with ONLY valid JSON: '
        '{"score": <float>, "reason": "<one sentence>"}'
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": rubric}],
        temperature=0,
        # JSON mode keeps the judge from wrapping its verdict in
        # prose or markdown fences that would break json.loads.
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
Step 4: Wire It Into a Pass/Fail Gate
Now combine both approaches into a single suite that exits non-zero on failure, so your CI pipeline treats it exactly like a failing test:
def run_eval_suite(cases, threshold=0.85):
    """Run full eval suite. Exit non-zero if below threshold."""
    results = []
    for case in cases:
        output = run_llm(case["input"], case["context"])
        judgment = judge_faithfulness(
            case["input"], case["context"], output
        )
        results.append({
            "input": case["input"],
            "score": judgment["score"],
            "reason": judgment["reason"],
            "passed": judgment["score"] >= threshold,
        })
    pass_rate = sum(
        1 for r in results if r["passed"]
    ) / len(results)
    print(f"\nEval Results: {pass_rate:.0%} pass rate")
    print(f"Threshold: {threshold:.0%}")
    for r in results:
        status = "PASS" if r["passed"] else "FAIL"
        print(f"  [{status}] {r['input']}")
        print(f"    Score: {r['score']}, {r['reason']}")
    if pass_rate < threshold:
        raise SystemExit(
            f"Eval FAILED: {pass_rate:.0%} < {threshold:.0%}"
        )
    return results

if __name__ == "__main__":
    from eval_cases import EVAL_CASES
    run_eval_suite(EVAL_CASES)
Run it with python eval_runner.py. If the suite fails, your
deploy stops. That is the whole point.
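If your CI already runs pytest, a thin wrapper makes the gate
automatic -- the uncaught SystemExit fails the test, and the
failing test fails the build. A minimal sketch:

# test_evals.py
from eval_cases import EVAL_CASES
from eval_runner import run_eval_suite

def test_eval_suite_meets_threshold():
    # Raises SystemExit when the pass rate is below threshold,
    # which pytest reports as a test failure.
    run_eval_suite(EVAL_CASES, threshold=0.85)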
The Reliability Flywheel
Here is where this stops being about testing and starts being
about growth.
When I introduced hard evaluation harnesses to replace manual
"vibe checks" on a content retrieval system, the immediate
effect was predictable: we caught regressions before users
did. Hallucinations dropped. Responses got more accurate.
The second-order effect changed the business: user trust
increased, and impressions lifted by 482%.
Not a typo. When people trust the output, they use the system
more. When they use it more, they share it. When they share
it, impressions compound. Reliability created a flywheel
that no amount of feature work could have produced.
This is the argument I make to every stakeholder who asks
why we are "wasting time" on evals instead of building
features: reliability is the feature. Users do not adopt
AI products because of capabilities. They adopt them because
they trust the output. Trust is measurable. That is what
evals give you.
The flywheel works like this:
- Measure faithfulness, relevance, and completeness with automated evals.
- Enforce thresholds: no deploy if the score drops below the baseline.
- Observe the improvement in user engagement metrics as trust builds.
- Collect the new edge cases that production traffic reveals.
- Add those cases to your eval suite and repeat.
Each cycle makes the system more reliable, which makes users
more trusting, which drives more usage, which reveals more
edge cases, which makes the next cycle even more valuable.
This is Systems Thinking applied to AI quality. The feedback
loop compounds.
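The collection step is the easiest one to automate. A sketch,
assuming you log flagged responses alongside the question and
retrieved context (the incident fields here are hypothetical):

def incident_to_eval_case(incident: dict) -> dict:
    """Turn a flagged production response into a new eval case."""
    return {
        "input": incident["question"],
        "context": incident["retrieved_context"],
        # Filled in during triage: what a correct answer must
        # and must not contain for this failure mode.
        "expected_substrings": [],
        "expected_not_present": [],
        "tags": ["production-incident", incident["ticket_id"]],
    }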
Frameworks and Tools
You do not have to build everything from scratch. The
evaluation ecosystem has matured, and there are strong
options depending on your needs.
RAGAS
Best for: RAG-specific pipelines. RAGAS provides metrics
purpose-built for retrieval systems: faithfulness, answer
relevancy, context precision, and context recall. Python-
native, lightweight, and plugs directly into LangChain,
LlamaIndex, or Haystack pipelines. If your primary concern
is "did the retriever pull the right documents and did the
generator use them faithfully," start here.
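A minimal sketch of what a RAGAS run looks like (the imports
follow the 0.1-era API, which has shifted between versions, so
verify against the current docs):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

data = Dataset.from_dict({
    "question": ["What is the refund policy?"],
    "answer": ["Refunds are available within 30 days with a receipt."],
    "contexts": [[
        "Refunds are available within 30 days of purchase. "
        "Original receipt required."
    ]],
})

scores = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(scores)  # per-metric scores for the batch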
DeepEval
Best for: Teams that want a pytest-like experience.
DeepEval offers 60+ metrics and is designed to feel like
writing backend tests. Define test cases, run them with
deepeval test run, get pass/fail results in your terminal.
Fully CI/CD compatible. Each metric tells you why the
score is what it is. If your team already lives in pytest,
DeepEval has the lowest adoption friction.
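A sketch of the pattern, based on DeepEval's documented
test-case API (treat the exact metric names as assumptions to
check against the current docs):

from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric

def test_refund_policy_faithfulness():
    test_case = LLMTestCase(
        input="What is the refund policy?",
        actual_output="Refunds are available within 30 days with a receipt.",
        retrieval_context=[
            "Refunds are available within 30 days of purchase. "
            "Original receipt required."
        ],
    )
    assert_test(test_case, [FaithfulnessMetric(threshold=0.85)])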
promptfoo
Best for: Prompt iteration and red-teaming. promptfoo
takes a declarative YAML approach ideal for comparing prompt
variants, running A/B tests, and catching security issues.
Here is a basic config:
# promptfooconfig.yaml
description: "Support bot faithfulness eval"

providers:
  - openai:gpt-4o

prompts:
  - "Answer using ONLY this context: {{context}}\n\nQ: {{question}}"

tests:
  - vars:
      question: "What is the refund policy?"
      context: "Refunds available within 30 days."
    assert:
      - type: contains
        value: "30 days"
      - type: llm-rubric
        value: "Answer is faithful to the provided context"
      - type: latency
        threshold: 3000
Run it with npx promptfoo eval and you get a comparison
table in your terminal. No Python required. I use promptfoo
for rapid prompt iteration during development and the Python
harness for CI gates.
Custom Harnesses
Sometimes the off-the-shelf metrics do not capture what
matters for your domain. A legal AI needs different
faithfulness criteria than a customer support bot. When that
happens, build a custom judge (like the one above) with
domain-specific rubrics. The frameworks give you scaffolding;
your domain expertise gives you the assertions that
actually matter.
The Cultural Shift
The hardest part of AI evaluation is not technical. It is
cultural.
Most AI teams treat evals as a one-time validation exercise.
Build the eval suite during initial development, run it to
prove the system works, show the results to stakeholders,
then never touch it again. The eval suite goes stale. New
prompts ship without coverage. Regressions creep in quietly.
This is the same anti-pattern traditional software went
through before CI/CD became standard practice. We solved it
by making tests a gate, not a report. The same principle
applies here.
Here is what I enforce on teams I work with:
- Every PR that touches a prompt includes eval results.
No exceptions. If you changed the system prompt, show me
the before/after scores. Same as requiring test coverage
for new code paths.
- Model migrations are gated on eval pass rates. Upgrading
from GPT-4o to GPT-4.5? Run the full eval suite first. I
have seen model "upgrades" cause 15% faithfulness
regressions because the new model was more verbose and less
precise. The eval caught it before users did.
- Eval cases grow from production incidents. Every time a
user reports a bad response, that becomes a new eval case.
Your test suite should be a living record of every failure
mode you have encountered. Over time, it becomes your most
valuable artifact -- more valuable than the prompt itself.
- Dashboards, not spreadsheets. Track eval scores over
time so you can spot trends. A slow drift downward in
faithfulness is easier to catch on a chart than in a weekly
manual review.
The teams that treat evals as infrastructure -- always
running, always growing, always gating deploys -- are the
teams that ship AI features with confidence. They deploy on
Fridays. They swap models without fear. They iterate on
prompts knowing they have a safety net.
That is what production AI actually looks like. Not
bulletproof models -- those do not exist -- but systems with
guardrails that catch failures before users do, feedback
loops that compound quality over time, and engineering
discipline that treats reliability as the foundation rather
than an afterthought.
We stopped deploying code without tests a long time ago. It
is time we held AI to the same standard.