In the previous lesson, I explained why vibe checks fail. Now I will show you how to build the system that replaces them. An eval suite is not a one-time test. It is a living pipeline that runs every time you change your system. By the end of this lesson, you will have the blueprint for one.
Every evaluation system, regardless of framework, reduces to three pieces:

1. A golden dataset: curated test cases with known-good outputs.
2. Scorers: functions that grade each response against expectations.
3. A runner: the harness that executes the suite and reports results.
That is it. Everything else is tooling and convenience. If you understand these three pieces, you can evaluate any AI system.
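To make the three pieces concrete, here is a minimal sketch of how they fit together. The names (`EvalCase`, `run_suite`) and the toy echo model are illustrative, not any framework's API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    id: str
    input: str
    expected_output: str

def run_suite(cases: list[EvalCase],
              generate: Callable[[str], str],
              score: Callable[[str, str], float]) -> dict[str, float]:
    """The runner: feed every case through the system, grade each output."""
    results = {}
    for case in cases:
        actual = generate(case.input)
        results[case.id] = score(case.expected_output, actual)
    return results

# Toy usage: an echo "model" and an exact-match scorer.
cases = [EvalCase("T-001", "ping", "ping")]
scores = run_suite(cases, generate=lambda q: q,
                   score=lambda e, a: 1.0 if e == a else 0.0)
print(scores)  # {'T-001': 1.0}
```

Every framework in this lesson is some elaboration of this loop.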
A golden dataset is a curated set of input-output pairs where you know what "good" looks like. This is the foundation. Without it, nothing else works.
What goes into a golden dataset:
```typescript
interface EvalCase {
  id: string;
  input: string;            // The user query or prompt
  expectedOutput?: string;  // Ideal response (if you have one)
  context?: string[];       // Retrieved documents (for RAG evals)
  metadata: {
    category: string;       // e.g., "factual", "summarization", "code"
    difficulty: string;     // e.g., "easy", "medium", "hard"
    source: string;         // Where this case came from
  };
}
```
Where to source test cases:
| Source | Strength | Watch Out For |
|---|---|---|
| Production logs | Real user behavior | May contain PII |
| Support tickets | Known failure modes | Selection bias toward complaints |
| Domain experts | High-quality edge cases | Expensive, slow to collect |
| Synthetic generation | Scale, coverage | Can miss real-world messiness |
| Red-teaming sessions | Adversarial coverage | Overfits to attack patterns |
My rule of thumb: Start with 50 cases. Get 20 from production logs, 15 from known failure modes, 10 from domain experts, and 5 adversarial cases. You can grow from there, but 50 well-chosen cases will catch most regressions.
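One way to keep yourself honest about that mix is a small coverage check. This is a sketch under my own assumptions: the target counts come from the rule of thumb above, and the source labels (`production`, `failure_mode`, etc.) are hypothetical values for the `metadata.source` field:

```python
from collections import Counter

# Target composition from the 50-case rule of thumb (labels are illustrative).
TARGET_MIX = {"production": 20, "failure_mode": 15, "expert": 10, "adversarial": 5}

def coverage_gaps(cases: list[dict]) -> dict[str, int]:
    """Report how many cases each source still needs to hit the target mix."""
    have = Counter(c["metadata"]["source"] for c in cases)
    return {src: max(0, want - have[src]) for src, want in TARGET_MIX.items()}
```

Run it whenever you add cases, and stop when every gap reads zero.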
Example golden dataset rows for a customer support RAG system:
| id | input | expectedOutput | context | category | difficulty |
|---|---|---|---|---|---|
| CS-001 | "What is the refund policy?" | "Refunds within 30 days of purchase, processed in 5 business days" | ["Refund policy doc section 2.1"] | policy | easy |
| CS-002 | "Can I get a refund after 45 days?" | "Refunds are only available within 30 days. Contact support for exceptions." | ["Refund policy doc section 2.1"] | policy | medium |
| CS-003 | "I bought the enterprise plan but need to downgrade" | "Enterprise downgrades require contacting your account manager..." | ["Billing docs section 4.3", "Enterprise terms"] | billing | hard |
| CS-004 | "your product sucks give me my money back" | Empathetic acknowledgment + refund policy + escalation path | ["Refund policy doc", "Customer service guidelines"] | adversarial | hard |
| CS-005 | "What integrations do you support?" | "We support Slack, GitHub, Jira, and Salesforce..." | ["Integrations overview page"] | factual | easy |
Notice the range: easy factual lookups, boundary conditions (45 days vs. 30-day policy), multi-document reasoning (enterprise downgrades), and adversarial phrasing. This is the diversity that catches real failures.
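In practice I keep datasets like this in a JSONL file, one case per line, so they diff cleanly in version control. A minimal loader sketch (the field names follow the `EvalCase` schema above; the validation is a deliberate design choice, failing fast on malformed rows rather than mid-run):

```python
import json

def load_golden_dataset(path: str) -> list[dict]:
    """Load eval cases from a JSONL file, one JSON object per line."""
    cases = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            case = json.loads(line)
            # Fail fast on malformed rows instead of halfway through a run.
            assert "id" in case and "input" in case, f"malformed case: {case}"
            cases.append(case)
    return cases
```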
Not all evals need the same scoring approach. Match your scorer to what you are measuring.
Use when there is a single correct answer.
```python
def exact_match(expected: str, actual: str) -> float:
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0
```
Best for: classification tasks, entity extraction, structured outputs.
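For structured outputs specifically, plain string comparison is brittle: the model may reorder JSON keys or vary whitespace while being semantically identical. A hedged variant I find useful, comparing parsed values instead of raw strings:

```python
import json

def json_exact_match(expected: str, actual: str) -> float:
    """Compare structured outputs by parsed value, so key order and
    whitespace differences do not count as failures."""
    try:
        return 1.0 if json.loads(expected) == json.loads(actual) else 0.0
    except json.JSONDecodeError:
        return 0.0  # unparseable output is itself a failure

print(json_exact_match('{"a": 1, "b": 2}', '{ "b": 2, "a": 1 }'))  # 1.0
```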
Use when the meaning matters more than the exact wording.
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_similarity(expected: str, actual: str) -> float:
    emb_a = model.encode(expected, convert_to_tensor=True)
    emb_b = model.encode(actual, convert_to_tensor=True)
    return util.cos_sim(emb_a, emb_b).item()
```
Best for: open-ended generation where multiple phrasings are acceptable.
Use when quality requires nuanced judgment that heuristics cannot capture.
```python
import json

async def llm_judge(query: str, response: str, rubric: str) -> float:
    # `llm` is whatever model client you use; generate() is a placeholder
    # for its completion call.
    prompt = f"""Rate the following response on a scale of 0 to 1.

Question: {query}
Response: {response}
Rubric: {rubric}

Return ONLY a JSON object: {{"score": <float>, "reasoning": "<brief explanation>"}}
"""
    result = await llm.generate(prompt)
    return json.loads(result)["score"]
```
Best for: faithfulness, helpfulness, tone, safety. I use this pattern heavily. The key insight from recent research is to prefer binary (pass/fail) or 3-point scales over 10-point scales. Binary judgments are significantly more consistent. In studies on LLM-as-judge reliability, few-shot prompting improved GPT-4's consistency from 65% to 77.5%.
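Following that binary-scale advice, here is how I would restructure the judge above into a pass/fail version. This is a sketch, not a canonical prompt; the rubric wording and JSON shape are my own choices:

```python
import json

PASS_FAIL_RUBRIC = """Answer PASS only if the response is faithful to the
provided context and directly addresses the question. Otherwise answer FAIL."""

def build_judge_prompt(query: str, response: str) -> str:
    """Binary judge prompt: PASS/FAIL rather than a 0-10 scale."""
    return (
        f"{PASS_FAIL_RUBRIC}\n\n"
        f"Question: {query}\n"
        f"Response: {response}\n\n"
        'Return ONLY a JSON object: {"verdict": "PASS" or "FAIL", '
        '"reasoning": "<brief explanation>"}'
    )

def parse_verdict(raw: str) -> float:
    """Map the judge's JSON verdict onto a 0/1 score."""
    return 1.0 if json.loads(raw)["verdict"] == "PASS" else 0.0
```

The two-function split is deliberate: you can unit-test `parse_verdict` without calling a model at all.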
Here is my honest assessment of the major frameworks, based on having used them in production.
Best for teams that want YAML-driven, CI-friendly eval.
```yaml
# promptfooconfig.yaml
prompts:
  - "Answer the question based on the context.\n\nContext: {{context}}\nQuestion: {{query}}"
providers:
  - id: openai:gpt-4o
    config:
      temperature: 0
tests:
  - vars:
      query: "What is the refund policy?"
      context: "Refunds are available within 30 days of purchase."
    assert:
      - type: contains
        value: "30 days"
      - type: llm-rubric
        value: "Response accurately states the refund policy"
      - type: cost
        threshold: 0.01
```
Strengths: Declarative config, built-in CI/CD integration, great for prompt regression testing. Weakness: Less flexible for custom metric pipelines.
Best for Python-first teams that want pytest-style eval.
```python
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
)

def test_rag_response():
    test_case = LLMTestCase(
        input="What is the refund policy?",
        actual_output="Refunds are available within 30 days.",
        retrieval_context=[
            "Refunds are available within 30 days of purchase."
        ],
    )
    relevancy = AnswerRelevancyMetric(threshold=0.7)
    faithfulness = FaithfulnessMetric(threshold=0.8)
    assert_test(test_case, [relevancy, faithfulness])
```
Strengths: 60+ built-in metrics, pytest integration, red-teaming support. Weakness: Heavier dependency footprint.
Best for RAG-specific evaluation pipelines.
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset

eval_dataset = Dataset.from_dict({
    "question": ["What is the refund policy?"],
    "answer": ["Refunds are available within 30 days."],
    "contexts": [["Refunds are available within 30 days of purchase."]],
    "ground_truth": ["Refunds within 30 days of purchase."]
})

result = evaluate(eval_dataset, metrics=[
    faithfulness, answer_relevancy, context_precision
])
print(result)
# {'faithfulness': 0.95, 'answer_relevancy': 0.91, 'context_precision': 0.88}
```
Strengths: Purpose-built for RAG, strong academic grounding, reference-free metrics available. Weakness: Narrower scope than general-purpose frameworks.
Best for teams that want a managed platform with production tracing.
Strengths: End-to-end platform (tracing + evals + datasets), production logs become eval datasets automatically, used by Stripe and Notion. Weakness: Closed-source core, vendor lock-in risk.
If you are starting from zero, pick promptfoo for prompt-level regression testing and DeepEval or RAGAS for deeper metric evaluation. You do not need a platform on day one. You need a test suite that runs in CI.
Organize evals into three tiers, just like traditional software testing:
```text
Tier 1: Unit Evals (fast, run on every commit)
├── Exact-match on structured outputs
├── Format validation (JSON schema, length)
└── Basic relevance checks

Tier 2: Integration Evals (moderate, run on PR merge)
├── RAG faithfulness and precision
├── Multi-turn conversation coherence
└── Tool-use accuracy

Tier 3: System Evals (slow, run nightly or weekly)
├── End-to-end user scenario tests
├── Red-teaming and adversarial tests
└── Cost and latency benchmarks
```
The key principle: Fast feedback on every commit. Deep analysis on a schedule. Never block developers with slow evals when quick ones suffice.
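If you run your evals through pytest, one simple way to wire up the tiers is custom markers, so CI can select a tier with `pytest -m tier1`. A sketch; the marker names are my own convention, and you would register them in `pytest.ini` to silence unknown-marker warnings:

```python
import pytest

# Register in pytest.ini:
#   [pytest]
#   markers =
#       tier1: fast unit evals, every commit
#       tier2: integration evals, PR merge
#       tier3: slow system evals, nightly

@pytest.mark.tier1
def test_output_is_valid_json():
    ...  # fast format check

@pytest.mark.tier2
def test_rag_faithfulness():
    ...  # LLM-judged, slower

@pytest.mark.tier3
def test_end_to_end_scenario():
    ...  # full user journey, slowest
```

Then the commit hook runs `pytest -m tier1`, the merge pipeline `pytest -m "tier1 or tier2"`, and the nightly job runs everything.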
Here is what you build this week. Not next sprint. This week.
```bash
# If you chose promptfoo, your first run looks like this:
npx promptfoo@latest init
# Edit promptfooconfig.yaml with your test cases
npx promptfoo@latest eval
npx promptfoo@latest view   # See results in browser

# If you chose DeepEval:
pip install deepeval
deepeval test run tests/evals/ --verbose
```
This is not months of work. I have set up eval suites like this in a single afternoon. The tooling is mature. The hard part is not technology. It is the discipline to treat AI quality as a first-class engineering concern.
You have an eval suite with test cases, scorers, and a runner. But right now your scoring functions are generic. Next, we go deep on the three metrics that matter most -- factuality, relevance, and faithfulness -- so you know exactly what to measure and how to interpret the numbers.