In the previous lesson, I explained why vibe checks fail. Now I will show you how to build the system that replaces them. An eval suite is not a one-time test. It is a living pipeline that runs every time you change your system. By the end of this lesson, you will have the blueprint for one.
Every evaluation system, regardless of framework, reduces to three pieces:

1. A golden dataset: curated test cases with known-good outputs.
2. Scorers: functions that grade each response against expectations.
3. A runner: the harness that executes the suite and reports results.
That is it. Everything else is tooling and convenience. If you understand these three pieces, you can evaluate any AI system.
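To make the three pieces concrete, here is a minimal sketch of how they fit together. The names (`EvalCase`, `run_suite`) and the toy echo model are illustrative, not any framework's API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    id: str
    input: str
    expected_output: str

def run_suite(cases: list[EvalCase],
              generate: Callable[[str], str],
              score: Callable[[str, str], float]) -> dict[str, float]:
    """The runner: feed every case through the system, grade each output."""
    results = {}
    for case in cases:
        actual = generate(case.input)
        results[case.id] = score(case.expected_output, actual)
    return results

# Toy usage: an echo "model" and an exact-match scorer.
cases = [EvalCase("T-001", "ping", "ping")]
scores = run_suite(cases, generate=lambda q: q,
                   score=lambda e, a: 1.0 if e == a else 0.0)
print(scores)  # {'T-001': 1.0}
```

Every framework in this lesson is some elaboration of this loop.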
A golden dataset is a curated set of input-output pairs where you know what "good" looks like. This is the foundation. Without it, nothing else works.
What goes into a golden dataset:
```typescript
interface EvalCase {
  id: string;
  input: string;            // The user query or prompt
  expectedOutput?: string;  // Ideal response (if you have one)
  context?: string[];       // Retrieved documents (for RAG evals)
  metadata: {
    category: string;       // e.g., "factual", "summarization", "code"
    difficulty: string;     // e.g., "easy", "medium", "hard"
    source: string;         // Where this case came from
  };
}
```
Where to source test cases:
| Source | Strength | Watch Out For |
|---|---|---|
| Production logs | Real user behavior | May contain PII |
| Support tickets | Known failure modes | Selection bias toward complaints |
| Domain experts | High-quality edge cases | Expensive, slow to collect |
| Synthetic generation | Scale, coverage | Can miss real-world messiness |
| Red-teaming sessions | Adversarial coverage | Overfits to attack patterns |
My rule of thumb: Start with 50 cases. Get 20 from production logs, 15 from known failure modes, 10 from domain experts, and 5 adversarial cases. You can grow from there, but 50 well-chosen cases will catch most regressions.
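One way to keep yourself honest about that mix is a small coverage check. This is a sketch under my own assumptions: the target counts come from the rule of thumb above, and the source labels (`production`, `failure_mode`, etc.) are hypothetical values for the `metadata.source` field:

```python
from collections import Counter

# Target composition from the 50-case rule of thumb (labels are illustrative).
TARGET_MIX = {"production": 20, "failure_mode": 15, "expert": 10, "adversarial": 5}

def coverage_gaps(cases: list[dict]) -> dict[str, int]:
    """Report how many cases each source still needs to hit the target mix."""
    have = Counter(c["metadata"]["source"] for c in cases)
    return {src: max(0, want - have[src]) for src, want in TARGET_MIX.items()}
```

Run it whenever you add cases, and stop when every gap reads zero.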
Example golden dataset rows for a customer support RAG system:
| id | input | expectedOutput | context | category | difficulty |
|---|---|---|---|---|---|
| CS-001 | "What is the refund policy?" | "Refunds within 30 days of purchase, processed in 5 business days" | ["Refund policy doc section 2.1"] | policy | easy |
| CS-002 | "Can I get a refund after 45 days?" | "Refunds are only available within 30 days. Contact support for exceptions." | ["Refund policy doc section 2.1"] | policy | medium |
| CS-003 | "I bought the enterprise plan but need to downgrade" | "Enterprise downgrades require contacting your account manager..." | ["Billing docs section 4.3", "Enterprise terms"] | billing | hard |
| CS-004 | "your product sucks give me my money back" | Empathetic acknowledgment + refund policy + escalation path | ["Refund policy doc", "Customer service guidelines"] | adversarial | hard |
| CS-005 | "What integrations do you support?" | "We support Slack, GitHub, Jira, and Salesforce..." | ["Integrations overview page"] | factual | easy |
Notice the range: easy factual lookups, boundary conditions (45 days vs. 30-day policy), multi-document reasoning (enterprise downgrades), and adversarial phrasing. This is the diversity that catches real failures.
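In practice I keep datasets like this in a JSONL file, one case per line, so they diff cleanly in version control. A minimal loader sketch (the field names follow the `EvalCase` schema above; the validation is a deliberate design choice, failing fast on malformed rows rather than mid-run):

```python
import json

def load_golden_dataset(path: str) -> list[dict]:
    """Load eval cases from a JSONL file, one JSON object per line."""
    cases = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            case = json.loads(line)
            # Fail fast on malformed rows instead of halfway through a run.
            assert "id" in case and "input" in case, f"malformed case: {case}"
            cases.append(case)
    return cases
```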
Not all evals need the same scoring approach. Match your scorer to what you are measuring.
Use when there is a single correct answer.
```python
def exact_match(expected: str, actual: str) -> float:
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0
```
Best for: classification tasks, entity extraction, structured outputs.
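For structured outputs specifically, plain string comparison is brittle: the model may reorder JSON keys or vary whitespace while being semantically identical. A hedged variant I find useful, comparing parsed values instead of raw strings:

```python
import json

def json_exact_match(expected: str, actual: str) -> float:
    """Compare structured outputs by parsed value, so key order and
    whitespace differences do not count as failures."""
    try:
        return 1.0 if json.loads(expected) == json.loads(actual) else 0.0
    except json.JSONDecodeError:
        return 0.0  # unparseable output is itself a failure

print(json_exact_match('{"a": 1, "b": 2}', '{ "b": 2, "a": 1 }'))  # 1.0
```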
Use when the meaning matters more than the exact wording.
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_similarity(expected: str, actual: str) -> float:
    emb_a = model.encode(expected, convert_to_tensor=True)
    emb_b = model.encode(actual, convert_to_tensor=True)
    return util.cos_sim(emb_a, emb_b).item()
```
Best for: open-ended generation where multiple phrasings are acceptable.
Use when quality requires nuanced judgment that heuristics cannot capture.
```python
import json

async def llm_judge(query: str, response: str, rubric: str) -> float:
    # `llm` is whatever model client you use; generate() is a placeholder
    # for its completion call.
    prompt = f"""Rate the following response on a scale of 0 to 1.

Question: {query}
Response: {response}
Rubric: {rubric}

Return ONLY a JSON object: {{"score": <float>, "reasoning": "<brief explanation>"}}
"""
    result = await llm.generate(prompt)
    return json.loads(result)["score"]
```
Best for: faithfulness, helpfulness, tone, safety. I use this pattern heavily. The key insight from recent research is to prefer binary (pass/fail) or 3-point scales over 10-point scales. Binary judgments are significantly more consistent. In studies on LLM-as-judge reliability, few-shot prompting improved GPT-4's consistency from 65% to 77.5%.
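Following that binary-scale advice, here is how I would restructure the judge above into a pass/fail version. This is a sketch, not a canonical prompt; the rubric wording and JSON shape are my own choices:

```python
import json

PASS_FAIL_RUBRIC = """Answer PASS only if the response is faithful to the
provided context and directly addresses the question. Otherwise answer FAIL."""

def build_judge_prompt(query: str, response: str) -> str:
    """Binary judge prompt: PASS/FAIL rather than a 0-10 scale."""
    return (
        f"{PASS_FAIL_RUBRIC}\n\n"
        f"Question: {query}\n"
        f"Response: {response}\n\n"
        'Return ONLY a JSON object: {"verdict": "PASS" or "FAIL", '
        '"reasoning": "<brief explanation>"}'
    )

def parse_verdict(raw: str) -> float:
    """Map the judge's JSON verdict onto a 0/1 score."""
    return 1.0 if json.loads(raw)["verdict"] == "PASS" else 0.0
```

The two-function split is deliberate: you can unit-test `parse_verdict` without calling a model at all.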
Here is my honest assessment of the major frameworks, based on having used them in production.
Best for teams that want YAML-driven, CI-friendly eval.
```yaml
# promptfooconfig.yaml
prompts:
  - "Answer the question based on the context.\n\nContext: {{context}}\nQuestion: {{query}}"
providers:
  - id: openai:gpt-4o
    config:
      temperature: 0
tests:
  - vars:
      query: "What is the refund policy?"
      context: "Refunds are available within 30 days of purchase."
    assert:
      - type: contains
        value: "30 days"
      - type: llm-rubric
        value: "Response accurately states the refund policy"
      - type: cost
        threshold: 0.01
```
Strengths: Declarative config, built-in CI/CD integration, great for prompt regression testing. Weakness: Less flexible for custom metric pipelines.
Best for Python-first teams that want pytest-style eval.
```python
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
)

def test_rag_response():
    test_case = LLMTestCase(
        input="What is the refund policy?",
        actual_output="Refunds are available within 30 days.",
        retrieval_context=[
            "Refunds are available within 30 days of purchase."
        ],
    )
    relevancy = AnswerRelevancyMetric(threshold=0.7)
    faithfulness = FaithfulnessMetric(threshold=0.8)
    assert_test(test_case, [relevancy, faithfulness])
```
Strengths: 60+ built-in metrics, pytest integration, red-teaming support. Weakness: Heavier dependency footprint.
Best for RAG-specific evaluation pipelines.
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset

eval_dataset = Dataset.from_dict({
    "question": ["What is the refund policy?"],
    "answer": ["Refunds are available within 30 days."],
    "contexts": [["Refunds are available within 30 days of purchase."]],
    "ground_truth": ["Refunds within 30 days of purchase."]
})

result = evaluate(eval_dataset, metrics=[
    faithfulness, answer_relevancy, context_precision
])
print(result)
# {'faithfulness': 0.95, 'answer_relevancy': 0.91, 'context_precision': 0.88}
```
Strengths: Purpose-built for RAG, strong academic grounding, reference-free metrics available. Weakness: Narrower scope than general-purpose frameworks.
Best for teams that want a managed platform with production tracing.
Strengths: End-to-end platform (tracing + evals + datasets), production logs become eval datasets automatically, used by Stripe and Notion. Weakness: Closed-source core, vendor lock-in risk.
If you are starting from zero, pick promptfoo for prompt-level regression testing and DeepEval or RAGAS for deeper metric evaluation. You do not need a platform on day one. You need a test suite that runs in CI.
Organize evals into three tiers, just like traditional software testing:
```text
Tier 1: Unit Evals (fast, run on every commit)
├── Exact-match on structured outputs
├── Format validation (JSON schema, length)
└── Basic relevance checks

Tier 2: Integration Evals (moderate, run on PR merge)
├── RAG faithfulness and precision
├── Multi-turn conversation coherence
└── Tool-use accuracy

Tier 3: System Evals (slow, run nightly or weekly)
├── End-to-end user scenario tests
├── Red-teaming and adversarial tests
└── Cost and latency benchmarks
```
The key principle: Fast feedback on every commit. Deep analysis on a schedule. Never block developers with slow evals when quick ones suffice.
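If you run your evals through pytest, one simple way to wire up the tiers is custom markers, so CI can select a tier with `pytest -m tier1`. A sketch; the marker names are my own convention, and you would register them in `pytest.ini` to silence unknown-marker warnings:

```python
import pytest

# Register in pytest.ini:
#   [pytest]
#   markers =
#       tier1: fast unit evals, every commit
#       tier2: integration evals, PR merge
#       tier3: slow system evals, nightly

@pytest.mark.tier1
def test_output_is_valid_json():
    ...  # fast format check

@pytest.mark.tier2
def test_rag_faithfulness():
    ...  # LLM-judged, slower

@pytest.mark.tier3
def test_end_to_end_scenario():
    ...  # full user journey, slowest
```

Then the commit hook runs `pytest -m tier1`, the merge pipeline `pytest -m "tier1 or tier2"`, and the nightly job runs everything.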
Here is what you build this week. Not next sprint. This week.
```bash
# If you chose promptfoo, your first run looks like this:
npx promptfoo@latest init
# Edit promptfooconfig.yaml with your test cases
npx promptfoo@latest eval
npx promptfoo@latest view   # See results in browser

# If you chose DeepEval:
pip install deepeval
deepeval test run tests/evals/ --verbose
```
This is not months of work. I have set up eval suites like this in a single afternoon. The tooling is mature. The hard part is not technology. It is the discipline to treat AI quality as a first-class engineering concern.
You have an eval suite with test cases, scorers, and a runner. But right now your scoring functions are generic. Next, we go deep on the three metrics that matter most -- factuality, relevance, and faithfulness -- so you know exactly what to measure and how to interpret the numbers.