Traditional regression testing is straightforward: run the same inputs, expect the same outputs, fail if they differ. LLMs break this assumption completely. The same prompt can produce different outputs on every call. Temperature, model updates, and even server-side batching introduce variance that makes exact-match testing useless. In this lesson, I will show you how to build regression testing that works for non-deterministic systems.
In conventional software, a function is deterministic. add(2, 3) always returns 5. If it returns 6 after a code change, the test fails. The signal is clear.
LLMs are different:
# Run the same prompt three times
responses = [llm("Summarize the Q3 report") for _ in range(3)]
# Get three different outputs:
# "Q3 revenue grew 12% year-over-year..."
# "The third quarter showed strong revenue growth of 12%..."
# "In Q3, the company reported a 12% increase in revenue..."
All three are correct. None are identical. An exact-match test would fail every time, even when the system is working perfectly.
This is the fundamental challenge: how do you detect real regression when natural variance is expected?
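To make the failure mode concrete, here is a toy sketch using the three summaries above as canned strings (standing in for real LLM calls): an exact-match comparison fails on every pair, while a check on the shared fact passes for all three.

```python
# Canned outputs standing in for three real LLM calls.
outputs = [
    "Q3 revenue grew 12% year-over-year...",
    "The third quarter showed strong revenue growth of 12%...",
    "In Q3, the company reported a 12% increase in revenue...",
]

# Exact-match testing: all three strings differ, so it always fails.
assert len(set(outputs)) == 3  # three distinct wordings

# Property-based check: assert on the invariant fact, not the wording.
assert all("12%" in summary for summary in outputs)
```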
I use a two-track approach that separates deterministic checks from semantic checks.
For any eval where there is a single correct answer, eliminate variance at the source.
import openai
# Force deterministic output
client = openai.OpenAI()
response = client.chat.completions.create(
model="gpt-4o",
temperature=0, # Eliminate sampling randomness
    seed=42, # Pin the seed (best-effort; OpenAI does not guarantee bit-identical outputs)
messages=[{"role": "user", "content": "Extract the date: 'Meeting on March 15'"}]
)
# Consistently returns: "March 15"
Use deterministic tests for:
- Structured extraction (dates, IDs, amounts) where there is a single correct answer
- Classification and routing decisions
- Format validation, such as "output must be valid JSON with these fields"
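These cases can use plain exact-match assertions. A minimal sketch, assuming a hypothetical `extract_date` wrapper around the pinned-seed call above (stubbed here so the example runs offline):

```python
def extract_date(text: str) -> str:
    """Hypothetical wrapper around the deterministic LLM call above
    (temperature=0, pinned seed). Stubbed for illustration; a real
    implementation would send `text` to the model."""
    return "March 15" if "March 15" in text else ""

def test_extract_date_exact():
    # Exact match is appropriate here: each input has one correct answer.
    assert extract_date("Meeting on March 15") == "March 15"
    assert extract_date("No date in this sentence") == ""
```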
For open-ended generation, test meaning rather than wording.
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
def test_summary_quality():
"""Regression test: Q3 report summary should remain
relevant and faithful after prompt changes."""
test_case = LLMTestCase(
input="Summarize the Q3 earnings report",
actual_output=generate_summary(q3_report),
retrieval_context=[q3_report],
)
# These thresholds are our regression baseline
relevancy = AnswerRelevancyMetric(threshold=0.85)
faithfulness = FaithfulnessMetric(threshold=0.80)
relevancy.measure(test_case)
faithfulness.measure(test_case)
assert relevancy.score >= 0.85, (
f"Relevancy regressed: {relevancy.score:.2f} < 0.85"
)
assert faithfulness.score >= 0.80, (
f"Faithfulness regressed: {faithfulness.score:.2f} < 0.80"
)
The key principle: Do not assert on the output text. Assert on the metric score. The text can vary freely as long as the quality stays above your threshold.
A regression test is meaningless without a baseline. Here is how I establish one.
# Using promptfoo
npx promptfoo@latest eval -c promptfooconfig.yaml -o baseline.json
# Or using DeepEval
deepeval test run tests/evals/ --output baseline.json
{
"baseline": {
"date": "2026-02-25",
"model": "gpt-4o-2025-11-20",
"prompt_version": "v2.3",
"scores": {
"faithfulness": 0.91,
"relevance": 0.94,
"factuality": 0.89,
"avg_latency_ms": 1200,
"avg_cost_per_query": 0.003
},
"pass_rate": 0.96
}
}
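A baseline file like this can be produced from any runner's scores. A small sketch, using the field names from the JSON above (adapt them to whatever your eval runner actually emits):

```python
import json
from datetime import date

def write_baseline(scores: dict, pass_rate: float, model: str,
                   prompt_version: str,
                   path: str = "eval/baseline.json") -> dict:
    """Snapshot the current run in the baseline format shown above.
    Field names mirror that JSON sketch; they are not mandated by
    any particular eval tool."""
    baseline = {
        "baseline": {
            "date": date.today().isoformat(),
            "model": model,
            "prompt_version": prompt_version,
            "scores": scores,
            "pass_rate": pass_rate,
        }
    }
    with open(path, "w") as f:
        json.dump(baseline, f, indent=2)
    return baseline
```

Commit the resulting file so every later CI run has a fixed point of comparison.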
I typically set the regression threshold at 95% of the baseline. If faithfulness was 0.91, the regression threshold is 0.91 × 0.95 ≈ 0.865. This accounts for natural variance while catching real degradation.
REGRESSION_TOLERANCE = 0.05 # Allow a 5% drop before alerting

class RegressionError(AssertionError):
    """Raised when a metric falls below its regression threshold."""

def check_regression(current_score: float, baseline_score: float,
                     metric_name: str) -> bool:
    threshold = baseline_score * (1 - REGRESSION_TOLERANCE)
    if current_score < threshold:
        raise RegressionError(
            f"{metric_name} regressed: {current_score:.3f} < "
            f"{threshold:.3f} (baseline: {baseline_score:.3f})"
        )
    return True
Here is a production-ready GitHub Actions workflow that runs evals on every pull request.
# .github/workflows/llm-eval.yml
name: LLM Evaluation
on:
pull_request:
paths:
- 'prompts/**'
- 'src/ai/**'
- 'eval/**'
jobs:
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.11'
- name: Install dependencies
run: pip install deepeval ragas openai
- name: Run eval suite
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: |
deepeval test run tests/evals/ \
--output results.json \
--verbose
- name: Check regression
run: |
python scripts/check_regression.py \
--baseline eval/baseline.json \
--current results.json \
--tolerance 0.05
- name: Upload results
if: always()
uses: actions/upload-artifact@v4
with:
name: eval-results
path: results.json
With promptfoo, the CI integration is even simpler:
- name: Run promptfoo eval
run: |
npx promptfoo@latest eval \
-c eval/promptfooconfig.yaml \
-o results.json \
--grader openai:gpt-4o
- name: Assert pass rate
run: |
npx promptfoo@latest eval \
-c eval/promptfooconfig.yaml \
--fail-on-error \
--threshold 0.95
The --threshold 0.95 flag tells promptfoo to exit with a non-zero code if less than 95% of test cases pass. This blocks the PR from merging.
LLM evals in CI have a specific challenge: flakiness. A test that passes 95% of the time will fail on one out of every twenty CI runs, creating noise that erodes trust in your test suite.
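That flakiness compounds across a suite: with n independent tests that each pass 95% of the time, the whole suite is green only 0.95^n of runs. A quick back-of-envelope check:

```python
# Probability that a suite of n independent, 95%-reliable tests
# all pass in a single CI run.
for n in (1, 5, 10, 20):
    print(f"{n:2d} tests -> fully green {0.95 ** n:.0%} of runs")
# At 20 such tests, the suite passes end-to-end only about 36% of
# the time, which is why per-case hard gating does not scale.
```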
Strategies I use to manage flakiness. The first is to run each semantic eval several times and score on the median, which smooths out one-off judge variance:
def eval_with_retry(test_case, metric, trials=3) -> float:
"""Run the eval multiple times and take the median score."""
scores = []
for _ in range(trials):
metric.measure(test_case)
scores.append(metric.score)
return sorted(scores)[len(scores) // 2] # Median
Instead of "this one run must pass," assert that the pass rate across all cases exceeds a threshold.
def check_suite_health(results: list[float], min_pass_rate=0.90):
"""Allow individual case failures if overall health is good."""
passed = sum(1 for r in results if r >= 0.8)
pass_rate = passed / len(results)
assert pass_rate >= min_pass_rate, (
f"Suite pass rate {pass_rate:.1%} below minimum {min_pass_rate:.1%}"
)
Not every eval should block a PR. I categorize evals as:
- Blocking: a small, fast suite of deterministic checks and the highest-signal semantic metrics. These must pass before merge.
- Advisory: the full semantic suite, reported on the PR (and run nightly) but not allowed to block it.
This keeps CI fast and trustworthy while still giving visibility into the full quality picture.
Your baseline is not permanent. Re-baseline when:
- You upgrade or switch the underlying model.
- You intentionally change prompts and accept the new behavior.
- You expand or revise the eval dataset itself.
I re-baseline roughly once per quarter, or whenever a major system component changes.
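Re-baselining can be as simple as promoting the latest results file. A minimal sketch, assuming the file layout used earlier in this lesson:

```python
import shutil

def rebaseline(current: str = "results.json",
               baseline: str = "eval/baseline.json") -> None:
    """Promote the latest eval results to be the new baseline.
    Paths follow the layout used in this lesson; the outgoing
    baseline is kept as a .bak so the change shows up in review."""
    shutil.copy(baseline, baseline + ".bak")  # preserve the old baseline
    shutil.copy(current, baseline)            # promote the new scores
```

Run it deliberately (for example, via a make target), never automatically from CI, so every baseline change is an explicit, reviewed decision.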
By the end of today, you should have:
- A baseline file (eval/baseline.json) with your current scores from Lesson 3.
- A regression check script (scripts/check_regression.py) using the threshold logic above.
- A CI workflow (.github/workflows/llm-eval.yml) that runs evals on every PR that touches prompts or AI code.
# scripts/check_regression.py
import json
import sys
import argparse
def main():
parser = argparse.ArgumentParser()
parser.add_argument('--baseline', required=True)
parser.add_argument('--current', required=True)
parser.add_argument('--tolerance', type=float, default=0.05)
args = parser.parse_args()
with open(args.baseline) as f:
baseline = json.load(f)["baseline"]["scores"]
with open(args.current) as f:
current = json.load(f)
regressions = []
for metric in ["faithfulness", "relevance", "factuality"]:
threshold = baseline[metric] * (1 - args.tolerance)
actual = current.get(metric, 0)
if actual < threshold:
regressions.append(
f"{metric}: {actual:.3f} < {threshold:.3f} "
f"(baseline: {baseline[metric]:.3f})"
)
if regressions:
print("REGRESSION DETECTED:")
for r in regressions:
print(f" - {r}")
sys.exit(1)
else:
print("All metrics within tolerance. No regression.")
sys.exit(0)
if __name__ == "__main__":
main()
Commit this alongside your eval config. The next time someone changes a prompt, CI will tell them whether they broke something.
Your evals now run automatically and catch regressions. But the results are trapped in CI logs and JSON files. Next, we turn those numbers into a confidence dashboard that stakeholders can read in thirty seconds -- the artifact that converts eval discipline into organizational trust.