Traditional regression testing is straightforward: run the same inputs, expect the same outputs, fail if they differ. LLMs break this assumption completely. The same prompt can produce different outputs on every call. Temperature, model updates, and even server-side batching introduce variance that makes exact-match testing useless. In this lesson, I will show you how to build regression testing that works for non-deterministic systems.
In conventional software, a function is deterministic. add(2, 3) always returns 5. If it returns 6 after a code change, the test fails. The signal is clear.
LLMs are different:
# Run the same prompt three times
responses = [llm("Summarize the Q3 report") for _ in range(3)]
# Get three different outputs:
# "Q3 revenue grew 12% year-over-year..."
# "The third quarter showed strong revenue growth of 12%..."
# "In Q3, the company reported a 12% increase in revenue..."
All three are correct. None are identical. An exact-match test would fail every time, even when the system is working perfectly.
This is the fundamental challenge: how do you detect real regression when natural variance is expected?
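To make the failure mode concrete, here is a toy sketch using the three summaries above as canned strings (standing in for real LLM calls): an exact-match comparison fails on every pair, while a check on the shared fact passes for all three.

```python
# Canned outputs standing in for three real LLM calls.
outputs = [
    "Q3 revenue grew 12% year-over-year...",
    "The third quarter showed strong revenue growth of 12%...",
    "In Q3, the company reported a 12% increase in revenue...",
]

# Exact-match testing: all three strings differ, so it always fails.
assert len(set(outputs)) == 3  # three distinct wordings

# Property-based check: assert on the invariant fact, not the wording.
assert all("12%" in summary for summary in outputs)
```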
I use a two-track approach that separates deterministic checks from semantic checks.
For any eval where there is a single correct answer, eliminate variance at the source.
import openai
# Force deterministic output
client = openai.OpenAI()
response = client.chat.completions.create(
model="gpt-4o",
temperature=0, # Eliminate sampling randomness
    seed=42, # Pin the seed (best-effort; OpenAI does not guarantee bit-identical outputs)
messages=[{"role": "user", "content": "Extract the date: 'Meeting on March 15'"}]
)
# Consistently returns: "March 15"
Use deterministic tests for:
- Structured extraction (dates, IDs, amounts) where there is a single correct answer
- Classification and routing decisions
- Format validation, such as "output must be valid JSON with these fields"
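These cases can use plain exact-match assertions. A minimal sketch, assuming a hypothetical `extract_date` wrapper around the pinned-seed call above (stubbed here so the example runs offline):

```python
def extract_date(text: str) -> str:
    """Hypothetical wrapper around the deterministic LLM call above
    (temperature=0, pinned seed). Stubbed for illustration; a real
    implementation would send `text` to the model."""
    return "March 15" if "March 15" in text else ""

def test_extract_date_exact():
    # Exact match is appropriate here: each input has one correct answer.
    assert extract_date("Meeting on March 15") == "March 15"
    assert extract_date("No date in this sentence") == ""
```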
For open-ended generation, test meaning rather than wording.
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
def test_summary_quality():
"""Regression test: Q3 report summary should remain
relevant and faithful after prompt changes."""
test_case = LLMTestCase(
input="Summarize the Q3 earnings report",
actual_output=generate_summary(q3_report),
retrieval_context=[q3_report],
)
# These thresholds are our regression baseline
relevancy = AnswerRelevancyMetric(threshold=0.85)
faithfulness = FaithfulnessMetric(threshold=0.80)
relevancy.measure(test_case)
faithfulness.measure(test_case)
assert relevancy.score >= 0.85, (
f"Relevancy regressed: {relevancy.score:.2f} < 0.85"
)
assert faithfulness.score >= 0.80, (
f"Faithfulness regressed: {faithfulness.score:.2f} < 0.80"
)
The key principle: Do not assert on the output text. Assert on the metric score. The text can vary freely as long as the quality stays above your threshold.
A regression test is meaningless without a baseline. Here is how I establish one.
# Using promptfoo
npx promptfoo@latest eval -c promptfooconfig.yaml -o baseline.json
# Or using DeepEval
deepeval test run tests/evals/ --output baseline.json
{
"baseline": {
"date": "2026-02-25",
"model": "gpt-4o-2025-11-20",
"prompt_version": "v2.3",
"scores": {
"faithfulness": 0.91,
"relevance": 0.94,
"factuality": 0.89,
"avg_latency_ms": 1200,
"avg_cost_per_query": 0.003
},
"pass_rate": 0.96
}
}
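A baseline file like this can be produced from any runner's scores. A small sketch, using the field names from the JSON above (adapt them to whatever your eval runner actually emits):

```python
import json
from datetime import date

def write_baseline(scores: dict, pass_rate: float, model: str,
                   prompt_version: str,
                   path: str = "eval/baseline.json") -> dict:
    """Snapshot the current run in the baseline format shown above.
    Field names mirror that JSON sketch; they are not mandated by
    any particular eval tool."""
    baseline = {
        "baseline": {
            "date": date.today().isoformat(),
            "model": model,
            "prompt_version": prompt_version,
            "scores": scores,
            "pass_rate": pass_rate,
        }
    }
    with open(path, "w") as f:
        json.dump(baseline, f, indent=2)
    return baseline
```

Commit the resulting file so every later CI run has a fixed point of comparison.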
I typically set the regression threshold at 95% of the baseline. If faithfulness was 0.91, the regression threshold is 0.91 × 0.95 ≈ 0.865. This accounts for natural variance while catching real degradation.
REGRESSION_TOLERANCE = 0.05 # Allow a 5% drop before alerting

class RegressionError(AssertionError):
    """Raised when a metric falls below its regression threshold."""

def check_regression(current_score: float, baseline_score: float,
                     metric_name: str) -> bool:
    threshold = baseline_score * (1 - REGRESSION_TOLERANCE)
    if current_score < threshold:
        raise RegressionError(
            f"{metric_name} regressed: {current_score:.3f} < "
            f"{threshold:.3f} (baseline: {baseline_score:.3f})"
        )
    return True
Here is a production-ready GitHub Actions workflow that runs evals on every pull request.
# .github/workflows/llm-eval.yml
name: LLM Evaluation
on:
pull_request:
paths:
- 'prompts/**'
- 'src/ai/**'
- 'eval/**'
jobs:
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.11'
- name: Install dependencies
run: pip install deepeval ragas openai
- name: Run eval suite
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: |
deepeval test run tests/evals/ \
--output results.json \
--verbose
- name: Check regression
run: |
python scripts/check_regression.py \
--baseline eval/baseline.json \
--current results.json \
--tolerance 0.05
- name: Upload results
if: always()
uses: actions/upload-artifact@v4
with:
name: eval-results
path: results.json
With promptfoo, the CI integration is even simpler:
- name: Run promptfoo eval
run: |
npx promptfoo@latest eval \
-c eval/promptfooconfig.yaml \
-o results.json \
--grader openai:gpt-4o
- name: Assert pass rate
run: |
npx promptfoo@latest eval \
-c eval/promptfooconfig.yaml \
--fail-on-error \
--threshold 0.95
The --threshold 0.95 flag tells promptfoo to exit with a non-zero code if less than 95% of test cases pass. This blocks the PR from merging.
LLM evals in CI have a specific challenge: flakiness. A test that passes 95% of the time will fail on one out of every twenty CI runs, creating noise that erodes trust in your test suite.
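That flakiness compounds across a suite: with n independent tests that each pass 95% of the time, the whole suite is green only 0.95^n of runs. A quick back-of-envelope check:

```python
# Probability that a suite of n independent, 95%-reliable tests
# all pass in a single CI run.
for n in (1, 5, 10, 20):
    print(f"{n:2d} tests -> fully green {0.95 ** n:.0%} of runs")
# At 20 such tests, the suite passes end-to-end only about 36% of
# the time, which is why per-case hard gating does not scale.
```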
Strategies I use to manage flakiness. The first is to run each semantic eval several times and score on the median, which smooths out one-off judge variance:
def eval_with_retry(test_case, metric, trials=3) -> float:
"""Run the eval multiple times and take the median score."""
scores = []
for _ in range(trials):
metric.measure(test_case)
scores.append(metric.score)
return sorted(scores)[len(scores) // 2] # Median
Instead of "this one run must pass," assert that the pass rate across all cases exceeds a threshold.
def check_suite_health(results: list[float], min_pass_rate=0.90):
"""Allow individual case failures if overall health is good."""
passed = sum(1 for r in results if r >= 0.8)
pass_rate = passed / len(results)
assert pass_rate >= min_pass_rate, (
f"Suite pass rate {pass_rate:.1%} below minimum {min_pass_rate:.1%}"
)
Not every eval should block a PR. I categorize evals as:
- Blocking: a small, fast suite of deterministic checks and the highest-signal semantic metrics. These must pass before merge.
- Advisory: the full semantic suite, reported on the PR (and run nightly) but not allowed to block it.
This keeps CI fast and trustworthy while still giving visibility into the full quality picture.
Your baseline is not permanent. Re-baseline when:
- You upgrade or switch the underlying model.
- You intentionally change prompts and accept the new behavior.
- You expand or revise the eval dataset itself.
I re-baseline roughly once per quarter, or whenever a major system component changes.
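Re-baselining can be as simple as promoting the latest results file. A minimal sketch, assuming the file layout used earlier in this lesson:

```python
import shutil

def rebaseline(current: str = "results.json",
               baseline: str = "eval/baseline.json") -> None:
    """Promote the latest eval results to be the new baseline.
    Paths follow the layout used in this lesson; the outgoing
    baseline is kept as a .bak so the change shows up in review."""
    shutil.copy(baseline, baseline + ".bak")  # preserve the old baseline
    shutil.copy(current, baseline)            # promote the new scores
```

Run it deliberately (for example, via a make target), never automatically from CI, so every baseline change is an explicit, reviewed decision.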
By the end of today, you should have:
- A baseline file (eval/baseline.json) with your current scores from Lesson 3.
- A regression check script (scripts/check_regression.py) using the threshold logic above.
- A CI workflow (.github/workflows/llm-eval.yml) that runs evals on every PR that touches prompts or AI code.
# scripts/check_regression.py
import json
import sys
import argparse
def main():
parser = argparse.ArgumentParser()
parser.add_argument('--baseline', required=True)
parser.add_argument('--current', required=True)
parser.add_argument('--tolerance', type=float, default=0.05)
args = parser.parse_args()
with open(args.baseline) as f:
baseline = json.load(f)["baseline"]["scores"]
with open(args.current) as f:
current = json.load(f)
regressions = []
for metric in ["faithfulness", "relevance", "factuality"]:
threshold = baseline[metric] * (1 - args.tolerance)
actual = current.get(metric, 0)
if actual < threshold:
regressions.append(
f"{metric}: {actual:.3f} < {threshold:.3f} "
f"(baseline: {baseline[metric]:.3f})"
)
if regressions:
print("REGRESSION DETECTED:")
for r in regressions:
print(f" - {r}")
sys.exit(1)
else:
print("All metrics within tolerance. No regression.")
sys.exit(0)
if __name__ == "__main__":
main()
Commit this alongside your eval config. The next time someone changes a prompt, CI will tell them whether they broke something.
Your evals now run automatically and catch regressions. But the results are trapped in CI logs and JSON files. Next, we turn those numbers into a confidence dashboard that stakeholders can read in thirty seconds -- the artifact that converts eval discipline into organizational trust.