Here is a specific failure. A SaaS company ships a RAG-powered support chatbot. The product manager reads twenty responses during QA, marks it "ready for production," and the team deploys on a Tuesday. By Friday, three enterprise customers have escalated tickets because the chatbot confidently cited a pricing tier that was deprecated six months ago. The bot was not wrong on most queries. It was wrong on 12% of pricing queries -- a category nobody tested because the PM's twenty manual checks happened to be about feature questions. That 12% error rate cost the company a contract renewal.
The PM did not do anything wrong. They did what almost every AI team does: a vibe check. And the vibe check failed them.
A vibe check is any quality assessment that relies on a human scanning a handful of outputs and forming a subjective opinion. It feels responsible. It feels like due diligence. It is neither.
The vibe check pattern:

1. Generate a handful of outputs.
2. Read them and form a subjective impression.
3. If nothing looks obviously wrong, ship.
4. Repeat whenever someone gets nervous.
This is how most teams evaluate their AI systems today. And it is why most production AI systems silently degrade.
A typical RAG system handles hundreds or thousands of distinct query patterns. When you check ten outputs manually, you cover a few percent of that surface area at best. The failures you miss are not random. They cluster in edge cases that users encounter daily but that never appear in ad hoc testing.
Here is the math. If your system handles 500 distinct query patterns and you manually check 20 of them, you have 4% coverage. If errors occur in 10% of patterns, there is roughly a 12% chance that your sample catches zero errors. That is not bad luck. That is statistics.
```python
import math

def probability_of_missing_errors(
    total_patterns: int,
    sample_size: int,
    error_rate: float,
) -> float:
    """Probability that a random sample catches zero errors."""
    error_patterns = int(total_patterns * error_rate)
    clean_patterns = total_patterns - error_patterns
    # Hypergeometric: P(0 errors in sample) = C(clean, n) / C(total, n)
    return (
        math.comb(clean_patterns, sample_size)
        / math.comb(total_patterns, sample_size)
    )

# 500 patterns, 20 sampled, 10% error rate
p = probability_of_missing_errors(500, 20, 0.10)
print(f"Probability of catching zero errors: {p:.1%}")
# Probability of catching zero errors: 11.6%
```
A 12% chance of total blindness is not an edge case. It is a coin flip you take every time you ship.
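The closed-form number is easy to sanity-check empirically. The sketch below (the function name and defaults are mine, not from any library) draws random 20-pattern samples from a 500-pattern population in which 50 patterns are broken, and counts how often the sample contains no broken patterns at all:

```python
import random

def simulate_miss_rate(
    total_patterns: int = 500,
    sample_size: int = 20,
    error_rate: float = 0.10,
    trials: int = 100_000,
    seed: int = 0,
) -> float:
    """Estimate P(a random sample contains zero error patterns) by simulation."""
    rng = random.Random(seed)
    # Patterns 0..49 are the "broken" ones; which ones are broken is irrelevant
    # because the sample is drawn uniformly at random.
    error_patterns = set(range(int(total_patterns * error_rate)))
    misses = 0
    for _ in range(trials):
        sample = rng.sample(range(total_patterns), sample_size)
        if not any(p in error_patterns for p in sample):
            misses += 1
    return misses / trials

print(f"Simulated miss probability: {simulate_miss_rate():.1%}")
# Converges on the analytic 11.6% as trials grows
```

The simulation and the hypergeometric formula agree: no matter which 50 patterns are broken, a 20-pattern spot check misses all of them about one time in nine.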
The same reviewer will rate the same output differently on Monday versus Friday. Fatigue, anchoring, and recency bias are not theoretical risks. They are measured phenomena. Research on LLM-as-judge evaluation has shown that even trained annotators achieve only 65-78% inter-rater consistency on quality rubrics. If humans cannot agree with themselves, manual checks are not a measurement system. They are noise.
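To make "inter-rater consistency" concrete, here is a minimal sketch of the simplest consistency measure, raw percent agreement, applied to one reviewer's two passes over the same ten outputs. The ratings are hypothetical, chosen only to land inside the range quoted above:

```python
def percent_agreement(ratings_a: list[str], ratings_b: list[str]) -> float:
    """Fraction of items on which two raters (or two passes) agree."""
    assert len(ratings_a) == len(ratings_b), "rating lists must align"
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)

# Same reviewer, same ten outputs, Monday versus Friday (hypothetical data)
monday = ["good", "good", "bad", "good", "good", "bad", "good", "good", "good", "bad"]
friday = ["good", "bad", "bad", "good", "good", "good", "good", "bad", "good", "bad"]
print(f"Self-agreement: {percent_agreement(monday, friday):.0%}")
# Self-agreement: 70%
```

A reviewer who agrees with themselves only 70% of the time cannot resolve a 10% quality difference between two prompt versions; the signal is smaller than the noise floor.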
When you update a prompt, swap a model, or change a retrieval strategy, vibe checks cannot tell you what broke. You have no baseline to compare against. There is no diff. The system could be 20% worse on faithfulness and you would not know until a customer complains -- or worse, until they quietly leave.
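A baseline makes the diff mechanical. Here is a minimal sketch, assuming you store per-metric scores as plain dictionaries; the metric names and the 0.02 tolerance are illustrative, not a standard:

```python
def diff_scores(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> list[str]:
    """List every metric where the candidate regressed past tolerance."""
    regressions = []
    for metric, base in baseline.items():
        cand = candidate.get(metric, 0.0)  # a missing metric counts as a regression
        if base - cand > tolerance:
            regressions.append(f"{metric}: {base:.2f} -> {cand:.2f}")
    return regressions

baseline = {"faithfulness": 0.91, "answer_relevance": 0.88}
candidate = {"faithfulness": 0.72, "answer_relevance": 0.89}  # after a prompt change
print(diff_scores(baseline, candidate))
# ['faithfulness: 0.91 -> 0.72']
```

With a stored baseline, "is it worse?" stops being a debate and becomes a one-line report.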
This is the core issue. Without quantitative metrics, every conversation about quality becomes an opinion debate. "I think it's better." "I think it's worse." "It feels different." These are not engineering conversations. They are arguments with no resolution.
This pattern repeats across organizations of every size.
When I first started treating AI quality as an engineering discipline rather than a subjective judgment call, the difference was stark. Replacing vibe checks with automated evaluation harnesses in production systems was the single biggest factor in lifting impressions by 482%. The outputs did not change dramatically. What changed was the ability to find and fix failures systematically, which built the kind of reliability that earns user trust.
The alternative is not "more careful manual review." The alternative is treating AI evaluation with the same rigor that software engineering applies to testing.
The eval-driven approach:

1. Build a golden dataset of representative queries with known-good answers.
2. Define quantitative metrics, such as faithfulness and relevance, and automate their computation.
3. Record baseline scores before every change.
4. Gate deploys on eval results and block on regressions.
5. Put the scores on a dashboard and alert when they drop.
This is not overhead. This is how you build systems that people actually trust enough to use repeatedly.
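One concrete form of "ship if it passes the gate" is a threshold check over the metrics you already track. A minimal sketch, with made-up metric names and thresholds:

```python
# Illustrative thresholds; every team sets its own per metric
GATES = {"faithfulness": 0.85, "answer_relevance": 0.80}

def passes_gate(scores: dict[str, float], gates: dict[str, float] = GATES) -> bool:
    """Allow a deploy only if every gated metric meets its threshold."""
    return all(
        scores.get(metric, 0.0) >= threshold  # a missing metric fails the gate
        for metric, threshold in gates.items()
    )

print(passes_gate({"faithfulness": 0.92, "answer_relevance": 0.86}))  # True
print(passes_gate({"faithfulness": 0.79, "answer_relevance": 0.86}))  # False
```

Wired into CI, this turns "let's ship and see" into a pass/fail check that runs on every change, the same way unit tests do.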
| Vibe Check Mindset | Eval Engineering Mindset |
|---|---|
| "It looks good to me" | "It scores 0.92 on faithfulness" |
| "Let's ship and see" | "Let's ship if it passes the gate" |
| "Users will tell us if it's broken" | "We will know before users do" |
| Quality is an opinion | Quality is a measurement |
| Testing happens once | Testing is continuous |
Before you build an eval suite, you need to know how exposed you are right now. Run this diagnostic against your own AI system. It takes 30 minutes and produces a score that tells you how urgently you need the rest of this course.
```typescript
interface VibeCheckAudit {
  systemName: string;
  auditDate: string;
  questions: {
    // Coverage
    estimatedQueryPatterns: number;
    manuallyTestedPatterns: number;
    coveragePercent: number;
    // Measurement
    hasAutomatedEvals: boolean;
    hasGoldenDataset: boolean;
    goldenDatasetSize: number;
    hasDefinedMetrics: boolean;
    metricsTracked: string[];
    // Regression
    hasBaselineScores: boolean;
    hasRegressionTests: boolean;
    lastEvalRunDate: string | null;
    deploysBlockedByEvals: boolean;
    // Visibility
    hasDashboard: boolean;
    stakeholdersCanSeeMetrics: boolean;
    alertsOnRegression: boolean;
  };
}

function computeReadinessScore(audit: VibeCheckAudit): {
  score: number;
  grade: string;
  priority: string;
} {
  let score = 0;
  const q = audit.questions;

  // Coverage (0-25 points)
  score += Math.min(25, q.coveragePercent / 4);

  // Measurement (0-30 points)
  if (q.hasAutomatedEvals) score += 10;
  if (q.hasGoldenDataset) score += 5;
  if (q.goldenDatasetSize >= 50) score += 5;
  if (q.hasDefinedMetrics) score += 5;
  if (q.metricsTracked.length >= 3) score += 5;

  // Regression (0-25 points)
  if (q.hasBaselineScores) score += 8;
  if (q.hasRegressionTests) score += 9;
  if (q.deploysBlockedByEvals) score += 8;

  // Visibility (0-20 points)
  if (q.hasDashboard) score += 8;
  if (q.stakeholdersCanSeeMetrics) score += 6;
  if (q.alertsOnRegression) score += 6;

  const grade =
    score >= 80 ? 'A: Eval-driven' :
    score >= 60 ? 'B: Partially measured' :
    score >= 30 ? 'C: Mostly vibes' :
    'D: Flying blind';

  const priority =
    score >= 80 ? 'Refine and expand existing evals' :
    score >= 60 ? 'Automate and add regression gates' :
    score >= 30 ? 'Build golden dataset and basic evals immediately' :
    'Stop shipping until you have measurement in place';

  return { score, grade, priority };
}
```
Run this against your system. Write down the number. When you finish this course and run the audit again, you will have a concrete measure of progress.
In the next lesson, you will design your first eval suite from scratch, starting with the golden dataset that makes everything else possible. You will take the readiness score from your audit and build the specific components that fill the gaps.