Here is a specific failure. A SaaS company ships a RAG-powered support chatbot. The product manager reads twenty responses during QA, marks it "ready for production," and the team deploys on a Tuesday. By Friday, three enterprise customers have escalated tickets because the chatbot confidently cited a pricing tier that was deprecated six months ago. The bot was not wrong on most queries. It was wrong on 12% of pricing queries -- a category nobody tested because the PM's twenty manual checks happened to be about feature questions. That 12% error rate cost the company a contract renewal.
The PM did not do anything wrong. They did what almost every AI team does: a vibe check. And the vibe check failed them.
A vibe check is any quality assessment that relies on a human scanning a handful of outputs and forming a subjective opinion. It feels responsible. It feels like due diligence. It is neither.
The vibe check pattern:

1. Generate a handful of outputs.
2. Read them and form a subjective impression.
3. If nothing looks obviously wrong, ship.
4. Repeat whenever someone gets nervous.
This is how most teams evaluate their AI systems today. And it is why most production AI systems silently degrade.
A typical RAG system handles hundreds or thousands of distinct query patterns. When you check ten outputs manually, you cover a few percent of that surface area at best. The failures you miss are not random. They cluster in edge cases that users encounter daily but that never appear in ad hoc testing.
Here is the math. If your system handles 500 distinct query patterns and you manually check 20 of them, you have 4% coverage. If errors occur in 10% of patterns, there is roughly a 12% chance that your sample catches zero errors. That is not bad luck. That is statistics.
```python
import math

def probability_of_missing_errors(
    total_patterns: int,
    sample_size: int,
    error_rate: float,
) -> float:
    """Probability that a random sample catches zero errors."""
    error_patterns = int(total_patterns * error_rate)
    clean_patterns = total_patterns - error_patterns
    # Hypergeometric: P(0 errors in sample) = C(clean, n) / C(total, n)
    return (
        math.comb(clean_patterns, sample_size)
        / math.comb(total_patterns, sample_size)
    )

# 500 patterns, 20 sampled, 10% error rate
p = probability_of_missing_errors(500, 20, 0.10)
print(f"Probability of catching zero errors: {p:.1%}")
# Probability of catching zero errors: 11.6%
```
A 12% chance of total blindness is not an edge case. It is a coin flip you take every time you ship.
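The closed-form number is easy to sanity-check empirically. The sketch below (the function name and defaults are mine, not from any library) draws random 20-pattern samples from a 500-pattern population in which 50 patterns are broken, and counts how often the sample contains no broken patterns at all:

```python
import random

def simulate_miss_rate(
    total_patterns: int = 500,
    sample_size: int = 20,
    error_rate: float = 0.10,
    trials: int = 100_000,
    seed: int = 0,
) -> float:
    """Estimate P(a random sample contains zero error patterns) by simulation."""
    rng = random.Random(seed)
    # Patterns 0..49 are the "broken" ones; which ones are broken is irrelevant
    # because the sample is drawn uniformly at random.
    error_patterns = set(range(int(total_patterns * error_rate)))
    misses = 0
    for _ in range(trials):
        sample = rng.sample(range(total_patterns), sample_size)
        if not any(p in error_patterns for p in sample):
            misses += 1
    return misses / trials

print(f"Simulated miss probability: {simulate_miss_rate():.1%}")
# Converges on the analytic 11.6% as trials grows
```

The simulation and the hypergeometric formula agree: no matter which 50 patterns are broken, a 20-pattern spot check misses all of them about one time in nine.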
The same reviewer will rate the same output differently on Monday versus Friday. Fatigue, anchoring, and recency bias are not theoretical risks. They are measured phenomena. Research on LLM-as-judge evaluation has shown that even trained annotators achieve only 65-78% inter-rater consistency on quality rubrics. If humans cannot agree with themselves, manual checks are not a measurement system. They are noise.
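To make "inter-rater consistency" concrete, here is a minimal sketch of the simplest consistency measure, raw percent agreement, applied to one reviewer's two passes over the same ten outputs. The ratings are hypothetical, chosen only to land inside the range quoted above:

```python
def percent_agreement(ratings_a: list[str], ratings_b: list[str]) -> float:
    """Fraction of items on which two raters (or two passes) agree."""
    assert len(ratings_a) == len(ratings_b), "rating lists must align"
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)

# Same reviewer, same ten outputs, Monday versus Friday (hypothetical data)
monday = ["good", "good", "bad", "good", "good", "bad", "good", "good", "good", "bad"]
friday = ["good", "bad", "bad", "good", "good", "good", "good", "bad", "good", "bad"]
print(f"Self-agreement: {percent_agreement(monday, friday):.0%}")
# Self-agreement: 70%
```

A reviewer who agrees with themselves only 70% of the time cannot resolve a 10% quality difference between two prompt versions; the signal is smaller than the noise floor.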
When you update a prompt, swap a model, or change a retrieval strategy, vibe checks cannot tell you what broke. You have no baseline to compare against. There is no diff. The system could be 20% worse on faithfulness and you would not know until a customer complains -- or worse, until they quietly leave.
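A baseline makes the diff mechanical. Here is a minimal sketch, assuming you store per-metric scores as plain dictionaries; the metric names and the 0.02 tolerance are illustrative, not a standard:

```python
def diff_scores(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> list[str]:
    """List every metric where the candidate regressed past tolerance."""
    regressions = []
    for metric, base in baseline.items():
        cand = candidate.get(metric, 0.0)  # a missing metric counts as a regression
        if base - cand > tolerance:
            regressions.append(f"{metric}: {base:.2f} -> {cand:.2f}")
    return regressions

baseline = {"faithfulness": 0.91, "answer_relevance": 0.88}
candidate = {"faithfulness": 0.72, "answer_relevance": 0.89}  # after a prompt change
print(diff_scores(baseline, candidate))
# ['faithfulness: 0.91 -> 0.72']
```

With a stored baseline, "is it worse?" stops being a debate and becomes a one-line report.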
This is the core issue. Without quantitative metrics, every conversation about quality becomes an opinion debate. "I think it's better." "I think it's worse." "It feels different." These are not engineering conversations. They are arguments with no resolution.
This pattern repeats across organizations of every size.
When I first started treating AI quality as an engineering discipline rather than a subjective judgment call, the difference was stark. Replacing vibe checks with automated evaluation harnesses in production systems was the single biggest factor in lifting impressions by 482%. The outputs did not change dramatically. What changed was the ability to find and fix failures systematically, which built the kind of reliability that earns user trust.
The alternative is not "more careful manual review." The alternative is treating AI evaluation with the same rigor that software engineering applies to testing.
The eval-driven approach:

1. Build a golden dataset of representative queries with known-good answers.
2. Define quantitative metrics, such as faithfulness and relevance, and automate their computation.
3. Record baseline scores before every change.
4. Gate deploys on eval results and block on regressions.
5. Put the scores on a dashboard and alert when they drop.
This is not overhead. This is how you build systems that people actually trust enough to use repeatedly.
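One concrete form of "ship if it passes the gate" is a threshold check over the metrics you already track. A minimal sketch, with made-up metric names and thresholds:

```python
# Illustrative thresholds; every team sets its own per metric
GATES = {"faithfulness": 0.85, "answer_relevance": 0.80}

def passes_gate(scores: dict[str, float], gates: dict[str, float] = GATES) -> bool:
    """Allow a deploy only if every gated metric meets its threshold."""
    return all(
        scores.get(metric, 0.0) >= threshold  # a missing metric fails the gate
        for metric, threshold in gates.items()
    )

print(passes_gate({"faithfulness": 0.92, "answer_relevance": 0.86}))  # True
print(passes_gate({"faithfulness": 0.79, "answer_relevance": 0.86}))  # False
```

Wired into CI, this turns "let's ship and see" into a pass/fail check that runs on every change, the same way unit tests do.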
| Vibe Check Mindset | Eval Engineering Mindset |
|---|---|
| "It looks good to me" | "It scores 0.92 on faithfulness" |
| "Let's ship and see" | "Let's ship if it passes the gate" |
| "Users will tell us if it's broken" | "We will know before users do" |
| Quality is an opinion | Quality is a measurement |
| Testing happens once | Testing is continuous |
Before you build an eval suite, you need to know how exposed you are right now. Run this diagnostic against your own AI system. It takes 30 minutes and produces a score that tells you how urgently you need the rest of this course.
```typescript
interface VibeCheckAudit {
  systemName: string;
  auditDate: string;
  questions: {
    // Coverage
    estimatedQueryPatterns: number;
    manuallyTestedPatterns: number;
    coveragePercent: number;
    // Measurement
    hasAutomatedEvals: boolean;
    hasGoldenDataset: boolean;
    goldenDatasetSize: number;
    hasDefinedMetrics: boolean;
    metricsTracked: string[];
    // Regression
    hasBaselineScores: boolean;
    hasRegressionTests: boolean;
    lastEvalRunDate: string | null;
    deploysBlockedByEvals: boolean;
    // Visibility
    hasDashboard: boolean;
    stakeholdersCanSeeMetrics: boolean;
    alertsOnRegression: boolean;
  };
}

function computeReadinessScore(audit: VibeCheckAudit): {
  score: number;
  grade: string;
  priority: string;
} {
  let score = 0;
  const q = audit.questions;

  // Coverage (0-25 points)
  score += Math.min(25, q.coveragePercent / 4);

  // Measurement (0-30 points)
  if (q.hasAutomatedEvals) score += 10;
  if (q.hasGoldenDataset) score += 5;
  if (q.goldenDatasetSize >= 50) score += 5;
  if (q.hasDefinedMetrics) score += 5;
  if (q.metricsTracked.length >= 3) score += 5;

  // Regression (0-25 points)
  if (q.hasBaselineScores) score += 8;
  if (q.hasRegressionTests) score += 9;
  if (q.deploysBlockedByEvals) score += 8;

  // Visibility (0-20 points)
  if (q.hasDashboard) score += 8;
  if (q.stakeholdersCanSeeMetrics) score += 6;
  if (q.alertsOnRegression) score += 6;

  const grade =
    score >= 80 ? 'A: Eval-driven' :
    score >= 60 ? 'B: Partially measured' :
    score >= 30 ? 'C: Mostly vibes' :
    'D: Flying blind';

  const priority =
    score >= 80 ? 'Refine and expand existing evals' :
    score >= 60 ? 'Automate and add regression gates' :
    score >= 30 ? 'Build golden dataset and basic evals immediately' :
    'Stop shipping until you have measurement in place';

  return { score, grade, priority };
}
```
Run this against your system. Write down the number. When you finish this course and run the audit again, you will have a concrete measure of progress.
In the next lesson, you will design your first eval suite from scratch, starting with the golden dataset that makes everything else possible. You will take the readiness score from your audit and build the specific components that fill the gaps.