You have eval metrics. You have regression tests in CI. But if those numbers live in JSON files and CI logs, they are invisible to the people who decide whether your AI system gets more investment or gets shut down. A confidence dashboard turns your eval data into a story that stakeholders can read in thirty seconds. In this lesson, I will show you how to build one.
I deliberately call this a confidence dashboard rather than a performance dashboard. Performance implies speed. Confidence implies trust. What your stakeholders need to know is not "how fast is the AI" but "how much should I trust the AI."
The metrics that build confidence map onto four dimensions: quality trend (is it getting better?), quality by category (where is it weak?), eval coverage (how much is tested?), and cost (what does that reliability cost?).
When these four dimensions are visible and trending in the right direction, adoption follows naturally. When I built this kind of visibility into my production systems, it was a direct contributor to the reliability that drove a 482% lift in impressions. The outputs were already good. The dashboard proved they were good, and that proof gave stakeholders the confidence to promote the feature more aggressively.
A confidence dashboard has four panels. Each answers one question.
### Panel 1: Quality Over Time

Question: "Is our AI getting better or worse?"
This is a time-series chart showing your core metrics (faithfulness, relevance, factuality) over the last 30-90 days.
```ts
// types/eval.ts
interface EvalResult {
  timestamp: string;
  promptVersion: string;
  modelId: string;
  metrics: {
    faithfulness: number;
    relevance: number;
    factuality: number;
  };
  passRate: number;
  totalCases: number;
}

interface DashboardData {
  history: EvalResult[];
  currentBaseline: EvalResult;
  regressionThreshold: number;
}
```
What to show:

- Faithfulness, relevance, and factuality as separate lines over the selected window
- The regression threshold as a reference band so dips are readable at a glance
- Annotations for prompt, model, and data changes so every dip has context
What to watch for: A slow downward trend that never triggers a single regression alert but accumulates to a meaningful drop over weeks. This is the drift that dashboards catch and CI misses.
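One way to catch that slow drift is to compare window averages over the run history. Here is a minimal sketch; the window size and tolerated drop are illustrative values, not recommendations:

```typescript
interface RunPoint {
  timestamp: string;
  faithfulness: number;
}

// Compare the average of the most recent N runs against the N runs
// before them. A drop bigger than maxDrop flags drift even when no
// single run tripped the per-run regression alert.
function detectDrift(
  history: RunPoint[],
  window = 7,     // runs per comparison window (illustrative)
  maxDrop = 0.03  // tolerated drop between windows (illustrative)
): boolean {
  if (history.length < window * 2) return false;
  const avg = (points: RunPoint[]) =>
    points.reduce((sum, p) => sum + p.faithfulness, 0) / points.length;
  const older = avg(history.slice(-window * 2, -window));
  const recent = avg(history.slice(-window));
  return older - recent > maxDrop;
}
```

Run this on the same history the chart reads, and surface the result as a badge on the panel.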
### Panel 2: Latest Eval Breakdown

Question: "Where specifically is the system strong and weak?"
This is a breakdown of the most recent eval run, sliced by category.
```ts
interface CategoryBreakdown {
  category: string; // e.g., "pricing", "technical", "policy"
  caseCount: number;
  avgFaithfulness: number;
  avgRelevance: number;
  avgFactuality: number;
  passRate: number;
  worstCase?: {
    input: string;
    output: string;
    score: number;
    failureReason: string;
  };
}
```
What to show:

- One row per category with case count, pass rate, and average scores
- Rows sorted worst-first so problem categories surface immediately
- The worst-scoring case per category, expandable to its input, output, and failure reason
This panel is where I spend most of my debugging time. When faithfulness drops, I look here to see which category dropped. A system-wide dip is a model issue. A category-specific dip is a data or prompt issue.
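That triage rule can be encoded directly. A sketch, with an illustrative drop threshold and a simple majority heuristic:

```typescript
interface CategoryDelta {
  category: string;
  previous: number; // avg faithfulness on the prior run
  current: number;  // avg faithfulness on the latest run
}

// If most categories dropped, suspect the model or a shared prompt;
// if only a few did, suspect category-specific data or prompts.
function classifyDip(
  deltas: CategoryDelta[],
  dropThreshold = 0.02 // minimum drop that counts (illustrative)
): 'system-wide' | 'category-specific' | 'no-dip' {
  const dropped = deltas.filter(
    (d) => d.previous - d.current > dropThreshold
  );
  if (dropped.length === 0) return 'no-dip';
  return dropped.length > deltas.length / 2
    ? 'system-wide'
    : 'category-specific';
}
```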
### Panel 3: Eval Coverage

Question: "How much of our system is tested?"
```ts
interface CoverageData {
  totalQueryPatterns: number;    // Estimated from production logs
  coveredByEvals: number;        // Patterns with at least one test case
  coveragePercent: number;
  uncoveredCategories: string[]; // Categories with no test cases
  staleCases: number;            // Cases not updated in 90+ days
}
```
What to show:

- Overall coverage percent of observed query patterns
- Categories with zero test cases, listed by name
- The count of stale cases (not updated in 90+ days)
This is the panel that keeps you honest. A passing eval suite with 10% coverage is a false sense of security. I aim for 70%+ coverage of observed query patterns.
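Computing the coverage numbers can be as simple as a set difference between observed and tested patterns. A sketch; how you extract patterns from production logs (e.g., clustering queries) is out of scope here:

```typescript
// Coverage = share of observed query patterns that have at least one
// eval case. "Patterns" are whatever clustering of production queries
// you already use; this function only does the bookkeeping.
function computeCoverage(
  observedPatterns: string[],
  testedPatterns: Set<string>
): { coveragePercent: number; uncovered: string[] } {
  const uncovered = observedPatterns.filter((p) => !testedPatterns.has(p));
  const covered = observedPatterns.length - uncovered.length;
  const coveragePercent =
    observedPatterns.length === 0
      ? 0
      : Math.round((covered / observedPatterns.length) * 100);
  return { coveragePercent, uncovered };
}
```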
### Panel 4: Cost and Latency

Question: "Is reliability costing us too much?"
```ts
interface CostMetrics {
  avgCostPerQuery: number;
  avgLatencyMs: number;
  p95LatencyMs: number;
  evalCostPerRun: number; // What the eval suite itself costs
  costTrend: 'increasing' | 'stable' | 'decreasing';
}
```
What to show:

- Average cost per query and its trend direction
- Average and p95 latency
- What the eval suite itself costs per run
This panel matters because reliability cannot come at infinite cost. If your eval suite costs $50 per run and you run it 20 times a day, that is $1,000/day in eval costs alone. I track this so I can make informed trade-offs between eval depth and budget.
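The arithmetic is trivial, but automating it keeps the trade-off visible on the panel. A tiny sketch; the budget ceiling is whatever your team agrees on, not a recommended value:

```typescript
// Daily eval spend = cost per run x runs per day, checked against an
// agreed daily budget (the budget value itself is illustrative).
function checkEvalBudget(
  costPerRun: number,
  runsPerDay: number,
  dailyBudget: number
): { dailyCost: number; withinBudget: boolean } {
  const dailyCost = costPerRun * runsPerDay;
  return { dailyCost, withinBudget: dailyCost <= dailyBudget };
}
```

With the numbers above, `checkEvalBudget(50, 20, 500)` reports a $1,000 daily cost, double a hypothetical $500 budget.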
The dashboard is only as good as the data feeding it. Here is the pipeline I use.
```
[CI Eval Run] -> [Results JSON] -> [Storage] -> [Dashboard API] -> [UI]
```
Before storing anything, you need tables. Here is the Supabase migration.
```sql
-- supabase/migrations/create_eval_tables.sql
create table eval_runs (
  run_id uuid primary key default gen_random_uuid(),
  timestamp timestamptz not null default now(),
  prompt_version text not null,
  model_id text not null,
  faithfulness numeric(4,3) not null,
  relevance numeric(4,3) not null,
  factuality numeric(4,3) not null,
  pass_rate numeric(4,3) not null,
  total_cases integer not null,
  avg_latency_ms integer,
  avg_cost_per_query numeric(8,6),
  eval_cost numeric(8,4),
  raw_results jsonb,
  created_at timestamptz default now()
);

create table eval_category_breakdowns (
  id uuid primary key default gen_random_uuid(),
  run_id uuid references eval_runs(run_id),
  category text not null,
  case_count integer not null,
  avg_faithfulness numeric(4,3),
  avg_relevance numeric(4,3),
  avg_factuality numeric(4,3),
  pass_rate numeric(4,3),
  worst_case_input text,
  worst_case_output text,
  worst_case_score numeric(4,3),
  failure_reason text,
  timestamp timestamptz not null default now()
);

-- Indexes for time-series queries on the dashboard
create index idx_eval_runs_timestamp on eval_runs(timestamp desc);
create index idx_eval_breakdowns_run on eval_category_breakdowns(run_id);

-- RLS: only the service role can write; authenticated users can read
alter table eval_runs enable row level security;
alter table eval_category_breakdowns enable row level security;

create policy "Service role can manage eval_runs"
  on eval_runs for all
  using (auth.role() = 'service_role');

create policy "Authenticated users can read eval_runs"
  on eval_runs for select
  using (auth.role() = 'authenticated');

create policy "Service role can manage breakdowns"
  on eval_category_breakdowns for all
  using (auth.role() = 'service_role');

create policy "Authenticated users can read breakdowns"
  on eval_category_breakdowns for select
  using (auth.role() = 'authenticated');
```
```ts
// scripts/store-eval-results.ts
import { createClient } from '@supabase/supabase-js';

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_KEY!
);

interface EvalRunRecord {
  run_id: string;
  timestamp: string;
  prompt_version: string;
  model_id: string;
  faithfulness: number;
  relevance: number;
  factuality: number;
  pass_rate: number;
  total_cases: number;
  avg_latency_ms: number;
  avg_cost_per_query: number;
  eval_cost: number;
  raw_results: Record<string, unknown>;
}

async function storeEvalResults(record: EvalRunRecord) {
  const { error } = await supabase
    .from('eval_runs')
    .insert(record);

  if (error) throw new Error(`Failed to store eval: ${error.message}`);
}
```
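The camelCase `EvalResult` from earlier does not match the snake_case columns, so a small adapter helps. A sketch; the latency and cost arguments are assumed to arrive from the eval harness separately:

```typescript
interface EvalResult {
  timestamp: string;
  promptVersion: string;
  modelId: string;
  metrics: { faithfulness: number; relevance: number; factuality: number };
  passRate: number;
  totalCases: number;
}

// Map the harness's camelCase result to a snake_case eval_runs row.
// run_id is omitted so the table's gen_random_uuid() default applies.
function toEvalRunRow(
  r: EvalResult,
  avgLatencyMs: number,
  avgCostPerQuery: number,
  evalCost: number
) {
  return {
    timestamp: r.timestamp,
    prompt_version: r.promptVersion,
    model_id: r.modelId,
    faithfulness: r.metrics.faithfulness,
    relevance: r.metrics.relevance,
    factuality: r.metrics.factuality,
    pass_rate: r.passRate,
    total_cases: r.totalCases,
    avg_latency_ms: avgLatencyMs,
    avg_cost_per_query: avgCostPerQuery,
    eval_cost: evalCost,
    raw_results: r as unknown as Record<string, unknown>,
  };
}
```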
```ts
// app/api/eval-dashboard/route.ts
import { createClient } from '@supabase/supabase-js';
import { NextResponse } from 'next/server';

export async function GET(request: Request) {
  const { searchParams } = new URL(request.url);
  const days = parseInt(searchParams.get('days') ?? '30', 10);

  const supabase = createClient(
    process.env.SUPABASE_URL!,
    process.env.SUPABASE_SERVICE_KEY!
  );

  const since = new Date();
  since.setDate(since.getDate() - days);

  const { data: history } = await supabase
    .from('eval_runs')
    .select('*')
    .gte('timestamp', since.toISOString())
    .order('timestamp', { ascending: true });

  // Most recent breakdown rows; assumes at most 20 categories per run
  const { data: latest } = await supabase
    .from('eval_category_breakdowns')
    .select('*')
    .order('timestamp', { ascending: false })
    .limit(20);

  return NextResponse.json({
    history: history ?? [],
    latestBreakdown: latest ?? [],
    regressionThreshold: 0.05,
  });
}
```
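On the client, the route's `history` rows need only light shaping before they reach the chart library. A sketch; the `date` key name is my choice, not a Recharts requirement:

```typescript
interface HistoryRow {
  timestamp: string;
  faithfulness: number;
  relevance: number;
  factuality: number;
}

// Shape API rows into one object per point; chart libraries like
// Recharts can then pick each metric out by key.
function toChartSeries(history: HistoryRow[]) {
  return history.map((r) => ({
    date: r.timestamp.slice(0, 10), // YYYY-MM-DD for the x-axis
    faithfulness: r.faithfulness,
    relevance: r.relevance,
    factuality: r.factuality,
  }));
}
```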
Do not rely on people checking the dashboard. Set up alerts.
```ts
async function checkAndAlert(
  latestRun: EvalRunRecord,
  baseline: EvalRunRecord
) {
  const checks = [
    {
      metric: 'faithfulness',
      current: latestRun.faithfulness,
      baseline: baseline.faithfulness,
    },
    {
      metric: 'relevance',
      current: latestRun.relevance,
      baseline: baseline.relevance,
    },
    {
      metric: 'factuality',
      current: latestRun.factuality,
      baseline: baseline.factuality,
    },
  ];

  // A 5% relative drop matches the 0.05 regressionThreshold above
  const regressions = checks.filter(
    (c) => c.current < c.baseline * 0.95
  );

  if (regressions.length > 0) {
    await sendSlackAlert({
      channel: '#ai-quality',
      text: `Eval regression detected:\n${regressions
        .map(
          (r) =>
            `- ${r.metric}: ${r.current.toFixed(3)} ` +
            `(baseline: ${r.baseline.toFixed(3)})`
        )
        .join('\n')}`,
    });
  }
}
```
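`sendSlackAlert` is not defined above. One minimal version, assuming a Slack incoming-webhook URL in the `SLACK_WEBHOOK` environment variable, with the payload builder split out so it can be tested without a network call:

```typescript
// Build the webhook payload separately so it is unit-testable.
function buildSlackPayload(channel: string, text: string): string {
  return JSON.stringify({ channel, text });
}

// Post to a Slack incoming webhook. Incoming webhooks are bound to a
// channel when created, so the channel field may be ignored by Slack.
async function sendSlackAlert(msg: { channel: string; text: string }) {
  const url = process.env.SLACK_WEBHOOK;
  if (!url) throw new Error('SLACK_WEBHOOK is not set');
  const res = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: buildSlackPayload(msg.channel, msg.text),
  });
  if (!res.ok) throw new Error(`Slack webhook failed: ${res.status}`);
}
```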
I have shown confidence dashboards to engineering managers, product leads, and executives. Here is what each group cares about:
| Stakeholder | Primary Panel | What They Want to Know |
|---|---|---|
| Engineers | Latest Eval Breakdown | "What broke and where?" |
| Product Managers | Quality Over Time | "Is the feature ready to promote?" |
| Executives | Quality + Cost | "Is this worth the investment?" |
Design for the product manager. They are the ones who decide whether to put the AI feature in front of more users. If they can see that quality is high and trending stable, they will push for wider rollout. That is how reliability drives adoption.
Anti-pattern 1: Dashboard without alerts. Nobody checks dashboards proactively. If a regression happens and nobody is notified, the dashboard is decoration.
Anti-pattern 2: Too many metrics. If you show 20 numbers, nobody reads any of them. Four panels, four questions. That is enough.
Anti-pattern 3: No annotations. A quality dip without context is just a scary line. Annotate model changes, prompt updates, and data refreshes so the team can correlate cause and effect.
Anti-pattern 4: Stale baselines. If your baseline is from six months ago and the system has improved significantly, every run looks green. Re-baseline regularly.
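That re-baselining rule can also be automated. A sketch, where the maximum baseline age is an illustrative value, not a recommendation:

```typescript
interface BaselineSnapshot {
  timestamp: string; // ISO date of the run
  faithfulness: number;
  relevance: number;
  factuality: number;
}

// Promote the latest run to baseline when the baseline has aged past
// the limit and the latest run is at least as good on every metric.
function shouldRebaseline(
  baseline: BaselineSnapshot,
  latest: BaselineSnapshot,
  maxBaselineAgeDays = 30 // illustrative
): boolean {
  const ageDays =
    (Date.parse(latest.timestamp) - Date.parse(baseline.timestamp)) /
    86_400_000;
  const noMetricRegressed =
    latest.faithfulness >= baseline.faithfulness &&
    latest.relevance >= baseline.relevance &&
    latest.factuality >= baseline.factuality;
  return ageDays > maxBaselineAgeDays && noMetricRegressed;
}
```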
Here is the concrete checklist. Use Recharts, Chart.js, or Tremor for visualization -- any of them work with Next.js.
1. Run the migration to create the `eval_runs` and `eval_category_breakdowns` tables.
2. After each `deepeval test run` or `promptfoo eval`, run `ts-node scripts/store-eval-results.ts` to push results to Supabase.
3. Build the dashboard API at `app/api/eval-dashboard/route.ts` using the code above.
4. Build the dashboard page at `/dashboard/evals`: four panels, one question each, with the quality trend rendered as a time-series chart (e.g., a Recharts `LineChart`).
5. Wire up the `checkAndAlert` function and trigger it after every CI eval run:

```yaml
# Add this step to your .github/workflows/llm-eval.yml
- name: Store results and check alerts
  env:
    SUPABASE_URL: ${{ secrets.SUPABASE_URL }}
    SUPABASE_SERVICE_KEY: ${{ secrets.SUPABASE_SERVICE_KEY }}
    SLACK_WEBHOOK: ${{ secrets.SLACK_EVAL_WEBHOOK }}
  run: |
    npx ts-node scripts/store-eval-results.ts \
      --results results.json \
      --baseline eval/baseline.json
```
The first time you show this dashboard in a sprint review, the conversation about your AI feature will change.
You have metrics, automated regression tests, and a dashboard. Next, we connect everything into the reliability flywheel -- the system-level loop that turns eval discipline into compounding adoption and shows you how to keep the whole system running quarter after quarter.