You have eval metrics. You have regression tests in CI. But if those numbers live in JSON files and CI logs, they are invisible to the people who decide whether your AI system gets more investment or gets shut down. A confidence dashboard turns your eval data into a story that stakeholders can read in thirty seconds. In this lesson, I will show you how to build one.
I deliberately call this a confidence dashboard rather than a performance dashboard. Performance implies speed. Confidence implies trust. What your stakeholders need to know is not "how fast is the AI" but "how much should I trust the AI."
The metrics that build confidence map onto four dimensions: quality trend (is it getting better?), quality by category (where is it weak?), eval coverage (how much is tested?), and cost (what does that reliability cost?).
When these four dimensions are visible and trending in the right direction, adoption follows naturally. When I built this kind of visibility into my production systems, it was a direct contributor to the reliability that drove a 482% lift in impressions. The outputs were already good. The dashboard proved they were good, and that proof gave stakeholders the confidence to promote the feature more aggressively.
A confidence dashboard has four panels. Each answers one question.
### Panel 1: Quality Over Time

Question: "Is our AI getting better or worse?"
This is a time-series chart showing your core metrics (faithfulness, relevance, factuality) over the last 30-90 days.
```ts
// types/eval.ts
interface EvalResult {
  timestamp: string;
  promptVersion: string;
  modelId: string;
  metrics: {
    faithfulness: number;
    relevance: number;
    factuality: number;
  };
  passRate: number;
  totalCases: number;
}

interface DashboardData {
  history: EvalResult[];
  currentBaseline: EvalResult;
  regressionThreshold: number;
}
```
What to show:

- Faithfulness, relevance, and factuality as separate lines over the selected window
- The regression threshold as a reference band so dips are readable at a glance
- Annotations for prompt, model, and data changes so every dip has context
What to watch for: A slow downward trend that never triggers a single regression alert but accumulates to a meaningful drop over weeks. This is the drift that dashboards catch and CI misses.
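One way to catch that slow drift is to compare window averages over the run history. Here is a minimal sketch; the window size and tolerated drop are illustrative values, not recommendations:

```typescript
interface RunPoint {
  timestamp: string;
  faithfulness: number;
}

// Compare the average of the most recent N runs against the N runs
// before them. A drop bigger than maxDrop flags drift even when no
// single run tripped the per-run regression alert.
function detectDrift(
  history: RunPoint[],
  window = 7,     // runs per comparison window (illustrative)
  maxDrop = 0.03  // tolerated drop between windows (illustrative)
): boolean {
  if (history.length < window * 2) return false;
  const avg = (points: RunPoint[]) =>
    points.reduce((sum, p) => sum + p.faithfulness, 0) / points.length;
  const older = avg(history.slice(-window * 2, -window));
  const recent = avg(history.slice(-window));
  return older - recent > maxDrop;
}
```

Run this on the same history the chart reads, and surface the result as a badge on the panel.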
### Panel 2: Latest Eval Breakdown

Question: "Where specifically is the system strong and weak?"
This is a breakdown of the most recent eval run, sliced by category.
```ts
interface CategoryBreakdown {
  category: string; // e.g., "pricing", "technical", "policy"
  caseCount: number;
  avgFaithfulness: number;
  avgRelevance: number;
  avgFactuality: number;
  passRate: number;
  worstCase?: {
    input: string;
    output: string;
    score: number;
    failureReason: string;
  };
}
```
What to show:

- One row per category with case count, pass rate, and average scores
- Rows sorted worst-first so problem categories surface immediately
- The worst-scoring case per category, expandable to its input, output, and failure reason
This panel is where I spend most of my debugging time. When faithfulness drops, I look here to see which category dropped. A system-wide dip is a model issue. A category-specific dip is a data or prompt issue.
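That triage rule can be encoded directly. A sketch, with an illustrative drop threshold and a simple majority heuristic:

```typescript
interface CategoryDelta {
  category: string;
  previous: number; // avg faithfulness on the prior run
  current: number;  // avg faithfulness on the latest run
}

// If most categories dropped, suspect the model or a shared prompt;
// if only a few did, suspect category-specific data or prompts.
function classifyDip(
  deltas: CategoryDelta[],
  dropThreshold = 0.02 // minimum drop that counts (illustrative)
): 'system-wide' | 'category-specific' | 'no-dip' {
  const dropped = deltas.filter(
    (d) => d.previous - d.current > dropThreshold
  );
  if (dropped.length === 0) return 'no-dip';
  return dropped.length > deltas.length / 2
    ? 'system-wide'
    : 'category-specific';
}
```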
### Panel 3: Eval Coverage

Question: "How much of our system is tested?"
```ts
interface CoverageData {
  totalQueryPatterns: number;    // Estimated from production logs
  coveredByEvals: number;        // Patterns with at least one test case
  coveragePercent: number;
  uncoveredCategories: string[]; // Categories with no test cases
  staleCases: number;            // Cases not updated in 90+ days
}
```
What to show:

- Overall coverage percent of observed query patterns
- Categories with zero test cases, listed by name
- The count of stale cases (not updated in 90+ days)
This is the panel that keeps you honest. A passing eval suite with 10% coverage is a false sense of security. I aim for 70%+ coverage of observed query patterns.
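Computing the coverage numbers can be as simple as a set difference between observed and tested patterns. A sketch; how you extract patterns from production logs (e.g., clustering queries) is out of scope here:

```typescript
// Coverage = share of observed query patterns that have at least one
// eval case. "Patterns" are whatever clustering of production queries
// you already use; this function only does the bookkeeping.
function computeCoverage(
  observedPatterns: string[],
  testedPatterns: Set<string>
): { coveragePercent: number; uncovered: string[] } {
  const uncovered = observedPatterns.filter((p) => !testedPatterns.has(p));
  const covered = observedPatterns.length - uncovered.length;
  const coveragePercent =
    observedPatterns.length === 0
      ? 0
      : Math.round((covered / observedPatterns.length) * 100);
  return { coveragePercent, uncovered };
}
```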
### Panel 4: Cost and Latency

Question: "Is reliability costing us too much?"
```ts
interface CostMetrics {
  avgCostPerQuery: number;
  avgLatencyMs: number;
  p95LatencyMs: number;
  evalCostPerRun: number; // What the eval suite itself costs
  costTrend: 'increasing' | 'stable' | 'decreasing';
}
```
What to show:

- Average cost per query and its trend direction
- Average and p95 latency
- What the eval suite itself costs per run
This panel matters because reliability cannot come at infinite cost. If your eval suite costs $50 per run and you run it 20 times a day, that is $1,000/day in eval costs alone. I track this so I can make informed trade-offs between eval depth and budget.
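The arithmetic is trivial, but automating it keeps the trade-off visible on the panel. A tiny sketch; the budget ceiling is whatever your team agrees on, not a recommended value:

```typescript
// Daily eval spend = cost per run x runs per day, checked against an
// agreed daily budget (the budget value itself is illustrative).
function checkEvalBudget(
  costPerRun: number,
  runsPerDay: number,
  dailyBudget: number
): { dailyCost: number; withinBudget: boolean } {
  const dailyCost = costPerRun * runsPerDay;
  return { dailyCost, withinBudget: dailyCost <= dailyBudget };
}
```

With the numbers above, `checkEvalBudget(50, 20, 500)` reports a $1,000 daily cost, double a hypothetical $500 budget.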
The dashboard is only as good as the data feeding it. Here is the pipeline I use.
```
[CI Eval Run] -> [Results JSON] -> [Storage] -> [Dashboard API] -> [UI]
```
Before storing anything, you need tables. Here is the Supabase migration.
```sql
-- supabase/migrations/create_eval_tables.sql
create table eval_runs (
  run_id uuid primary key default gen_random_uuid(),
  timestamp timestamptz not null default now(),
  prompt_version text not null,
  model_id text not null,
  faithfulness numeric(4,3) not null,
  relevance numeric(4,3) not null,
  factuality numeric(4,3) not null,
  pass_rate numeric(4,3) not null,
  total_cases integer not null,
  avg_latency_ms integer,
  avg_cost_per_query numeric(8,6),
  eval_cost numeric(8,4),
  raw_results jsonb,
  created_at timestamptz default now()
);

create table eval_category_breakdowns (
  id uuid primary key default gen_random_uuid(),
  run_id uuid references eval_runs(run_id),
  category text not null,
  case_count integer not null,
  avg_faithfulness numeric(4,3),
  avg_relevance numeric(4,3),
  avg_factuality numeric(4,3),
  pass_rate numeric(4,3),
  worst_case_input text,
  worst_case_output text,
  worst_case_score numeric(4,3),
  failure_reason text,
  timestamp timestamptz not null default now()
);

-- Indexes for time-series queries on the dashboard
create index idx_eval_runs_timestamp on eval_runs(timestamp desc);
create index idx_eval_breakdowns_run on eval_category_breakdowns(run_id);

-- RLS: only the service role can write; authenticated users can read
alter table eval_runs enable row level security;
alter table eval_category_breakdowns enable row level security;

create policy "Service role can manage eval_runs"
  on eval_runs for all
  using (auth.role() = 'service_role');

create policy "Authenticated users can read eval_runs"
  on eval_runs for select
  using (auth.role() = 'authenticated');

create policy "Service role can manage breakdowns"
  on eval_category_breakdowns for all
  using (auth.role() = 'service_role');

create policy "Authenticated users can read breakdowns"
  on eval_category_breakdowns for select
  using (auth.role() = 'authenticated');
```
```ts
// scripts/store-eval-results.ts
import { createClient } from '@supabase/supabase-js';

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_KEY!
);

interface EvalRunRecord {
  run_id: string;
  timestamp: string;
  prompt_version: string;
  model_id: string;
  faithfulness: number;
  relevance: number;
  factuality: number;
  pass_rate: number;
  total_cases: number;
  avg_latency_ms: number;
  avg_cost_per_query: number;
  eval_cost: number;
  raw_results: Record<string, unknown>;
}

async function storeEvalResults(record: EvalRunRecord) {
  const { error } = await supabase
    .from('eval_runs')
    .insert(record);

  if (error) throw new Error(`Failed to store eval: ${error.message}`);
}
```
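The camelCase `EvalResult` from earlier does not match the snake_case columns, so a small adapter helps. A sketch; the latency and cost arguments are assumed to arrive from the eval harness separately:

```typescript
interface EvalResult {
  timestamp: string;
  promptVersion: string;
  modelId: string;
  metrics: { faithfulness: number; relevance: number; factuality: number };
  passRate: number;
  totalCases: number;
}

// Map the harness's camelCase result to a snake_case eval_runs row.
// run_id is omitted so the table's gen_random_uuid() default applies.
function toEvalRunRow(
  r: EvalResult,
  avgLatencyMs: number,
  avgCostPerQuery: number,
  evalCost: number
) {
  return {
    timestamp: r.timestamp,
    prompt_version: r.promptVersion,
    model_id: r.modelId,
    faithfulness: r.metrics.faithfulness,
    relevance: r.metrics.relevance,
    factuality: r.metrics.factuality,
    pass_rate: r.passRate,
    total_cases: r.totalCases,
    avg_latency_ms: avgLatencyMs,
    avg_cost_per_query: avgCostPerQuery,
    eval_cost: evalCost,
    raw_results: r as unknown as Record<string, unknown>,
  };
}
```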
```ts
// app/api/eval-dashboard/route.ts
import { createClient } from '@supabase/supabase-js';
import { NextResponse } from 'next/server';

export async function GET(request: Request) {
  const { searchParams } = new URL(request.url);
  const days = parseInt(searchParams.get('days') ?? '30', 10);

  const supabase = createClient(
    process.env.SUPABASE_URL!,
    process.env.SUPABASE_SERVICE_KEY!
  );

  const since = new Date();
  since.setDate(since.getDate() - days);

  const { data: history } = await supabase
    .from('eval_runs')
    .select('*')
    .gte('timestamp', since.toISOString())
    .order('timestamp', { ascending: true });

  // Most recent breakdown rows; assumes at most 20 categories per run
  const { data: latest } = await supabase
    .from('eval_category_breakdowns')
    .select('*')
    .order('timestamp', { ascending: false })
    .limit(20);

  return NextResponse.json({
    history: history ?? [],
    latestBreakdown: latest ?? [],
    regressionThreshold: 0.05,
  });
}
```
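On the client, the route's `history` rows need only light shaping before they reach the chart library. A sketch; the `date` key name is my choice, not a Recharts requirement:

```typescript
interface HistoryRow {
  timestamp: string;
  faithfulness: number;
  relevance: number;
  factuality: number;
}

// Shape API rows into one object per point; chart libraries like
// Recharts can then pick each metric out by key.
function toChartSeries(history: HistoryRow[]) {
  return history.map((r) => ({
    date: r.timestamp.slice(0, 10), // YYYY-MM-DD for the x-axis
    faithfulness: r.faithfulness,
    relevance: r.relevance,
    factuality: r.factuality,
  }));
}
```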
Do not rely on people checking the dashboard. Set up alerts.
```ts
async function checkAndAlert(
  latestRun: EvalRunRecord,
  baseline: EvalRunRecord
) {
  const checks = [
    {
      metric: 'faithfulness',
      current: latestRun.faithfulness,
      baseline: baseline.faithfulness,
    },
    {
      metric: 'relevance',
      current: latestRun.relevance,
      baseline: baseline.relevance,
    },
    {
      metric: 'factuality',
      current: latestRun.factuality,
      baseline: baseline.factuality,
    },
  ];

  // A 5% relative drop matches the 0.05 regressionThreshold above
  const regressions = checks.filter(
    (c) => c.current < c.baseline * 0.95
  );

  if (regressions.length > 0) {
    await sendSlackAlert({
      channel: '#ai-quality',
      text: `Eval regression detected:\n${regressions
        .map(
          (r) =>
            `- ${r.metric}: ${r.current.toFixed(3)} ` +
            `(baseline: ${r.baseline.toFixed(3)})`
        )
        .join('\n')}`,
    });
  }
}
```
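`sendSlackAlert` is not defined above. One minimal version, assuming a Slack incoming-webhook URL in the `SLACK_WEBHOOK` environment variable, with the payload builder split out so it can be tested without a network call:

```typescript
// Build the webhook payload separately so it is unit-testable.
function buildSlackPayload(channel: string, text: string): string {
  return JSON.stringify({ channel, text });
}

// Post to a Slack incoming webhook. Incoming webhooks are bound to a
// channel when created, so the channel field may be ignored by Slack.
async function sendSlackAlert(msg: { channel: string; text: string }) {
  const url = process.env.SLACK_WEBHOOK;
  if (!url) throw new Error('SLACK_WEBHOOK is not set');
  const res = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: buildSlackPayload(msg.channel, msg.text),
  });
  if (!res.ok) throw new Error(`Slack webhook failed: ${res.status}`);
}
```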
I have shown confidence dashboards to engineering managers, product leads, and executives. Here is what each group cares about:
| Stakeholder | Primary Panel | What They Want to Know |
|---|---|---|
| Engineers | Latest Eval Breakdown | "What broke and where?" |
| Product Managers | Quality Over Time | "Is the feature ready to promote?" |
| Executives | Quality + Cost | "Is this worth the investment?" |
Design for the product manager. They are the ones who decide whether to put the AI feature in front of more users. If they can see that quality is high and trending stable, they will push for wider rollout. That is how reliability drives adoption.
Anti-pattern 1: Dashboard without alerts. Nobody checks dashboards proactively. If a regression happens and nobody is notified, the dashboard is decoration.
Anti-pattern 2: Too many metrics. If you show 20 numbers, nobody reads any of them. Four panels, four questions. That is enough.
Anti-pattern 3: No annotations. A quality dip without context is just a scary line. Annotate model changes, prompt updates, and data refreshes so the team can correlate cause and effect.
Anti-pattern 4: Stale baselines. If your baseline is from six months ago and the system has improved significantly, every run looks green. Re-baseline regularly.
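That re-baselining rule can also be automated. A sketch, where the maximum baseline age is an illustrative value, not a recommendation:

```typescript
interface BaselineSnapshot {
  timestamp: string; // ISO date of the run
  faithfulness: number;
  relevance: number;
  factuality: number;
}

// Promote the latest run to baseline when the baseline has aged past
// the limit and the latest run is at least as good on every metric.
function shouldRebaseline(
  baseline: BaselineSnapshot,
  latest: BaselineSnapshot,
  maxBaselineAgeDays = 30 // illustrative
): boolean {
  const ageDays =
    (Date.parse(latest.timestamp) - Date.parse(baseline.timestamp)) /
    86_400_000;
  const noMetricRegressed =
    latest.faithfulness >= baseline.faithfulness &&
    latest.relevance >= baseline.relevance &&
    latest.factuality >= baseline.factuality;
  return ageDays > maxBaselineAgeDays && noMetricRegressed;
}
```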
Here is the concrete checklist. Use Recharts, Chart.js, or Tremor for visualization -- any of them work with Next.js.
1. Run the migration to create the `eval_runs` and `eval_category_breakdowns` tables.
2. After each `deepeval test run` or `promptfoo eval`, run `ts-node scripts/store-eval-results.ts` to push results to Supabase.
3. Build the dashboard API at `app/api/eval-dashboard/route.ts` using the code above.
4. Build the dashboard page at `/dashboard/evals`: four panels, one question each, with the quality trend rendered as a time-series chart (e.g., a Recharts `LineChart`).
5. Wire up the `checkAndAlert` function and trigger it after every CI eval run:

```yaml
# Add this step to your .github/workflows/llm-eval.yml
- name: Store results and check alerts
  env:
    SUPABASE_URL: ${{ secrets.SUPABASE_URL }}
    SUPABASE_SERVICE_KEY: ${{ secrets.SUPABASE_SERVICE_KEY }}
    SLACK_WEBHOOK: ${{ secrets.SLACK_EVAL_WEBHOOK }}
  run: |
    npx ts-node scripts/store-eval-results.ts \
      --results results.json \
      --baseline eval/baseline.json
```
The first time you show this dashboard in a sprint review, the conversation about your AI feature will change.
You have metrics, automated regression tests, and a dashboard. Next, we connect everything into the reliability flywheel -- the system-level loop that turns eval discipline into compounding adoption and shows you how to keep the whole system running quarter after quarter.