Observability for AI Systems
Your Datadog dashboard is green. Every health check passes.
Response times are within SLA. And your AI feature is
producing confident, well-formatted, completely wrong answers
to 12% of user queries. Nobody knows.
This is the gap that traditional observability was never
designed to close. I spent six months building monitoring
for LLM-powered systems in production, and the biggest
lesson was this: the tools we have work fine for the
infrastructure layer. They are useless for the intelligence
layer.
Why Traditional APM Fails for AI
Application Performance Monitoring was designed for a world
where code is deterministic. You send a request, you get a
response, and the response is either correct or your code
has a bug. You can write assertions. You can diff outputs.
You can alert on error codes.
AI systems break every one of those assumptions.
Same input, different output. Send the same prompt to
GPT-4o twice and you might get two different answers. Both
could be correct. Both could be wrong. The 200 OK status
code tells you nothing about whether the response was
actually useful.
Latency is a distribution, not a number. A 50-token
prompt might return in 800ms. A 2,000-token prompt with
a complex system message might take 8 seconds. Both are
"working correctly." Your p50 latency is meaningless if
your p99 is 10x higher because of prompt length variation.
"Success" is subjective. When a user asks "What is our
refund policy?" and the LLM responds with a plausible but
outdated policy from 2023, is that a success? Your HTTP
status says yes. Your customer says no. There is no error
code for "technically fluent but factually wrong."
I tried bolting LLM monitoring onto our existing Datadog
setup. We got pretty charts that told us the API was up.
We had no idea if the answers were good.
The Three Pillars of AI Observability
I landed on three pillars that actually matter for
production AI systems. Not metrics, logs, and traces --
those are implementation details. The pillars are what
you are actually trying to observe.
1. Token Economics
Every LLM call has a cost, and that cost varies wildly
based on model selection, prompt length, and output length.
If you are not tracking this per-call, you are flying blind
on unit economics.
I log five numbers on every single LLM call:
interface LLMCallMetrics {
  model: string        // 'gpt-4o' | 'claude-sonnet-4-20250514'
  inputTokens: number  // prompt + system message
  outputTokens: number // completion length
  costUsd: number      // calculated from model pricing
  cachedTokens: number // prompt cache hits (if supported)
}
Why each field matters:
- model: You will have multiple models in production.
Cost-per-query means nothing without knowing which model
served it.
- inputTokens: This is your lever for cost control.
Prompt engineering is really token engineering.
- outputTokens: The part you cannot fully control. Set
max_tokens as a guardrail, not a target.
- costUsd: Pre-calculate this. Do not make your finance
team reverse-engineer it from the API bill.
- cachedTokens: Anthropic and OpenAI both support
prompt caching. If your cache hit rate is below 60% on
repeated system prompts, you are leaving money on the
table.
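Pre-calculating costUsd is a one-liner once you have a pricing table. A minimal sketch, assuming a hand-maintained per-million-token price map (the numbers below are illustrative placeholders, not real pricing -- load yours from config) and the cached-token discount that both providers apply:

```typescript
// Illustrative per-million-token prices; replace with your real pricing config.
const PRICING_PER_MTOK: Record<
  string,
  { input: number; output: number; cachedInput: number }
> = {
  "gpt-4o": { input: 2.5, output: 10, cachedInput: 1.25 },
};

function calcCostUsd(
  model: string,
  inputTokens: number,
  outputTokens: number,
  cachedTokens: number,
): number {
  const p = PRICING_PER_MTOK[model];
  if (!p) return 0; // unknown model: log zero and alert, don't crash the call path

  // Cached input tokens are billed at a discounted rate; the rest at full price.
  const uncachedInput = inputTokens - cachedTokens;
  return (
    (uncachedInput * p.input +
      cachedTokens * p.cachedInput +
      outputTokens * p.output) /
    1_000_000
  );
}
```

Computing this at log time, rather than in a reporting job, means every downstream query can just SUM(cost_usd).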
We track cost per feature, per user segment, and per model.
Our weekly cost report shows exactly where every dollar
goes. When one feature spiked 3x in cost overnight, we
caught it in the morning standup because the dashboard
flagged it -- not three weeks later on the invoice.
2. Quality Signals
This is the hard one. How do you measure whether an AI
response was "good"?
You cannot automate taste. But you can build proxies that
correlate with quality, and you can track them over time
to detect regressions.
Hallucination rate. If your system uses RAG, you can
measure whether the response is grounded in the retrieved
context. I use a lightweight classifier that checks whether
key claims in the response appear in the source documents.
It is not perfect, but it catches the obvious fabrications.
Our target: below 3% hallucination rate on factual queries.
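To make "lightweight classifier" concrete, here is a crude word-overlap version of the idea, a sketch rather than the real thing: treat each sentence as a claim and call it grounded if most of its words appear in the retrieved context. The 0.6 threshold is an assumption to tune; a trained classifier or an NLI model does much better, but even this catches the obvious fabrications.

```typescript
// Crude groundedness proxy: fraction of response sentences whose words
// mostly appear in the retrieved context. Threshold (0.6) is a tunable
// assumption, not a recommendation.
function groundednessScore(response: string, contextDocs: string[]): number {
  const contextWords = new Set(
    contextDocs.join(" ").toLowerCase().match(/[a-z0-9]+/g) ?? [],
  );
  const sentences = response
    .split(/(?<=[.!?])\s+/)
    .filter((s) => s.trim().length > 0);
  if (sentences.length === 0) return 1;

  let grounded = 0;
  for (const s of sentences) {
    const words = s.toLowerCase().match(/[a-z0-9]+/g) ?? [];
    if (words.length === 0) {
      grounded++;
      continue;
    }
    const hits = words.filter((w) => contextWords.has(w)).length;
    if (hits / words.length >= 0.6) grounded++;
  }
  return grounded / sentences.length;
}
```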
Citation accuracy. When the system cites a source, is
that source real? Does it actually say what the system
claims it says? We log the citation URL, the claimed quote,
and a boolean for whether the quote exists in the source.
This is cheap to verify and catches a common failure mode.
User satisfaction signals. Thumbs up/down on responses.
Whether the user reformulated their query (a signal they
did not get what they needed). Session abandonment rate
after an AI interaction. Time-to-next-action after
receiving a response. None of these are perfect. Together,
they paint a picture.
Response consistency. For the same query class, how much
does the response vary? High variance on factual questions
is a red flag. High variance on creative tasks might be
fine. I hash the semantic structure of responses (not the
exact text) and track variance by query category.
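A crude stand-in for that structure hash, under the assumption that stripping surface variation (case, punctuation, short words, word order) is enough to make paraphrases of the same answer collide while substantively different answers do not; an embedding-based similarity is the stronger version of the same idea:

```typescript
import { createHash } from "node:crypto";

// Normalize away surface variation, then hash what remains. Two
// rewordings of the same answer tend to map to the same hash; two
// different answers tend not to. Deliberately lossy.
function structureHash(response: string): string {
  const normalized = [
    ...new Set(
      (response.toLowerCase().match(/[a-z]+/g) ?? []).filter(
        (w) => w.length > 3, // drop short function words
      ),
    ),
  ]
    .sort()
    .join(" ");
  return createHash("sha256").update(normalized).digest("hex");
}
```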
We built a quality dashboard that shows these metrics on a
7-day rolling window. When hallucination rate crosses 5%,
we get a Slack alert. It has fired three times in six
months, and each time it caught a real regression --
usually a prompt change that inadvertently weakened
grounding instructions.
3. Latency Profiling
LLM latency is not one number. It is a stack of numbers,
and you need to decompose it to optimize it.
Total latency: 4,200ms
├── Embedding generation: 45ms
├── Vector search: 120ms
├── Reranking: 280ms
├── Context assembly: 15ms
├── LLM time-to-first-token: 890ms
├── LLM streaming: 2,800ms
└── Post-processing: 50ms
The breakdown matters because the optimization strategy
differs for each segment. Embedding latency? Cache it.
Vector search slow? Check your index configuration.
LLM streaming taking too long? Maybe your context window
is bloated and you need to trim retrieved chunks.
I track three latency percentiles per segment: p50, p95,
and p99. The p50 tells you about typical experience. The
p99 tells you about your worst-case users. In our system,
the p99 was 6x the p50 -- almost entirely because of
prompt length variation. We added a token budget that caps
context at 4,000 tokens and the p99 dropped by 40%.
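The token budget itself is simple: keep the highest-ranked retrieved chunks until the cap is hit. A minimal sketch, where countTokens is a placeholder estimate -- in production use your model's actual tokenizer (e.g. tiktoken) instead:

```typescript
// Rough token estimate: ~4 characters per token. Replace with a real
// tokenizer in production; this is a placeholder assumption.
const countTokens = (text: string): number => Math.ceil(text.length / 4);

// Greedily keep highest-ranked chunks (input is assumed sorted by rank)
// until adding the next chunk would exceed the budget.
function applyTokenBudget(rankedChunks: string[], budget = 4000): string[] {
  const kept: string[] = [];
  let used = 0;
  for (const chunk of rankedChunks) {
    const t = countTokens(chunk);
    if (used + t > budget) break;
    kept.push(chunk);
    used += t;
  }
  return kept;
}
```

Stopping at the first overflow, rather than skipping it and trying smaller chunks, keeps the context in rank order and the behavior predictable.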
What to Log on Every LLM Call
Here is the exact payload I attach to every LLM call in
production. Every field has earned its place through a
debugging session where I wished I had it.
interface LLMCallLog {
  // Identity
  traceId: string
  spanId: string
  parentSpanId: string | null
  feature: string       // 'search' | 'chat' | 'summary'

  // Request
  model: string
  promptHash: string    // SHA-256 of system + user prompt
  inputTokens: number
  temperature: number
  maxTokens: number

  // Response
  outputTokens: number
  responseHash: string  // SHA-256 of completion
  finishReason: string  // 'stop' | 'length' | 'tool_use'
  toolCalls: string[]   // names of tools invoked

  // Economics
  costUsd: number
  cachedTokens: number

  // Timing
  latencyMs: number
  ttftMs: number        // time to first token

  // Quality
  groundednessScore: number | null  // 0-1, RAG only
  userFeedback: 'positive' | 'negative' | null
}
The promptHash deserves special mention. I do not log
the full prompt -- that is a privacy and storage problem.
But I hash it so I can correlate quality regressions with
prompt changes. When hallucination rate spikes, I check
whether the prompt hash changed recently. It usually did.
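The hash is a few lines with Node's crypto module. One detail worth getting right: hash the system and user prompts with a separator between them, so moving text from one to the other produces a different hash:

```typescript
import { createHash } from "node:crypto";

// SHA-256 over system + user prompt. The separator ensures
// ("ab", "c") and ("a", "bc") hash differently.
function promptHash(systemPrompt: string, userPrompt: string): string {
  return createHash("sha256")
    .update(systemPrompt)
    .update("\n---\n")
    .update(userPrompt)
    .digest("hex");
}
```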
The finishReason field has saved me twice. Once when a
prompt change caused 30% of responses to hit the token
limit and get truncated mid-sentence. The dashboard showed
finish_reason: 'length' spiking from 2% to 30% and we
caught it within an hour.
Trace Architecture for Agent Flows
Single LLM calls are straightforward. Agent flows -- where
one LLM call triggers tool use, which triggers another LLM
call, which triggers more tool use -- are where
observability gets interesting.
I use OpenTelemetry spans with a parent-child hierarchy
that mirrors the agent's execution:
Agent Trace (parent span)
├── Planning Step (child span)
│   ├── LLM Call: "What tools do I need?" (leaf span)
│   └── Decision: use [search, calculator]
├── Tool: search (child span)
│   ├── Query formulation (leaf span)
│   ├── API call to search service (leaf span)
│   └── Result parsing (leaf span)
├── Tool: calculator (child span)
│   └── Computation (leaf span)
└── Synthesis Step (child span)
    ├── LLM Call: "Combine results" (leaf span)
    └── Response formatting (leaf span)
Each span carries the LLMCallLog fields when it involves
an LLM call. The parent span aggregates: total cost, total
latency, total tokens, number of LLM calls, number of tool
invocations.
This structure lets me answer questions that flat logs
cannot:
- "Which tool call is the bottleneck in this agent flow?"
- "What percentage of agent runs use more than 3 tool
calls?" (Ours: 15%. Those runs cost 4x more.)
- "When the agent retries a tool call, does the second
attempt succeed?" (Ours: 60% of the time. The other 40%
is wasted spend.)
The key implementation detail: propagate the trace context
through tool calls. If your search tool makes its own API
calls, those should be child spans of the tool span, not
orphaned traces. OpenTelemetry context propagation handles
this, but you have to wire it up deliberately.
Alert Design: What Is Worth Paging For
Not everything that is interesting is actionable. I learned
this the hard way after setting up 23 alerts and getting
paged for things that did not require immediate action.
Here is what survived the pruning:
Page-worthy (PagerDuty, immediate):
- Cost per hour exceeds 2x the 7-day rolling average.
This catches runaway loops, prompt injection attacks
that cause excessive token usage, and model upgrades
that silently increase per-token pricing.
- p99 latency exceeds 15 seconds. At this point, users
are abandoning. Something is broken.
- Error rate on LLM calls exceeds 5%. The provider is
having an outage or your API key is rate-limited.
Alert-worthy (Slack, business hours):
- Hallucination rate exceeds 5% on a 24-hour window.
Quality regression. Investigate prompt changes or
retrieval pipeline changes.
- finish_reason: 'length' exceeds 10%. Responses are
getting truncated. Either prompts got longer or
max_tokens needs adjustment.
- Cache hit rate drops below 40%. You are paying for
redundant computation.
Dashboard-only (review weekly):
- Cost per feature trends. Useful for capacity planning,
not for incident response.
- Model distribution. What percentage of calls go to
each model? Shifting patterns indicate routing changes.
- User feedback ratio. Trends matter, daily fluctuations
do not.
The cost anomaly alert is the most valuable one we have.
It fired at 2 AM on a Tuesday when a deployment introduced
a bug that caused the agent to loop indefinitely. By the
time I woke up, the alert had triggered our automatic
circuit breaker that caps spend at $50/hour. Total damage:
$127 instead of what would have been $2,000+ by morning.
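The anomaly check itself is nothing fancy. A minimal sketch, assuming an array of hourly cost totals (e.g. from the cost aggregation query) ordered oldest first, where the last element is the hour being checked:

```typescript
// Page when the most recent hour's spend exceeds `multiplier` times the
// rolling average of the preceding hours (capped at 7 days = 168 hours).
function costAnomaly(hourlyCosts: number[], multiplier = 2): boolean {
  if (hourlyCosts.length < 2) return false; // not enough history to judge

  const current = hourlyCosts[hourlyCosts.length - 1];
  const history = hourlyCosts.slice(0, -1).slice(-168);
  const avg = history.reduce((a, b) => a + b, 0) / history.length;
  return current > multiplier * avg;
}
```

The rolling average matters: a fixed dollar threshold either fires constantly as you grow or stays silent while a runaway loop doubles your spend.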
The Stack I Use
I am not going to pretend there is one perfect tool. Here
is what I actually run in production and why.
OpenTelemetry for trace collection. It is vendor-neutral,
it has good SDK support in TypeScript, and the span model
maps naturally to agent flows. I export to a collector that
fans out to multiple backends.
Supabase for cost and usage data. Every LLMCallLog
gets inserted into a Postgres table. I chose this over a
time-series database because I need to join LLM call data
with user data, feature flags, and business metrics. SQL
is the right tool when your queries involve JOINs. We run
a materialized view that aggregates cost by feature, model,
and hour. The dashboard queries the materialized view, not
the raw table.
CREATE MATERIALIZED VIEW llm_cost_hourly AS
SELECT
  date_trunc('hour', created_at) AS hour,
  feature,
  model,
  COUNT(*) AS call_count,
  SUM(input_tokens) AS total_input_tokens,
  SUM(output_tokens) AS total_output_tokens,
  SUM(cost_usd) AS total_cost,
  AVG(latency_ms) AS avg_latency,
  PERCENTILE_CONT(0.99)
    WITHIN GROUP (ORDER BY latency_ms) AS p99_latency
FROM llm_calls
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY 1, 2, 3;
Langfuse for quality tracking. It is purpose-built for
LLM observability and handles the things that general-purpose
tools do not: prompt versioning, evaluation scoring, and
session-level quality metrics. I send the quality signals
(groundedness score, user feedback, citation accuracy) to
Langfuse and use it for prompt A/B testing and regression
detection.
The split is intentional. Cost and latency are
infrastructure concerns -- they belong in the same system
as your other operational data. Quality is a product
concern -- it needs specialized tooling that understands
LLM-specific failure modes.
The Meta-Lesson
Six months into running AI observability in production,
the pattern I keep seeing is this: teams instrument the
easy stuff (latency, error rates, uptime) and ignore the
hard stuff (quality, cost efficiency, user satisfaction).
The easy stuff tells you whether your system is running.
The hard stuff tells you whether your system is working.
Those are different questions, and you need different tools
to answer them. Your APM dashboard is not going away. But
it is no longer sufficient. The probabilistic layer needs
its own observability stack, its own alert thresholds, and
its own on-call playbooks.
Start with three things: log every LLM call with the
fields I listed above, build a cost dashboard that updates
hourly, and add a hallucination rate metric to your quality
dashboard. You can do all three in a weekend. The rest --
trace architecture, alert tuning, quality classifiers --
can come later, informed by what you actually see in
production.
The worst observability is the kind you build after the
incident. The second worst is the kind that monitors the
wrong things. Aim for the third option: monitoring that
tells you what your users experience, not just what your
infrastructure reports.