Observability for AI Systems
Your Datadog dashboard is green. Every health check passes.
Response times are within SLA. And your AI feature is
producing confident, well-formatted, completely wrong answers
to 12% of user queries. Nobody knows.
This is the gap that traditional observability was never
designed to close. I spent six months building monitoring
for LLM-powered systems in production, and the biggest
lesson was this: the tools we have work fine for the
infrastructure layer. They are useless for the intelligence
layer.
Why Traditional APM Fails for AI
Application Performance Monitoring was designed for a world
where code is deterministic. You send a request, you get a
response, and the response is either correct or your code
has a bug. You can write assertions. You can diff outputs.
You can alert on error codes.
AI systems break every one of those assumptions.
Same input, different output. Send the same prompt to
GPT-4o twice and you might get two different answers. Both
could be correct. Both could be wrong. The 200 OK status
code tells you nothing about whether the response was
actually useful.
Latency is a distribution, not a number. A 50-token
prompt might return in 800ms. A 2,000-token prompt with
a complex system message might take 8 seconds. Both are
"working correctly." Your p50 latency is meaningless if
your p99 is 10x higher because of prompt length variation.
"Success" is subjective. When a user asks "What is our
refund policy?" and the LLM responds with a plausible but
outdated policy from 2023, is that a success? Your HTTP
status says yes. Your customer says no. There is no error
code for "technically fluent but factually wrong."
I tried bolting LLM monitoring onto our existing Datadog
setup. We got pretty charts that told us the API was up.
We had no idea if the answers were good.
The Three Pillars of AI Observability
I landed on three pillars that actually matter for
production AI systems. Not metrics, logs, and traces --
those are implementation details. The pillars are what
you are actually trying to observe.
1. Token Economics
Every LLM call has a cost, and that cost varies wildly
based on model selection, prompt length, and output length.
If you are not tracking this per-call, you are flying blind
on unit economics.
I log five numbers on every single LLM call:
interface LLMCallMetrics {
  model: string        // 'gpt-4o' | 'claude-sonnet-4-20250514'
  inputTokens: number  // prompt + system message
  outputTokens: number // completion length
  costUsd: number      // calculated from model pricing
  cachedTokens: number // prompt cache hits (if supported)
}
Why each field matters:
- model: You will have multiple models in production.
Cost-per-query means nothing without knowing which model
served it.
- inputTokens: This is your lever for cost control.
Prompt engineering is really token engineering.
- outputTokens: The part you cannot fully control. Set
max_tokens as a guardrail, not a target.
- costUsd: Pre-calculate this. Do not make your finance
team reverse-engineer it from the API bill.
- cachedTokens: Anthropic and OpenAI both support
prompt caching. If your cache hit rate is below 60% on
repeated system prompts, you are leaving money on the
table.
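Pre-calculating costUsd is a one-liner once you have a pricing table. A minimal sketch, assuming a hand-maintained per-million-token price map (the numbers below are illustrative placeholders, not real pricing -- load yours from config) and the cached-token discount that both providers apply:

```typescript
// Illustrative per-million-token prices; replace with your real pricing config.
const PRICING_PER_MTOK: Record<
  string,
  { input: number; output: number; cachedInput: number }
> = {
  "gpt-4o": { input: 2.5, output: 10, cachedInput: 1.25 },
};

function calcCostUsd(
  model: string,
  inputTokens: number,
  outputTokens: number,
  cachedTokens: number,
): number {
  const p = PRICING_PER_MTOK[model];
  if (!p) return 0; // unknown model: log zero and alert, don't crash the call path

  // Cached input tokens are billed at a discounted rate; the rest at full price.
  const uncachedInput = inputTokens - cachedTokens;
  return (
    (uncachedInput * p.input +
      cachedTokens * p.cachedInput +
      outputTokens * p.output) /
    1_000_000
  );
}
```

Computing this at log time, rather than in a reporting job, means every downstream query can just SUM(cost_usd).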
We track cost per feature, per user segment, and per model.
Our weekly cost report shows exactly where every dollar
goes. When one feature spiked 3x in cost overnight, we
caught it in the morning standup because the dashboard
flagged it -- not three weeks later on the invoice.
2. Quality Signals
This is the hard one. How do you measure whether an AI
response was "good"?
You cannot automate taste. But you can build proxies that
correlate with quality, and you can track them over time
to detect regressions.
Hallucination rate. If your system uses RAG, you can
measure whether the response is grounded in the retrieved
context. I use a lightweight classifier that checks whether
key claims in the response appear in the source documents.
It is not perfect, but it catches the obvious fabrications.
Our target: below 3% hallucination rate on factual queries.
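To make "lightweight classifier" concrete, here is a crude word-overlap version of the idea, a sketch rather than the real thing: treat each sentence as a claim and call it grounded if most of its words appear in the retrieved context. The 0.6 threshold is an assumption to tune; a trained classifier or an NLI model does much better, but even this catches the obvious fabrications.

```typescript
// Crude groundedness proxy: fraction of response sentences whose words
// mostly appear in the retrieved context. Threshold (0.6) is a tunable
// assumption, not a recommendation.
function groundednessScore(response: string, contextDocs: string[]): number {
  const contextWords = new Set(
    contextDocs.join(" ").toLowerCase().match(/[a-z0-9]+/g) ?? [],
  );
  const sentences = response
    .split(/(?<=[.!?])\s+/)
    .filter((s) => s.trim().length > 0);
  if (sentences.length === 0) return 1;

  let grounded = 0;
  for (const s of sentences) {
    const words = s.toLowerCase().match(/[a-z0-9]+/g) ?? [];
    if (words.length === 0) {
      grounded++;
      continue;
    }
    const hits = words.filter((w) => contextWords.has(w)).length;
    if (hits / words.length >= 0.6) grounded++;
  }
  return grounded / sentences.length;
}
```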
Citation accuracy. When the system cites a source, is
that source real? Does it actually say what the system
claims it says? We log the citation URL, the claimed quote,
and a boolean for whether the quote exists in the source.
This is cheap to verify and catches a common failure mode.
User satisfaction signals. Thumbs up/down on responses.
Whether the user reformulated their query (a signal they
did not get what they needed). Session abandonment rate
after an AI interaction. Time-to-next-action after
receiving a response. None of these are perfect. Together,
they paint a picture.
Response consistency. For the same query class, how much
does the response vary? High variance on factual questions
is a red flag. High variance on creative tasks might be
fine. I hash the semantic structure of responses (not the
exact text) and track variance by query category.
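A crude stand-in for that structure hash, under the assumption that stripping surface variation (case, punctuation, short words, word order) is enough to make paraphrases of the same answer collide while substantively different answers do not; an embedding-based similarity is the stronger version of the same idea:

```typescript
import { createHash } from "node:crypto";

// Normalize away surface variation, then hash what remains. Two
// rewordings of the same answer tend to map to the same hash; two
// different answers tend not to. Deliberately lossy.
function structureHash(response: string): string {
  const normalized = [
    ...new Set(
      (response.toLowerCase().match(/[a-z]+/g) ?? []).filter(
        (w) => w.length > 3, // drop short function words
      ),
    ),
  ]
    .sort()
    .join(" ");
  return createHash("sha256").update(normalized).digest("hex");
}
```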
We built a quality dashboard that shows these metrics on a
7-day rolling window. When hallucination rate crosses 5%,
we get a Slack alert. It has fired three times in six
months, and each time it caught a real regression --
usually a prompt change that inadvertently weakened
grounding instructions.
3. Latency Profiling
LLM latency is not one number. It is a stack of numbers,
and you need to decompose it to optimize it.
Total latency: 4,200ms
├── Embedding generation: 45ms
├── Vector search: 120ms
├── Reranking: 280ms
├── Context assembly: 15ms
├── LLM time-to-first-token: 890ms
├── LLM streaming: 2,800ms
└── Post-processing: 50ms
The breakdown matters because the optimization strategy
differs for each segment. Embedding latency? Cache it.
Vector search slow? Check your index configuration.
LLM streaming taking too long? Maybe your context window
is bloated and you need to trim retrieved chunks.
I track three latency percentiles per segment: p50, p95,
and p99. The p50 tells you about typical experience. The
p99 tells you about your worst-case users. In our system,
the p99 was 6x the p50 -- almost entirely because of
prompt length variation. We added a token budget that caps
context at 4,000 tokens and the p99 dropped by 40%.
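The token budget itself is simple: keep the highest-ranked retrieved chunks until the cap is hit. A minimal sketch, where countTokens is a placeholder estimate -- in production use your model's actual tokenizer (e.g. tiktoken) instead:

```typescript
// Rough token estimate: ~4 characters per token. Replace with a real
// tokenizer in production; this is a placeholder assumption.
const countTokens = (text: string): number => Math.ceil(text.length / 4);

// Greedily keep highest-ranked chunks (input is assumed sorted by rank)
// until adding the next chunk would exceed the budget.
function applyTokenBudget(rankedChunks: string[], budget = 4000): string[] {
  const kept: string[] = [];
  let used = 0;
  for (const chunk of rankedChunks) {
    const t = countTokens(chunk);
    if (used + t > budget) break;
    kept.push(chunk);
    used += t;
  }
  return kept;
}
```

Stopping at the first overflow, rather than skipping it and trying smaller chunks, keeps the context in rank order and the behavior predictable.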
What to Log on Every LLM Call
Here is the exact payload I attach to every LLM call in
production. Every field has earned its place through a
debugging session where I wished I had it.
interface LLMCallLog {
  // Identity
  traceId: string
  spanId: string
  parentSpanId: string | null
  feature: string       // 'search' | 'chat' | 'summary'

  // Request
  model: string
  promptHash: string    // SHA-256 of system + user prompt
  inputTokens: number
  temperature: number
  maxTokens: number

  // Response
  outputTokens: number
  responseHash: string  // SHA-256 of completion
  finishReason: string  // 'stop' | 'length' | 'tool_use'
  toolCalls: string[]   // names of tools invoked

  // Economics
  costUsd: number
  cachedTokens: number

  // Timing
  latencyMs: number
  ttftMs: number        // time to first token

  // Quality
  groundednessScore: number | null  // 0-1, RAG only
  userFeedback: 'positive' | 'negative' | null
}
The promptHash deserves special mention. I do not log
the full prompt -- that is a privacy and storage problem.
But I hash it so I can correlate quality regressions with
prompt changes. When hallucination rate spikes, I check
whether the prompt hash changed recently. It usually did.
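The hash is a few lines with Node's crypto module. One detail worth getting right: hash the system and user prompts with a separator between them, so moving text from one to the other produces a different hash:

```typescript
import { createHash } from "node:crypto";

// SHA-256 over system + user prompt. The separator ensures
// ("ab", "c") and ("a", "bc") hash differently.
function promptHash(systemPrompt: string, userPrompt: string): string {
  return createHash("sha256")
    .update(systemPrompt)
    .update("\n---\n")
    .update(userPrompt)
    .digest("hex");
}
```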
The finishReason field has saved me twice. Once when a
prompt change caused 30% of responses to hit the token
limit and get truncated mid-sentence. The dashboard showed
finish_reason: 'length' spiking from 2% to 30% and we
caught it within an hour.
Trace Architecture for Agent Flows
Single LLM calls are straightforward. Agent flows -- where
one LLM call triggers tool use, which triggers another LLM
call, which triggers more tool use -- are where
observability gets interesting.
I use OpenTelemetry spans with a parent-child hierarchy
that mirrors the agent's execution:
Agent Trace (parent span)
├── Planning Step (child span)
│   ├── LLM Call: "What tools do I need?" (leaf span)
│   └── Decision: use [search, calculator]
├── Tool: search (child span)
│   ├── Query formulation (leaf span)
│   ├── API call to search service (leaf span)
│   └── Result parsing (leaf span)
├── Tool: calculator (child span)
│   └── Computation (leaf span)
└── Synthesis Step (child span)
    ├── LLM Call: "Combine results" (leaf span)
    └── Response formatting (leaf span)
Each span carries the LLMCallLog fields when it involves
an LLM call. The parent span aggregates: total cost, total
latency, total tokens, number of LLM calls, number of tool
invocations.
This structure lets me answer questions that flat logs
cannot:
- "Which tool call is the bottleneck in this agent flow?"
- "What percentage of agent runs use more than 3 tool
calls?" (Ours: 15%. Those runs cost 4x more.)
- "When the agent retries a tool call, does the second
attempt succeed?" (Ours: 60% of the time. The other 40%
is wasted spend.)
The key implementation detail: propagate the trace context
through tool calls. If your search tool makes its own API
calls, those should be child spans of the tool span, not
orphaned traces. OpenTelemetry context propagation handles
this, but you have to wire it up deliberately.
Alert Design: What Is Worth Paging For
Not everything that is interesting is actionable. I learned
this the hard way after setting up 23 alerts and getting
paged for things that did not require immediate action.
Here is what survived the pruning:
Page-worthy (PagerDuty, immediate):
- Cost per hour exceeds 2x the 7-day rolling average.
This catches runaway loops, prompt injection attacks
that cause excessive token usage, and model upgrades
that silently increase per-token pricing.
- p99 latency exceeds 15 seconds. At this point, users
are abandoning. Something is broken.
- Error rate on LLM calls exceeds 5%. The provider is
having an outage or your API key is rate-limited.
Alert-worthy (Slack, business hours):
- Hallucination rate exceeds 5% on a 24-hour window.
Quality regression. Investigate prompt changes or
retrieval pipeline changes.
- finish_reason: 'length' exceeds 10%. Responses are
getting truncated. Either prompts got longer or
max_tokens needs adjustment.
- Cache hit rate drops below 40%. You are paying for
redundant computation.
Dashboard-only (review weekly):
- Cost per feature trends. Useful for capacity planning,
not for incident response.
- Model distribution. What percentage of calls go to
each model? Shifting patterns indicate routing changes.
- User feedback ratio. Trends matter, daily fluctuations
do not.
The cost anomaly alert is the most valuable one we have.
It fired at 2 AM on a Tuesday when a deployment introduced
a bug that caused the agent to loop indefinitely. By the
time I woke up, the alert had triggered our automatic
circuit breaker that caps spend at $50/hour. Total damage:
$127 instead of what would have been $2,000+ by morning.
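The anomaly check itself is nothing fancy. A minimal sketch, assuming an array of hourly cost totals (e.g. from the cost aggregation query) ordered oldest first, where the last element is the hour being checked:

```typescript
// Page when the most recent hour's spend exceeds `multiplier` times the
// rolling average of the preceding hours (capped at 7 days = 168 hours).
function costAnomaly(hourlyCosts: number[], multiplier = 2): boolean {
  if (hourlyCosts.length < 2) return false; // not enough history to judge

  const current = hourlyCosts[hourlyCosts.length - 1];
  const history = hourlyCosts.slice(0, -1).slice(-168);
  const avg = history.reduce((a, b) => a + b, 0) / history.length;
  return current > multiplier * avg;
}
```

The rolling average matters: a fixed dollar threshold either fires constantly as you grow or stays silent while a runaway loop doubles your spend.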
The Stack I Use
I am not going to pretend there is one perfect tool. Here
is what I actually run in production and why.
OpenTelemetry for trace collection. It is vendor-neutral,
it has good SDK support in TypeScript, and the span model
maps naturally to agent flows. I export to a collector that
fans out to multiple backends.
Supabase for cost and usage data. Every LLMCallLog
gets inserted into a Postgres table. I chose this over a
time-series database because I need to join LLM call data
with user data, feature flags, and business metrics. SQL
is the right tool when your queries involve JOINs. We run
a materialized view that aggregates cost by feature, model,
and hour. The dashboard queries the materialized view, not
the raw table.
CREATE MATERIALIZED VIEW llm_cost_hourly AS
SELECT
  date_trunc('hour', created_at) AS hour,
  feature,
  model,
  COUNT(*) AS call_count,
  SUM(input_tokens) AS total_input_tokens,
  SUM(output_tokens) AS total_output_tokens,
  SUM(cost_usd) AS total_cost,
  AVG(latency_ms) AS avg_latency,
  PERCENTILE_CONT(0.99)
    WITHIN GROUP (ORDER BY latency_ms) AS p99_latency
FROM llm_calls
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY 1, 2, 3;
Langfuse for quality tracking. It is purpose-built for
LLM observability and handles the things that general-purpose
tools do not: prompt versioning, evaluation scoring, and
session-level quality metrics. I send the quality signals
(groundedness score, user feedback, citation accuracy) to
Langfuse and use it for prompt A/B testing and regression
detection.
The split is intentional. Cost and latency are
infrastructure concerns -- they belong in the same system
as your other operational data. Quality is a product
concern -- it needs specialized tooling that understands
LLM-specific failure modes.
The Meta-Lesson
Six months into running AI observability in production,
the pattern I keep seeing is this: teams instrument the
easy stuff (latency, error rates, uptime) and ignore the
hard stuff (quality, cost efficiency, user satisfaction).
The easy stuff tells you whether your system is running.
The hard stuff tells you whether your system is working.
Those are different questions, and you need different tools
to answer them. Your APM dashboard is not going away. But
it is no longer sufficient. The probabilistic layer needs
its own observability stack, its own alert thresholds, and
its own on-call playbooks.
Start with three things: log every LLM call with the
fields I listed above, build a cost dashboard that updates
hourly, and add a hallucination rate metric to your quality
dashboard. You can do all three in a weekend. The rest --
trace architecture, alert tuning, quality classifiers --
can come later, informed by what you actually see in
production.
The worst observability is the kind you build after the
incident. The second worst is the kind that monitors the
wrong things. Aim for the third option: monitoring that
tells you what your users experience, not just what your
infrastructure reports.