Here is a scenario I encounter regularly: a startup launches an AI feature, users love it, usage grows, and then the finance team calls an emergency meeting. The AI feature that was supposed to be a competitive advantage is now the single largest line item on the infrastructure bill. The margin on every AI-assisted interaction is negative.
This is not a technology problem. It is an economics problem. And it is solvable -- but only if you model the economics before you scale, not after.
In my experience, the teams that succeed with AI in production are the ones that treat cost as an architectural constraint, not an afterthought. Just as a hardware engineer designs a circuit within a power budget, I architect AI systems within a cost budget.
Unit economics for AI is straightforward in concept: every AI-powered interaction has a cost and a value. Your job is to ensure that value exceeds cost at every scale.
UNIT ECONOMICS FOR AI INTERACTIONS
═══════════════════════════════════
Revenue per interaction:  What does this interaction earn?
                          (subscription allocation, transaction fee,
                          ad revenue, cost avoidance)
Cost per interaction:     What does this interaction cost?
                          (LLM API tokens + compute + storage +
                          human review + infrastructure overhead)
Margin per interaction:   Revenue - Cost
                          Must be positive at target scale.
Break-even volume:        Fixed costs / margin per interaction
                          How many interactions to cover your
                          infrastructure and team costs.
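These definitions translate directly into code. A minimal sketch with illustrative numbers (the revenue, cost, and fixed-cost figures below are made up for the example):

```python
def margin_per_interaction(revenue: float, cost: float) -> float:
    """Margin = revenue per interaction minus cost per interaction."""
    return revenue - cost

def break_even_volume(fixed_costs: float, margin: float) -> float:
    """Interactions needed for per-interaction margins to cover fixed costs."""
    if margin <= 0:
        raise ValueError("Negative or zero margin never breaks even")
    return fixed_costs / margin

# Illustrative: $0.50 earned and $0.05 spent per interaction,
# $20,000/month in fixed infrastructure and team costs.
margin = margin_per_interaction(revenue=0.50, cost=0.05)
volume = break_even_volume(fixed_costs=20_000, margin=margin)
print(f"margin ${margin:.2f}, break-even {volume:,.0f} interactions/month")
```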
Let us say you run an AI customer support system:
CUSTOMER SUPPORT AI -- UNIT ECONOMICS
──────────────────────────────────────
Revenue side:
  Average support ticket cost (human):  $12.00
  AI handles ticket autonomously:       $12.00 saved
  AI assists human (50% faster):        $6.00 saved

Cost side:
  Average tokens per ticket: 3,200 in / 800 out
  Model (Claude 3.5 Sonnet):
    Input:  3,200 * $0.003/1K = $0.0096
    Output:   800 * $0.015/1K = $0.0120
  RAG retrieval (embedding + search):   $0.002
  Infrastructure overhead (20%):        $0.005
  Total cost per ticket:                $0.0286

Margin:
  Autonomous resolution: $12.00 - $0.03 = $11.97 (99.8% margin)
  Human-assisted:         $6.00 - $0.03 =  $5.97 (99.5% margin)

At 10,000 tickets/month:
  AI cost:   $286/month
  Value:     $120,000/month (if all autonomous)
  Realistic: $96,000/month (60% autonomous at $12.00, 40% assisted at $6.00)
The margins look spectacular -- until you factor in the hidden costs.
The per-token API cost is the most visible expense, but in my experience, it typically represents only 30-50% of the true cost of running AI in production. Here are the multipliers most teams miss:
When your primary model returns a low-quality response or times out, the retry hits your budget twice. If the fallback is a more expensive model, it hits harder. I model this as a failure tax:
Effective cost = base_cost * (1 + failure_rate * retry_multiplier)
Example:
  Base cost per call: $0.03
  Failure rate:       5%
  Retry multiplier:   1.5x (fallback model costs more)

  Effective cost: $0.03 * (1 + 0.05 * 1.5)
                = $0.03 * 1.075
                = $0.032
At 5% failure rate, the impact is small. At 15%, it is material. I have seen systems with 20%+ effective failure rates because nobody measured it.
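The failure tax is one line of code, which makes it easy to keep in your cost model and track continuously:

```python
def effective_cost(base_cost: float, failure_rate: float,
                   retry_multiplier: float) -> float:
    """Failure tax: every failed call is paid for, then retried at
    retry_multiplier times the base cost (e.g. a pricier fallback model)."""
    return base_cost * (1 + failure_rate * retry_multiplier)

# The failure rates worth worrying about, at $0.03 base cost:
for rate in (0.05, 0.15, 0.20):
    print(f"{rate:.0%} failure rate: ${effective_cost(0.03, rate, 1.5):.4f}")
```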
Long system prompts are expensive at scale. A 2,000-token system prompt on every request at 100K requests/day:
2,000 tokens * $0.003/1K * 100,000 requests = $600/day = $18,000/month
This is why prompt caching matters. Anthropic's prompt caching reduces cached input token costs by up to 90%. That $18,000/month becomes $1,800/month with effective caching -- a savings that goes straight to the bottom line.
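The caching arithmetic generalizes to any hit rate. A small sketch, assuming cached input tokens are billed at a 90% discount as described above:

```python
def monthly_prompt_cost(prompt_tokens: int, price_per_1k: float,
                        requests_per_day: int, cache_hit_rate: float = 0.0,
                        cache_discount: float = 0.90) -> float:
    """Monthly cost of re-sending a fixed system prompt. Cached requests
    pay (1 - cache_discount) of the normal input-token price."""
    per_request = prompt_tokens / 1000 * price_per_1k
    effective = per_request * (1 - cache_hit_rate * cache_discount)
    return effective * requests_per_day * 30

uncached = monthly_prompt_cost(2000, 0.003, 100_000)
cached = monthly_prompt_cost(2000, 0.003, 100_000, cache_hit_rate=1.0)
print(f"${uncached:,.0f}/month -> ${cached:,.0f}/month")  # $18,000 -> $1,800
```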
Quality monitoring requires running eval suites, sampling production outputs, and sometimes using a second LLM as a judge. These costs are real and recurring:
Weekly eval suite: 500 samples * $0.03/sample = $15/week
LLM-as-judge: 500 samples * $0.05/judge = $25/week
Monthly monitoring: $160/month
Not expensive in absolute terms, but it needs to be in the budget.
If your system requires human review for a percentage of outputs (and for high-stakes applications, it should), that human time is the most expensive component:
Human review rate: 10% of interactions
Human review cost: $2.00 per review (5 minutes at $24/hr)
At 10,000 interactions: 1,000 reviews * $2.00 = $2,000/month
Suddenly the human review cost is 7x the LLM API cost.
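Folding human review into the per-interaction figure makes that dominance obvious. A small sketch using the numbers above:

```python
def fully_loaded_cost(llm_cost: float, review_rate: float,
                      review_cost: float) -> float:
    """Cost per interaction once amortized human review is included."""
    return llm_cost + review_rate * review_cost

per_interaction = fully_loaded_cost(llm_cost=0.0286, review_rate=0.10,
                                    review_cost=2.00)
print(f"${per_interaction:.4f} per interaction")  # ~$0.23, mostly review time
```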
Four optimization strategies -- prompt caching, model routing, prompt compression, and semantic caching -- deliver most of the savings. I prioritize them by return on engineering effort.
Prompt caching stores the processed system prompt on the provider's servers, so subsequent requests only send the variable portion. Across the production systems I have architected, implementation is often a single configuration flag, and the savings are immediate. This is the best effort-to-savings ratio in AI cost optimization.
Not every request needs the most capable model. I implement a routing layer that matches request complexity to model capability:
def route_request(request: LLMRequest) -> str:
    complexity = estimate_complexity(request)
    if complexity == "simple":
        # Lookups, formatting, simple extraction
        return "claude-3.5-haiku"   # $0.0008/1K input
    elif complexity == "moderate":
        # Summarization, standard generation
        return "claude-sonnet-4"    # $0.003/1K input
    else:
        # Complex reasoning, multi-step analysis
        return "claude-opus-4"      # $0.015/1K input

# 60-70% of requests route to the cheapest tier
# Only 5-10% need the most expensive model
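The `estimate_complexity` call above carries the real weight. As a placeholder, here is a hypothetical length-and-keyword heuristic operating on the raw prompt text; a production router would typically use a small trained classifier instead:

```python
REASONING_MARKERS = ("why", "compare", "analyze", "step by step", "trade-off")

def estimate_complexity(text: str) -> str:
    """Crude heuristic: long prompts or multiple reasoning keywords get
    bumped to a higher tier. Illustrative only, not production logic."""
    lowered = text.lower()
    hits = sum(marker in lowered for marker in REASONING_MARKERS)
    if len(text) < 200 and hits == 0:
        return "simple"
    if hits >= 2 or len(text) > 2000:
        return "complex"
    return "moderate"

print(estimate_complexity("What is your refund policy?"))  # simple
```

The upside of starting with a heuristic is that you can log its decisions against actual output quality, then train a proper classifier on that data.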
In my experience, proper model routing reduces costs by 30-50% with negligible quality impact on routed-down requests.
Tools like LLMLingua can compress prompts by up to 20x while preserving semantic meaning. For RAG-heavy systems where context retrieval pulls in thousands of tokens, this is significant:
Before compression:  8,000 context tokens per request
After compression:   2,000 context tokens per request
Cost reduction:      75% on context tokens
If users ask similar questions repeatedly, cache the responses. Not just exact-match caching -- semantic caching that recognizes "What are your return policies?" and "How do I return an item?" should hit the same cache entry.
Cache hit rate (typical): 15-30% for customer-facing applications
Cost reduction: 15-30% of total LLM spend
Added benefit: Sub-100ms response time for cached results
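A semantic cache can be sketched in a few lines. The embedding function below is a deliberately toy keyword counter so the example is self-contained; a real system would use a proper embedding model and a vector index:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Store (embedding, response) pairs; return a cached response when
    a new query embeds close enough to a stored one."""
    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # callable: str -> vector
        self.threshold = threshold  # minimum similarity for a hit
        self.entries = []           # list of (vector, response)

    def get(self, query):
        vec = self.embed(query)
        for stored_vec, response in self.entries:
            if cosine(vec, stored_vec) >= self.threshold:
                return response
        return None                 # miss: call the LLM, then put()

    def put(self, query, response):
        self.entries.append((self.embed(query), response))

# Toy embedding: count a few domain keywords (illustrative only).
toy_embed = lambda q: [q.lower().count(w)
                       for w in ("return", "policy", "refund", "item")]
cache = SemanticCache(embed=toy_embed)
cache.put("What is your return policy?", "30 days, original packaging.")
print(cache.get("return policy?"))  # hit: identical keyword profile
```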
I architect every production AI system with a real-time cost dashboard. Here are the metrics that matter:
COST DASHBOARD -- ESSENTIAL METRICS
════════════════════════════════════
Real-time:
  ├── Cost per hour (current burn rate)
  ├── Cost per interaction (trailing 1hr average)
  ├── Token usage by model tier
  └── Cache hit rate
Daily:
  ├── Total spend by model
  ├── Cost per feature/use-case
  ├── Margin per interaction type
  └── Anomaly detection (spend spikes)
Weekly:
  ├── Cost trend (week-over-week)
  ├── Unit economics health check
  ├── Model routing distribution
  └── Optimization opportunity report
The anomaly detection is critical. I set alerts at 2x the expected hourly spend. This catches runaway retry loops, prompt injection attacks that inflate token usage, and sudden traffic spikes before they become budget emergencies.
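The 2x alert rule itself is trivial to implement; the hard part is maintaining a good estimate of expected hourly spend. A minimal sketch, with the threshold multiplier as a parameter:

```python
def spend_alert(hourly_spend: float, expected_hourly: float,
                multiplier: float = 2.0) -> bool:
    """Fire when current burn rate exceeds multiplier x expected spend.
    Catches retry loops, token-inflating prompt injection, traffic spikes."""
    return hourly_spend > expected_hourly * multiplier

print(spend_alert(95.0, expected_hourly=40.0))  # True: 95 > 80, investigate
print(spend_alert(70.0, expected_hourly=40.0))  # False: within the 2x band
```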
One of the most impactful cost engineering exercises I led involved an AI system that was spending $85K/month on LLM API calls. The system had been built during experimentation, when cost was not a constraint, and had carried that architecture into production.
Through systematic application of these strategies -- prompt caching on the long system prompts, model routing to send 65% of requests to a smaller model, and semantic caching for the 20% most common query patterns -- we reduced the monthly spend to $25K. That is $60K/month in savings, or $720K/year, without any degradation in user-facing quality metrics.
The key insight: the savings came from architecture, not from cutting corners. The system was better after optimization because the constraints forced clearer thinking about what each component actually needed.
| Strategy | Effort to Implement | Cost Reduction | Quality Risk | When to Skip |
|----------|---------------------|----------------|--------------|--------------|
| Prompt caching | Low (config flag) | 40-60% on input tokens | None | Short, unique prompts with no reusable system context |
| Model routing | Medium (classifier + routing logic) | 30-50% overall | Low if routing is accurate | Single use case where all requests need the same capability |
| Prompt compression | Medium (integration + testing) | Up to 75% on context tokens | Medium -- lossy compression can degrade reasoning | Tasks requiring exact-quote retrieval or legal precision |
| Semantic caching | High (embedding pipeline + cache infra) | 15-30% overall | Low for stable domains, high for rapidly changing data | Domains where answers change frequently (live data, news) |
The order matters: implement prompt caching before investing in routing or compression. The effort-to-savings ratio drops sharply after the first two strategies.
Before scaling any AI feature, verify the unit economics: margin per interaction is positive at target scale, the hidden costs (retries, monitoring, human review) are in the model, and the break-even volume is within reach. AI that is not profitable is not viable. Viable AI starts with unit economics.
You now know what your AI features cost and how to make them profitable. The next lesson addresses the strategic question: what happens when your vendor changes the pricing, deprecates the model, or has a six-hour outage? We build the vendor off-ramp pattern -- the three-layer architecture that lets you switch providers in hours, not months.