Here is a scenario I encounter regularly: a startup launches an AI feature, users love it, usage grows, and then the finance team calls an emergency meeting. The AI feature that was supposed to be a competitive advantage is now the single largest line item on the infrastructure bill. The margin on every AI-assisted interaction is negative.
This is not a technology problem. It is an economics problem. And it is solvable -- but only if you model the economics before you scale, not after.
In my experience, the teams that succeed with AI in production are the ones that treat cost as an architectural constraint, not an afterthought. Just as a hardware engineer designs a circuit within a power budget, I architect AI systems within a cost budget.
Unit economics for AI is straightforward in concept: every AI-powered interaction has a cost and a value. Your job is to ensure that value exceeds cost at every scale.
UNIT ECONOMICS FOR AI INTERACTIONS
═══════════════════════════════════
Revenue per interaction:  What does this interaction earn?
                          (subscription allocation, transaction fee,
                          ad revenue, cost avoidance)
Cost per interaction:     What does this interaction cost?
                          (LLM API tokens + compute + storage +
                          human review + infrastructure overhead)
Margin per interaction:   Revenue - Cost
                          Must be positive at target scale.
Break-even volume:        Fixed costs / margin per interaction
                          How many interactions to cover your
                          infrastructure and team costs.
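These definitions translate directly into code. A minimal sketch with illustrative numbers (the revenue, cost, and fixed-cost figures below are made up for the example):

```python
def margin_per_interaction(revenue: float, cost: float) -> float:
    """Margin = revenue per interaction minus cost per interaction."""
    return revenue - cost

def break_even_volume(fixed_costs: float, margin: float) -> float:
    """Interactions needed for per-interaction margins to cover fixed costs."""
    if margin <= 0:
        raise ValueError("Negative or zero margin never breaks even")
    return fixed_costs / margin

# Illustrative: $0.50 earned and $0.05 spent per interaction,
# $20,000/month in fixed infrastructure and team costs.
margin = margin_per_interaction(revenue=0.50, cost=0.05)
volume = break_even_volume(fixed_costs=20_000, margin=margin)
print(f"margin ${margin:.2f}, break-even {volume:,.0f} interactions/month")
```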
Let us say you run an AI customer support system:
CUSTOMER SUPPORT AI -- UNIT ECONOMICS
──────────────────────────────────────
Revenue side:
  Average support ticket cost (human):  $12.00
  AI handles ticket autonomously:       $12.00 saved
  AI assists human (50% faster):        $6.00 saved

Cost side:
  Average tokens per ticket: 3,200 in / 800 out
  Model (Claude 3.5 Sonnet):
    Input:  3,200 * $0.003/1K = $0.0096
    Output:   800 * $0.015/1K = $0.0120
  RAG retrieval (embedding + search):   $0.002
  Infrastructure overhead (20%):        $0.005
  Total cost per ticket:                $0.0286

Margin:
  Autonomous resolution: $12.00 - $0.03 = $11.97 (99.8% margin)
  Human-assisted:         $6.00 - $0.03 =  $5.97 (99.5% margin)

At 10,000 tickets/month:
  AI cost:   $286/month
  Value:     $120,000/month (if all autonomous)
  Realistic: $96,000/month (60% autonomous at $12.00, 40% assisted at $6.00)
The margins look spectacular -- until you factor in the hidden costs.
The per-token API cost is the most visible expense, but in my experience, it typically represents only 30-50% of the true cost of running AI in production. Here are the multipliers most teams miss:
When your primary model returns a low-quality response or times out, the retry hits your budget twice. If the fallback is a more expensive model, it hits harder. I model this as a failure tax:
Effective cost = base_cost * (1 + failure_rate * retry_multiplier)
Example:
  Base cost per call: $0.03
  Failure rate:       5%
  Retry multiplier:   1.5x (fallback model costs more)

  Effective cost: $0.03 * (1 + 0.05 * 1.5)
                = $0.03 * 1.075
                = $0.032
At 5% failure rate, the impact is small. At 15%, it is material. I have seen systems with 20%+ effective failure rates because nobody measured it.
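The failure tax is one line of code, which makes it easy to keep in your cost model and track continuously:

```python
def effective_cost(base_cost: float, failure_rate: float,
                   retry_multiplier: float) -> float:
    """Failure tax: every failed call is paid for, then retried at
    retry_multiplier times the base cost (e.g. a pricier fallback model)."""
    return base_cost * (1 + failure_rate * retry_multiplier)

# The failure rates worth worrying about, at $0.03 base cost:
for rate in (0.05, 0.15, 0.20):
    print(f"{rate:.0%} failure rate: ${effective_cost(0.03, rate, 1.5):.4f}")
```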
Long system prompts are expensive at scale. A 2,000-token system prompt on every request at 100K requests/day:
2,000 tokens * $0.003/1K * 100,000 requests = $600/day = $18,000/month
This is why prompt caching matters. Anthropic's prompt caching reduces cached input token costs by up to 90%. That $18,000/month becomes $1,800/month with effective caching -- a savings that goes straight to the bottom line.
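The caching arithmetic generalizes to any hit rate. A small sketch, assuming cached input tokens are billed at a 90% discount as described above:

```python
def monthly_prompt_cost(prompt_tokens: int, price_per_1k: float,
                        requests_per_day: int, cache_hit_rate: float = 0.0,
                        cache_discount: float = 0.90) -> float:
    """Monthly cost of re-sending a fixed system prompt. Cached requests
    pay (1 - cache_discount) of the normal input-token price."""
    per_request = prompt_tokens / 1000 * price_per_1k
    effective = per_request * (1 - cache_hit_rate * cache_discount)
    return effective * requests_per_day * 30

uncached = monthly_prompt_cost(2000, 0.003, 100_000)
cached = monthly_prompt_cost(2000, 0.003, 100_000, cache_hit_rate=1.0)
print(f"${uncached:,.0f}/month -> ${cached:,.0f}/month")  # $18,000 -> $1,800
```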
Quality monitoring requires running eval suites, sampling production outputs, and sometimes using a second LLM as a judge. These costs are real and recurring:
Weekly eval suite: 500 samples * $0.03/sample = $15/week
LLM-as-judge: 500 samples * $0.05/judge = $25/week
Monthly monitoring: $160/month
Not expensive in absolute terms, but it needs to be in the budget.
If your system requires human review for a percentage of outputs (and for high-stakes applications, it should), that human time is the most expensive component:
Human review rate: 10% of interactions
Human review cost: $2.00 per review (5 minutes at $24/hr)
At 10,000 interactions: 1,000 reviews * $2.00 = $2,000/month
Suddenly the human review cost is 7x the LLM API cost.
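Folding human review into the per-interaction figure makes that dominance obvious. A small sketch using the numbers above:

```python
def fully_loaded_cost(llm_cost: float, review_rate: float,
                      review_cost: float) -> float:
    """Cost per interaction once amortized human review is included."""
    return llm_cost + review_rate * review_cost

per_interaction = fully_loaded_cost(llm_cost=0.0286, review_rate=0.10,
                                    review_cost=2.00)
print(f"${per_interaction:.4f} per interaction")  # ~$0.23, mostly review time
```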
Four optimization strategies -- prompt caching, model routing, prompt compression, and semantic caching -- deliver most of the savings. I prioritize them by return on engineering effort.
Prompt caching stores the processed system prompt on the provider's servers, so subsequent requests only send the variable portion. Across the production systems I have architected, implementation is often a single configuration flag, and the savings are immediate. This is the best effort-to-savings ratio in AI cost optimization.
Not every request needs the most capable model. I implement a routing layer that matches request complexity to model capability:
def route_request(request: LLMRequest) -> str:
    complexity = estimate_complexity(request)
    if complexity == "simple":
        # Lookups, formatting, simple extraction
        return "claude-3.5-haiku"   # $0.0008/1K input
    elif complexity == "moderate":
        # Summarization, standard generation
        return "claude-sonnet-4"    # $0.003/1K input
    else:
        # Complex reasoning, multi-step analysis
        return "claude-opus-4"      # $0.015/1K input

# 60-70% of requests route to the cheapest tier
# Only 5-10% need the most expensive model
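The `estimate_complexity` call above carries the real weight. As a placeholder, here is a hypothetical length-and-keyword heuristic operating on the raw prompt text; a production router would typically use a small trained classifier instead:

```python
REASONING_MARKERS = ("why", "compare", "analyze", "step by step", "trade-off")

def estimate_complexity(text: str) -> str:
    """Crude heuristic: long prompts or multiple reasoning keywords get
    bumped to a higher tier. Illustrative only, not production logic."""
    lowered = text.lower()
    hits = sum(marker in lowered for marker in REASONING_MARKERS)
    if len(text) < 200 and hits == 0:
        return "simple"
    if hits >= 2 or len(text) > 2000:
        return "complex"
    return "moderate"

print(estimate_complexity("What is your refund policy?"))  # simple
```

The upside of starting with a heuristic is that you can log its decisions against actual output quality, then train a proper classifier on that data.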
In my experience, proper model routing reduces costs by 30-50% with negligible quality impact on routed-down requests.
Tools like LLMLingua can compress prompts by up to 20x while preserving semantic meaning. For RAG-heavy systems where context retrieval pulls in thousands of tokens, this is significant:
Before compression:  8,000 context tokens per request
After compression:   2,000 context tokens per request
Cost reduction:      75% on context tokens
If users ask similar questions repeatedly, cache the responses. Not just exact-match caching -- semantic caching that recognizes "What are your return policies?" and "How do I return an item?" should hit the same cache entry.
Cache hit rate (typical): 15-30% for customer-facing applications
Cost reduction: 15-30% of total LLM spend
Added benefit: Sub-100ms response time for cached results
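A semantic cache can be sketched in a few lines. The embedding function below is a deliberately toy keyword counter so the example is self-contained; a real system would use a proper embedding model and a vector index:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Store (embedding, response) pairs; return a cached response when
    a new query embeds close enough to a stored one."""
    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # callable: str -> vector
        self.threshold = threshold  # minimum similarity for a hit
        self.entries = []           # list of (vector, response)

    def get(self, query):
        vec = self.embed(query)
        for stored_vec, response in self.entries:
            if cosine(vec, stored_vec) >= self.threshold:
                return response
        return None                 # miss: call the LLM, then put()

    def put(self, query, response):
        self.entries.append((self.embed(query), response))

# Toy embedding: count a few domain keywords (illustrative only).
toy_embed = lambda q: [q.lower().count(w)
                       for w in ("return", "policy", "refund", "item")]
cache = SemanticCache(embed=toy_embed)
cache.put("What is your return policy?", "30 days, original packaging.")
print(cache.get("return policy?"))  # hit: identical keyword profile
```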
I architect every production AI system with a real-time cost dashboard. Here are the metrics that matter:
COST DASHBOARD -- ESSENTIAL METRICS
════════════════════════════════════
Real-time:
  ├── Cost per hour (current burn rate)
  ├── Cost per interaction (trailing 1hr average)
  ├── Token usage by model tier
  └── Cache hit rate
Daily:
  ├── Total spend by model
  ├── Cost per feature/use-case
  ├── Margin per interaction type
  └── Anomaly detection (spend spikes)
Weekly:
  ├── Cost trend (week-over-week)
  ├── Unit economics health check
  ├── Model routing distribution
  └── Optimization opportunity report
The anomaly detection is critical. I set alerts at 2x the expected hourly spend. This catches runaway retry loops, prompt injection attacks that inflate token usage, and sudden traffic spikes before they become budget emergencies.
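The 2x alert rule itself is trivial to implement; the hard part is maintaining a good estimate of expected hourly spend. A minimal sketch, with the threshold multiplier as a parameter:

```python
def spend_alert(hourly_spend: float, expected_hourly: float,
                multiplier: float = 2.0) -> bool:
    """Fire when current burn rate exceeds multiplier x expected spend.
    Catches retry loops, token-inflating prompt injection, traffic spikes."""
    return hourly_spend > expected_hourly * multiplier

print(spend_alert(95.0, expected_hourly=40.0))  # True: 95 > 80, investigate
print(spend_alert(70.0, expected_hourly=40.0))  # False: within the 2x band
```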
One of the most impactful cost engineering exercises I led involved an AI system that was spending $85K/month on LLM API calls. The system had been built during experimentation, when cost was not a constraint, and had carried that architecture into production.
Through systematic application of these strategies -- prompt caching on the long system prompts, model routing to send 65% of requests to a smaller model, and semantic caching for the 20% most common query patterns -- we reduced the monthly spend to $25K. That is $60K/month in savings, or $720K/year, without any degradation in user-facing quality metrics.
The key insight: the savings came from architecture, not from cutting corners. The system was better after optimization because the constraints forced clearer thinking about what each component actually needed.
| Strategy | Effort to Implement | Cost Reduction | Quality Risk | When to Skip |
|----------|---------------------|----------------|--------------|--------------|
| Prompt caching | Low (config flag) | 40-60% on input tokens | None | Short, unique prompts with no reusable system context |
| Model routing | Medium (classifier + routing logic) | 30-50% overall | Low if routing is accurate | Single use case where all requests need the same capability |
| Prompt compression | Medium (integration + testing) | Up to 75% on context tokens | Medium -- lossy compression can degrade reasoning | Tasks requiring exact-quote retrieval or legal precision |
| Semantic caching | High (embedding pipeline + cache infra) | 15-30% overall | Low for stable domains, high for rapidly changing data | Domains where answers change frequently (live data, news) |
The order matters: implement prompt caching before investing in routing or compression. The effort-to-savings ratio drops sharply after the first two strategies.
Before scaling any AI feature, verify the unit economics: margin per interaction is positive at target scale, the hidden costs (retries, monitoring, human review) are in the model, and the break-even volume is within reach. AI that is not profitable is not viable. Viable AI starts with unit economics.
You now know what your AI features cost and how to make them profitable. The next lesson addresses the strategic question: what happens when your vendor changes the pricing, deprecates the model, or has a six-hour outage? We build the vendor off-ramp pattern -- the three-layer architecture that lets you switch providers in hours, not months.