How I Cut RAG Costs by 99%
The retrieval architecture, cost playbook, and eval harness behind a 99% reduction in per-query RAG costs — from $4.85 to $0.05.
RAG demos are cheap. RAG at scale is not.
Most retrieval-augmented generation prototypes I've audited cost between $2 and $5 per query once you account for embedding generation, vector search, and the LLM call. That math seems fine in a notebook. Run 10,000 queries a day and you're staring at roughly $1.5M a month in inference spend — before you've earned a dollar of revenue.
The gap between "it works in a demo" and "it works in production" is where teams lose months and money. This case study is the playbook I built to close that gap: the retrieval architecture, the cost levers, the eval harness, and the hard numbers.
The Retrieval Architecture
Every design choice here traces back to one question: does this decision improve unit economics or reliability? If it doesn't do either, it's complexity for its own sake.
Chunking — Parent-Child with Token Awareness. I chose parent-child chunking over fixed-size sliding windows. Why? Fixed-size chunks split sentences mid-thought, which tanks retrieval precision. Parent-child lets me retrieve a small child chunk for matching, then expand to the full parent for context. I also enforce token-aware boundaries so no chunk wastes tokens on incomplete sentences — every token in the context window earns its keep.
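A minimal sketch of the parent-child split. Whitespace token counts and period-based sentence splitting stand in for a real tokenizer and sentence segmenter here, so treat this as illustrative rather than the production chunker:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str    # small child chunk, embedded and matched at query time
    parent: str  # full parent passage, handed to the LLM at generation time

def n_tokens(text: str) -> int:
    # Stand-in token count; a real pipeline would use the model's tokenizer.
    return len(text.split())

def parent_child_chunks(paragraphs, max_child_tokens=64):
    """Split each parent paragraph into sentence-aligned child chunks.

    Boundaries only ever fall between sentences, so no chunk ends
    mid-thought and no context tokens are spent on fragments."""
    chunks = []
    for parent in paragraphs:
        # Naive sentence split; swap in a proper segmenter in practice.
        sentences = [s.strip() + "." for s in parent.split(".") if s.strip()]
        current = []
        for sent in sentences:
            if current and n_tokens(" ".join(current + [sent])) > max_child_tokens:
                chunks.append(Chunk(" ".join(current), parent))
                current = []
            current.append(sent)
        if current:
            chunks.append(Chunk(" ".join(current), parent))
    return chunks
```

At query time you match against `text` but feed `parent` to the model, which is the whole point of the scheme.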
Embeddings — Right-Sized, Not Biggest. I use text-embedding-3-small (1536 dimensions) for the initial retrieval pass. It's roughly 5x cheaper than text-embedding-3-large and, on our domain-specific benchmarks, retains 95% of the recall. The larger model only fires during reranking, where it touches the top 20 candidates instead of the full corpus. This alone cut embedding costs by 80%.
Hybrid Search — Dense + Sparse on Supabase. Pure vector search misses exact terminology ("Error code 4012"). Pure keyword search misses semantic similarity ("billing problem" vs. "invoice issue"). I run both in parallel — pgvector for dense retrieval and pg_trgm for trigram-based keyword matching — then merge results with reciprocal rank fusion. Supabase hosts both, which means one managed Postgres instance instead of a separate vector database. Fewer services, lower bill, simpler ops.
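The merge step is small enough to show in full. This is generic reciprocal rank fusion over any number of rankings; the k=60 constant is the conventional default, not a value from my pipeline:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked result lists: each doc scores sum(1 / (k + rank)).

    RRF works on raw rank positions, so dense cosine scores and sparse
    trigram scores never have to be put on a common scale."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear in both lists accumulate score from each, which is why the fusion naturally favors results the two retrievers agree on.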
Reranking — Cross-Encoder as a Precision Filter. The initial retrieval over-fetches by design (top 100). A cross-encoder reranker scores each candidate against the query and prunes down to the top 5. This lifted precision significantly without touching the retrieval index itself. Think of it as a guardrail: cheap retrieval casts a wide net, expensive reranking narrows it.
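The precision filter reduces to score-and-sort. Here `score_fn` is a stand-in for a real cross-encoder's predict call (e.g. a sentence-transformers CrossEncoder); this sketch only shows the shape of the filter, not the model:

```python
def rerank(query, candidates, score_fn, top_k=5):
    """Score each (query, candidate) pair and keep the top_k.

    A cross-encoder sees query and document together in one forward
    pass, which is what buys the precision over embedding similarity."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```

In the pipeline above, `candidates` is the over-fetched top 100 from hybrid retrieval, and only the 5 survivors reach the LLM's context window.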
The Cost Reduction Playbook
The 99% figure isn't one trick — it's five levers compounding.
1. Semantic Caching. I added a Redis cache layer keyed on query embeddings with a cosine similarity threshold of 0.97. If a new query is near-identical to a cached one, we return the cached result and skip both embedding generation and the LLM call entirely. In production, this hits a 60% cache rate because users ask variations of the same questions. That single lever cut costs by more than half.
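A sketch of the cache logic, with an in-memory list standing in for Redis and a linear scan standing in for a proper vector index; this shows the interface, not the production implementation:

```python
import math

class SemanticCache:
    """Semantic cache sketch. A hit means the new query's embedding is
    within the cosine threshold of a cached query's embedding, so we can
    skip both the embedding call and the LLM call."""

    def __init__(self, threshold=0.97):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer); Redis in production

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def lookup(self, embedding):
        for cached_emb, answer in self.entries:
            if self._cosine(embedding, cached_emb) >= self.threshold:
                return answer  # cache hit: no new inference spend
        return None

    def store(self, embedding, answer):
        self.entries.append((embedding, answer))
```

The 0.97 threshold is deliberately strict: a looser threshold raises the hit rate but risks serving a cached answer to a genuinely different question.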
2. Tiered Embedding Models. As described above: small model for first-pass retrieval, large model only for reranking the top-k. The large model sees 20 documents instead of 50,000. The cost difference is three orders of magnitude.
3. Tiered Storage. Not every embedding needs to live in hot pgvector memory. I partition by query frequency: documents queried in the last 30 days stay hot in pgvector, 30-90 days go to warm storage (compressed, still queryable), and anything older gets archived to cold S3. This cut our Supabase compute tier by 40%.
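The routing rule itself is trivial; the savings come from acting on it. A sketch using the 30/90-day thresholds described above:

```python
from datetime import date

def storage_tier(last_queried: date, today: date) -> str:
    """Route a document by how recently it was queried:
    hot = pgvector memory, warm = compressed but queryable,
    cold = archived to S3."""
    age_days = (today - last_queried).days
    if age_days <= 30:
        return "hot"
    if age_days <= 90:
        return "warm"
    return "cold"
```

A nightly job re-evaluates every document's tier, so content automatically migrates down as its query traffic decays.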
4. Batch Processing Off-Peak. New content gets embedded in nightly batch jobs instead of synchronously at ingest time. Off-peak compute is cheaper, and batching lets me deduplicate and optimize chunk boundaries before they hit the index.
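A minimal version of the batch job's dedupe step, hashing normalized chunk text so identical content never gets embedded twice. The persistent hash set is an assumption; in practice it would live in a table:

```python
import hashlib

def dedupe_for_embedding(chunks, seen_hashes):
    """Return only the chunks that still need an embedding call.

    seen_hashes persists across nightly runs, so re-ingested or
    duplicated content is filtered before it costs anything."""
    to_embed = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.strip().lower().encode()).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            to_embed.append(chunk)
    return to_embed
```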
5. Token-Aware Chunking. Every chunk is sized to maximize information density within the model's context window. No half-sentences, no padding. When your context window is the most expensive resource in the pipeline, waste is a direct cost leak.
The Eval Harness
Cost reduction means nothing if quality degrades. I needed a system that would catch regressions before users did.
Golden Dataset. I built a set of 500 query-answer pairs with human-labeled relevance scores. This is the ground truth — no synthetic shortcuts for the baseline.
Automated Metrics. Every pipeline change triggers a suite that measures:
- Retrieval quality: Recall@10, MRR (Mean Reciprocal Rank), NDCG
- Generation quality: Factuality, faithfulness, and relevance scored by an LLM-as-judge against the golden set
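The retrieval metrics are standard and small enough to implement directly. A sketch, where doc IDs and graded relevance labels are hypothetical inputs:

```python
import math

def recall_at_k(retrieved, relevant, k=10):
    """Fraction of the relevant docs that appear in the top-k results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant hit (0 if none retrieved)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevance, k=10):
    """NDCG with graded relevance: discounted gain over the ideal ordering."""
    dcg = sum(relevance.get(doc, 0) / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1))
    ideal = sum(rel / math.log2(rank + 1)
                for rank, rel in enumerate(
                    sorted(relevance.values(), reverse=True)[:k], start=1))
    return dcg / ideal if ideal else 0.0
```

Averaging each metric over all 500 golden queries gives the per-run numbers the regression gate compares.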
Regression Gate. If any metric drops more than 2% from the last blessed run, the deploy is blocked. No exceptions. This is the reliability guardrail that lets me move fast on cost optimizations without shipping quality regressions.
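A sketch of the gate, assuming the 2% threshold is a relative drop against the last blessed run's metric values:

```python
def regression_gate(baseline, candidate, tolerance=0.02):
    """Return the metrics that regressed beyond tolerance.

    An empty list means the deploy may proceed; anything else blocks it."""
    failures = []
    for metric, base in baseline.items():
        new = candidate[metric]
        if base > 0 and (base - new) / base > tolerance:
            failures.append(metric)
    return failures
```

Wiring this into CI as a required check is what makes the "no exceptions" policy enforceable rather than aspirational.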
The Feedback Loop. Evals catch a regression. I diagnose whether it's a chunking issue, a retrieval issue, or a generation issue. Fix. Re-eval. Ship with confidence. This loop is what makes the system hardened — not any single component, but the discipline of measuring before and after every change.
Before and After
| Metric | Before | After | Change |
|---|---|---|---|
| Cost per query | $4.85 | $0.05 | -99% |
| Recall@10 | 72% | 94% | +22 pts |
| Latency (p50) | 2.8 s | 340 ms | -88% |
| Latency (p95) | 8.1 s | 890 ms | -89% |
| Hallucination rate | 14% | 3.2% | -77% |
| Monthly infra cost (10K queries/day) | ~$1.45M | ~$15K | -99% |
The cost numbers are the headline, but the latency and accuracy gains matter just as much. Faster responses mean higher completion rates. Higher recall means fewer "I don't know" dead ends. Lower hallucination means users actually trust the output — and that trust is what drove the 482% engagement lift I reference elsewhere.
What I Learned
Caching is the highest-leverage cost lever. I expected the model tier optimizations to dominate. They didn't. Semantic caching alone was responsible for more than half the cost reduction because real user traffic is far more repetitive than synthetic benchmarks suggest.
Hybrid search isn't optional. I started with pure vector retrieval and kept hitting edge cases — exact product codes, error numbers, proper nouns. Adding sparse retrieval via pg_trgm fixed an entire class of failures I was trying to solve with better embeddings.
Evals are infrastructure, not a nice-to-have. Every time I skipped the eval step to "move faster," I introduced a regression that took longer to debug than the eval would have taken to run. The harness paid for itself in the first week.
Supabase pgvector is underrated for this workload. I evaluated Pinecone and Weaviate. Both are excellent, but for a system where I already need Postgres for auth, RLS, and application data, adding a separate vector database introduced network hops, billing complexity, and one more service to monitor. Keeping everything in Supabase was the systems thinking move: fewer components, fewer failure modes.
If I did this again, I'd build the eval harness first, before writing a single line of retrieval code. Having ground-truth measurements from day one would have saved me two weeks of "does this feel better?" guessing.
