Fine-Tuning vs RAG: The Decision Framework Nobody Talks About
"Should I fine-tune or use RAG?" I hear this question
every week. From engineers at startups, from architects
at enterprises, from CTOs trying to ship AI features. And
almost every time, the question itself is wrong.
Fine-tuning and RAG are not alternatives. They are
different tools that solve different problems. Asking
"fine-tuning or RAG?" is like asking "should I use a
database or an API?" The answer depends on what you are
actually trying to do.
I have shipped production systems using fine-tuning alone,
RAG alone, both together, and neither. The right choice
is never about technology preference. It is about what
problem you are solving, what your data looks like, and
how much you want to spend.
What Fine-Tuning Actually Does
Fine-tuning changes how the model responds. It does
not add new knowledge. Let me repeat that because it is
the most misunderstood thing in applied AI: fine-tuning
does not teach the model new facts.
What it does is modify the model's behavior. Tone,
format, reasoning patterns, domain-specific conventions.
When I fine-tuned a model for a legal tech company, the
model did not learn new case law. It learned to write
like a lawyer -- structured arguments, proper citation
format, hedged conclusions.
Think of it like hiring. You hire someone with general
intelligence, then train them on your company's style
guide. They do not suddenly know your proprietary data.
They know how to communicate the way you expect.
When fine-tuning works well
- Consistent output format: You need JSON responses
that match a specific schema every time
- Domain voice: Medical, legal, or financial tone
that prompt engineering cannot reliably produce
- Reasoning patterns: You want the model to follow
your company's decision framework, not a generic one
- Latency reduction: A fine-tuned smaller model can
match a larger model's quality at 3-5x lower latency
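To make the "consistent output format" case concrete, here is a minimal sketch of what one training example looks like in the OpenAI-style chat fine-tuning JSONL format. The contract-extraction schema is a hypothetical example: each line of the file is a full conversation demonstrating the target behavior (a rigid JSON output), not new facts.

```python
import json

# One training example in the OpenAI-style chat fine-tuning format.
# Each JSONL line is a conversation demonstrating the desired behavior
# (here: a strict JSON output schema), not a fact to memorize.
example = {
    "messages": [
        {"role": "system", "content": "Extract contract terms as JSON."},
        {"role": "user", "content": "Net 30 payment, 12-month term."},
        {"role": "assistant", "content": json.dumps(
            {"payment_terms": "net_30", "duration_months": 12}
        )},
    ]
}
print(json.dumps(example))
```

Hundreds of examples like this teach the model to always emit that schema; none of them teach it anything about your contracts.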
What fine-tuning cannot do
It cannot give the model access to your proprietary
database. It cannot make the model aware of events after
its training cutoff. It cannot reliably teach the model
specific facts -- studies show fine-tuned models still
hallucinate factual details at roughly the same rate as
base models. The behavior changes. The knowledge does not.
What RAG Actually Does
RAG gives the model access to information at query
time. It does not change how the model behaves. It
changes what the model knows when answering a specific
question.
The mechanics: a user asks a question, you search a
knowledge base for relevant context, you stuff that
context into the prompt alongside the question, and
the model generates an answer grounded in the retrieved
documents.
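The mechanics above fit in a few lines. This is a toy sketch, not a production pipeline: the bag-of-words "embedding" stands in for a real embedding model, and the final prompt would go to an LLM rather than `print`.

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words "embedding"; a real system calls an embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=2):
    # Rank documents by similarity to the query, keep the top k.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, docs):
    # Stuff retrieved context into the prompt alongside the question.
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )

docs = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "Shipping is free on orders over $50.",
]
print(build_prompt("How long do refunds take?", docs))
```

Everything interesting in real RAG systems -- chunking, embedding quality, re-ranking -- lives inside `retrieve`. The overall shape stays this simple.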
I built a RAG system for a fintech company that needed
to answer questions about their 2,000-page regulatory
compliance handbook. The model's behavior did not need
to change. It already knew how to summarize and explain.
It just needed access to the right information at the
right time.
When RAG works well
- Private or proprietary data: Company docs, internal
wikis, customer records
- Frequently updated information: Pricing, inventory,
policies that change weekly
- Audit trail requirements: You need to cite sources
and show where answers came from
- Large knowledge bases: Thousands of documents that
would never fit in a fine-tuning dataset
What RAG cannot do
It cannot change the model's personality, output format,
or reasoning style. If the base model writes like a
chatbot and you need it to write like a radiologist, RAG
will not fix that. You will get chatbot-style responses
that happen to reference radiology documents.
The Decision Matrix
Here is the framework I use. It takes about five minutes
to work through, and it has saved me from over-engineering
more times than I can count.
| What You Need | Solution | Example |
|---|---|---|
| Domain-specific behavior or format | Fine-tune | Legal briefs in court format |
| Access to current or private data | RAG | Q&A over internal docs |
| Both behavior and knowledge | Fine-tune + RAG | Medical assistant with hospital records |
| Neither (general task, public data) | Prompt engineering | Customer support chatbot |
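The matrix reduces to a tiny decision function. "Behavior" here means you need a specific format, voice, or reasoning style; "knowledge" means you need private or frequently changing data.

```python
# The decision matrix above as code.
def choose_approach(needs_behavior: bool, needs_knowledge: bool) -> str:
    if needs_behavior and needs_knowledge:
        return "fine-tune + RAG"
    if needs_behavior:
        return "fine-tune"
    if needs_knowledge:
        return "RAG"
    return "prompt engineering"

print(choose_approach(False, True))  # -> RAG
```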
The fourth row is the one people skip. Prompt engineering
handles roughly 80% of the use cases I see in production.
Before you reach for fine-tuning or RAG, spend a serious
week on prompt engineering. I mean structured prompts with
examples, chain-of-thought, output schemas. Not "you are
a helpful assistant."
If prompt engineering gets you to 85% quality and your
users are happy, ship it. You can always add fine-tuning
or RAG later when you have real usage data telling you
where quality falls short.
Cost Comparison: Real Numbers
This is where most blog posts wave their hands. Let me
give you actual numbers from 2026 pricing so you can
build a business case.
Fine-Tuning Costs
Training (one-time per model version):
- GPT-4o mini fine-tuning: $3.00 per 1M training tokens
- GPT-4o fine-tuning: $25.00 per 1M training tokens
- A typical training dataset of 500 examples at ~1,000
tokens each = 500K tokens
- Total training cost: $1.50 to $12.50 per run
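The arithmetic, spelled out (note this assumes a single training pass; providers typically bill training tokens multiplied by the number of epochs, so multiply accordingly):

```python
# Back-of-envelope training cost from the figures above (single epoch).
examples, tokens_per_example = 500, 1_000
training_tokens = examples * tokens_per_example  # 500K tokens

price_per_million = {"gpt-4o-mini": 3.00, "gpt-4o": 25.00}
for model, price in price_per_million.items():
    cost = training_tokens / 1_000_000 * price
    print(f"{model}: ${cost:.2f} per training run")
```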
That is surprisingly cheap. The hidden cost is in
dataset curation. I spent 40 hours building a quality
training set for a recent project. At engineering rates,
that is $6,000-$10,000 of labor for the data, and $5 for
the actual training.
Inference (ongoing):
- GPT-4o mini fine-tuned: $0.30 per 1M input tokens,
$1.20 per 1M output tokens
- GPT-4o fine-tuned: $3.75 per 1M input tokens,
$15.00 per 1M output tokens
Fine-tuned inference costs roughly 1.5-2x the base model's
rate, depending on the model.
But if you fine-tuned a small model to match a large
model's quality, you might save 5-10x on inference.
That math works at scale.
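To see why that math works, compare per-query cost for a base GPT-4o call against a fine-tuned GPT-4o mini call using the prices above. The 1,000-token-in / 300-token-out query shape is an assumption for illustration.

```python
# Per-query cost: base GPT-4o vs fine-tuned GPT-4o mini,
# assuming a 1,000-token input and 300-token output.
tokens_in, tokens_out = 1_000, 300

def query_cost(in_price, out_price):
    return tokens_in / 1e6 * in_price + tokens_out / 1e6 * out_price

base_4o = query_cost(2.50, 10.00)  # base GPT-4o pricing
mini_ft = query_cost(0.30, 1.20)   # fine-tuned GPT-4o mini pricing
print(f"GPT-4o: ${base_4o:.4f}  mini fine-tuned: ${mini_ft:.4f}  "
      f"ratio: {base_4o / mini_ft:.1f}x")
```

Under these assumptions the fine-tuned small model is roughly 8x cheaper per query than the large base model -- squarely in the 5-10x range, if it can match quality.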
RAG Costs
RAG has four cost components. All of them are ongoing.
Embedding generation:
- OpenAI text-embedding-3-small: $0.02 per 1M tokens
- At 10,000 queries/day with 50-token queries: ~$0.30/month
  -- effectively negligible
Vector database:
- Pinecone Starter: free up to 100K vectors
- Pinecone Standard: ~$70/month for 1M vectors
- Supabase pgvector: $25/month (shared infra)
- Self-hosted Qdrant: $50-200/month (compute costs)
Retrieval latency:
- Vector search: 20-100ms per query
- Re-ranking (if used): 50-200ms additional
- This adds up. If your base LLM response takes 500ms,
  vector search alone adds 4-20% latency overhead, and
  re-ranking can push that past 50%.
LLM inference with context:
- Retrieved chunks add 500-3,000 tokens per query
- At GPT-4o pricing ($2.50 per 1M input tokens),
that is $0.001-$0.008 per query just for the
retrieved context
- At 10,000 queries/day: $10-$80/day in added context
Total RAG cost at 10,000 queries/day: $300-$2,500/month
depending on your architecture choices. I covered how to
bring this down in my RAG cost optimization post.
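For a back-of-envelope sketch of where that monthly total comes from, here is the arithmetic using the figures above. The $70 vector DB line assumes the managed tier quoted earlier; real totals vary with chunk sizes, re-ranking, and model choice.

```python
# Rough monthly RAG cost model at 10,000 queries/day,
# using the per-component prices quoted above.
queries_per_day = 10_000
days = 30

# Embedding the queries (text-embedding-3-small, $0.02 / 1M tokens).
embed_cost = queries_per_day * 50 / 1_000_000 * 0.02 * days

# Added LLM input cost for retrieved context (GPT-4o, $2.50 / 1M input).
low = queries_per_day * 500 / 1_000_000 * 2.50 * days
high = queries_per_day * 3_000 / 1_000_000 * 2.50 * days

vector_db = 70  # managed tier from above; varies widely by vendor

for label, ctx in (("low", low), ("high", high)):
    total = embed_cost + ctx + vector_db
    print(f"{label}: ${total:,.0f}/month")
```

The retrieved-context tokens dominate: the embedding bill is pennies, the vector DB is a fixed line item, and the context you stuff into every prompt is what scales with traffic.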
The Comparison
For a system handling 10,000 queries per day:
| Component | Fine-Tune Only | RAG Only | Both |
|---|---|---|---|
| Setup cost | $5K-$15K (data + training) | $1K-$5K (pipeline) | $10K-$20K |
| Monthly infra | $0 | $100-$300 | $100-$300 |
| Monthly inference | $400-$1,200 | $600-$2,500 | $500-$1,500 |
| Maintenance | Low (retrain quarterly) | Medium (index updates) | High |
Fine-tuning has higher upfront cost but lower ongoing
cost. RAG has lower upfront cost but ongoing infra and
maintenance overhead. The hybrid is the most expensive
to build and maintain -- only use it when you genuinely
need both behavior change and knowledge access.
The Hybrid Pattern
Sometimes you actually need both. I built a hybrid
system for a healthcare company that needed a model to:
(a) write clinical notes in a specific format with
proper medical terminology, and (b) reference
patient-specific records and treatment protocols.
Fine-tuning handled the behavior: consistent note
format, appropriate medical language, structured
assessment sections. RAG handled the knowledge:
patient history, current medications, relevant
clinical guidelines.
Architecture
The flow looks like this:
```
User Query
     |
     v
[Embedding Model] --> [Vector DB Search]
     |                       |
     |               Retrieved Context
     |                       |
     v                       v
[Fine-Tuned LLM] <-- [Prompt Template]
     |
     v
Formatted Response
```
The fine-tuned model receives the retrieved context
through a prompt template. It already knows how to
format clinical notes. The RAG pipeline gives it the
specific patient data to reference. Each component
does what it is good at.
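The glue code for this flow is short. This is a sketch only: `search_vector_db` and `call_finetuned_model` are hypothetical placeholders for your retrieval layer and model client, and the template is illustrative.

```python
# Hybrid flow sketch: RAG supplies the knowledge, the fine-tuned
# model supplies the behavior.

PROMPT_TEMPLATE = (
    "Patient context:\n{context}\n\n"
    "Write a clinical note for: {query}"
)

def search_vector_db(query):
    # Placeholder: a real system embeds the query and searches an index.
    return ["Allergic to penicillin.", "On 10mg lisinopril daily."]

def call_finetuned_model(prompt):
    # Placeholder: a real system calls the fine-tuned model's API here.
    return f"[formatted clinical note based on prompt of {len(prompt)} chars]"

def answer(query):
    context = "\n".join(search_vector_db(query))
    prompt = PROMPT_TEMPLATE.format(context=context, query=query)
    return call_finetuned_model(prompt)

print(answer("follow-up visit for hypertension"))
```

Note the separation: nothing in the retrieval path knows about note formatting, and the fine-tuned model never touches the index. That is what keeps each half independently testable.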
Implementation order matters
If you are building a hybrid system, build RAG first.
Here is why:
- RAG gives you immediate value -- users can query
their data right away
- RAG usage data tells you where behavior needs to
change, which informs your fine-tuning dataset
- You can evaluate whether fine-tuning is actually
needed before investing in dataset curation
I have seen teams spend months fine-tuning a model only
to discover that a well-structured prompt template with
RAG-retrieved examples solved the same problem at
one-tenth the effort.
When to Skip Both
This is the most useful section of this post, and the
one nobody wants to hear.
Prompt engineering handles the majority of use cases.
Not 50%. Not 60%. I estimate 80% of the production AI
systems I have reviewed would work fine with thoughtful
prompt engineering alone.
Here is what "thoughtful prompt engineering" means:
- System prompts with explicit constraints: Not
"you are a helpful assistant" but a 200-line system
prompt with output format, error handling, edge cases,
and examples
- Few-shot examples: 3-5 examples of ideal
input/output pairs in the prompt
- Output schemas: Structured output with JSON
schema validation, not free-form text
- Chain-of-thought: Explicit reasoning steps for
complex queries
- Guardrails: Input validation, output filtering,
and fallback responses
A well-engineered prompt system costs $0 in
infrastructure, takes days instead of months to build,
and can be iterated on without retraining anything. The
tradeoff is higher per-query token cost (longer prompts)
and less consistency on edge cases.
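To make "thoughtful" concrete, here is a compressed sketch of that kind of prompt system: explicit constraints, a JSON output schema, and few-shot examples, all assembled at request time with no training involved. The returns-assistant scenario and schema are hypothetical.

```python
import json

# A structured prompt: explicit rules, an output schema, and
# few-shot examples -- assembled per request, nothing retrained.
SYSTEM = """You are a returns assistant for an apparel store.
Rules:
- Always reply with JSON matching:
  {"action": "refund" | "exchange" | "escalate", "reason": str}
- If the order is older than 90 days, escalate.
- Never promise a refund amount."""

FEW_SHOT = [
    {"role": "user", "content": "Shirt arrived torn, ordered last week."},
    {"role": "assistant", "content": json.dumps(
        {"action": "refund", "reason": "damaged item within return window"}
    )},
]

def build_messages(user_query):
    return [{"role": "system", "content": SYSTEM}, *FEW_SHOT,
            {"role": "user", "content": user_query}]

msgs = build_messages("Wrong size, bought 2 months ago.")
print(len(msgs), msgs[-1]["content"])
```

Iterating on this is a string edit and a redeploy, which is exactly why it should be exhausted before anyone curates a fine-tuning dataset.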
The decision checklist
Before you start building fine-tuning or RAG:
- Have you spent at least a week on prompt
engineering? Not an afternoon. A week.
- Do you have quantitative evals showing prompt
engineering fails? "It does not feel right" is not
a measurement.
- Do you have at least 200 labeled examples?
You need these for fine-tuning. If you do not have
them, you do not have enough data to fine-tune.
- Is your knowledge base larger than what fits in
a context window? Modern models accept 128K-1M
tokens. Maybe you do not need RAG at all.
- Have you calculated the unit economics? Know your
cost per query before and after, not just whether it
"works."
The Framework in Practice
I recently consulted on a system for an e-commerce
company. They wanted an AI that could:
- Answer questions about 50,000 products
- Respond in the brand's casual, witty tone
- Handle returns, sizing, and shipping queries
- Reference real-time inventory
Their initial plan was to fine-tune GPT-4o on their
brand voice and build a RAG system for product data.
Estimated build time: 3 months. Estimated monthly
cost: $4,000.
What we actually shipped:
- A detailed system prompt with brand voice examples
and 5 few-shot conversations (prompt engineering)
- RAG over their product catalog and FAQ database
(needed for 50,000 products and real-time inventory)
- No fine-tuning
Build time: 3 weeks. Monthly cost: $800. The brand
voice was consistent enough with few-shot examples
that fine-tuning was not worth the maintenance burden.
We saved two months of engineering time and $3,200
per month by asking "do we actually need this?" before
building.
The Bottom Line
The decision is not fine-tuning vs RAG. The decision is:
what problem am I solving, and what is the simplest
architecture that solves it?
Start with prompt engineering. Add RAG if you need
external knowledge. Add fine-tuning if you need
behavioral change that prompts cannot achieve. Build
the hybrid only when you have proven you need both.
The best architecture is the one you do not over-build.
Ship the simplest thing that works, measure it in
production, and add complexity only when the data tells
you to.