Fine-Tuning vs RAG: The Decision Framework Nobody Talks About
"Should I fine-tune or use RAG?" I hear this question
every week. From engineers at startups, from architects
at enterprises, from CTOs trying to ship AI features. And
almost every time, the question itself is wrong.
Fine-tuning and RAG are not alternatives. They are
different tools that solve different problems. Asking
"fine-tuning or RAG?" is like asking "should I use a
database or an API?" The answer depends on what you are
actually trying to do.
I have shipped production systems using fine-tuning alone,
RAG alone, both together, and neither. The right choice
is never about technology preference. It is about what
problem you are solving, what your data looks like, and
how much you want to spend.
What Fine-Tuning Actually Does
Fine-tuning changes how the model responds. It does
not add new knowledge. Let me repeat that because it is
the most misunderstood thing in applied AI: fine-tuning
does not teach the model new facts.
What it does is modify the model's behavior. Tone,
format, reasoning patterns, domain-specific conventions.
When I fine-tuned a model for a legal tech company, the
model did not learn new case law. It learned to write
like a lawyer -- structured arguments, proper citation
format, hedged conclusions.
Think of it like hiring. You hire someone with general
intelligence, then train them on your company's style
guide. They do not suddenly know your proprietary data.
They know how to communicate the way you expect.
When fine-tuning works well
- Consistent output format: You need JSON responses
that match a specific schema every time
- Domain voice: Medical, legal, or financial tone
that prompt engineering cannot reliably produce
- Reasoning patterns: You want the model to follow
your company's decision framework, not a generic one
- Latency reduction: A fine-tuned smaller model can
match a larger model's quality at 3-5x lower latency
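To make the "consistent output format" case concrete, here is a minimal sketch of what one training example looks like in the OpenAI-style chat fine-tuning JSONL format. The contract-extraction schema is a hypothetical example: each line of the file is a full conversation demonstrating the target behavior (a rigid JSON output), not new facts.

```python
import json

# One training example in the OpenAI-style chat fine-tuning format.
# Each JSONL line is a conversation demonstrating the desired behavior
# (here: a strict JSON output schema), not a fact to memorize.
example = {
    "messages": [
        {"role": "system", "content": "Extract contract terms as JSON."},
        {"role": "user", "content": "Net 30 payment, 12-month term."},
        {"role": "assistant", "content": json.dumps(
            {"payment_terms": "net_30", "duration_months": 12}
        )},
    ]
}
print(json.dumps(example))
```

Hundreds of examples like this teach the model to always emit that schema; none of them teach it anything about your contracts.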
What fine-tuning cannot do
It cannot give the model access to your proprietary
database. It cannot make the model aware of events after
its training cutoff. It cannot reliably teach the model
specific facts -- studies show fine-tuned models still
hallucinate factual details at roughly the same rate as
base models. The behavior changes. The knowledge does not.
What RAG Actually Does
RAG gives the model access to information at query
time. It does not change how the model behaves. It
changes what the model knows when answering a specific
question.
The mechanics: a user asks a question, you search a
knowledge base for relevant context, you stuff that
context into the prompt alongside the question, and
the model generates an answer grounded in the retrieved
documents.
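The mechanics above fit in a few lines. This is a toy sketch, not a production pipeline: the bag-of-words "embedding" stands in for a real embedding model, and the final prompt would go to an LLM rather than `print`.

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words "embedding"; a real system calls an embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=2):
    # Rank documents by similarity to the query, keep the top k.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, docs):
    # Stuff retrieved context into the prompt alongside the question.
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )

docs = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "Shipping is free on orders over $50.",
]
print(build_prompt("How long do refunds take?", docs))
```

Everything interesting in real RAG systems -- chunking, embedding quality, re-ranking -- lives inside `retrieve`. The overall shape stays this simple.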
I built a RAG system for a fintech company that needed
to answer questions about their 2,000-page regulatory
compliance handbook. The model's behavior did not need
to change. It already knew how to summarize and explain.
It just needed access to the right information at the
right time.
When RAG works well
- Private or proprietary data: Company docs, internal
wikis, customer records
- Frequently updated information: Pricing, inventory,
policies that change weekly
- Audit trail requirements: You need to cite sources
and show where answers came from
- Large knowledge bases: Thousands of documents that
would never fit in a fine-tuning dataset
What RAG cannot do
It cannot change the model's personality, output format,
or reasoning style. If the base model writes like a
chatbot and you need it to write like a radiologist, RAG
will not fix that. You will get chatbot-style responses
that happen to reference radiology documents.
The Decision Matrix
Here is the framework I use. It takes about five minutes
to work through, and it has saved me from over-engineering
more times than I can count.
| What You Need | Solution | Example |
|---|---|---|
| Domain-specific behavior or format | Fine-tune | Legal briefs in court format |
| Access to current or private data | RAG | Q&A over internal docs |
| Both behavior and knowledge | Fine-tune + RAG | Medical assistant with hospital records |
| Neither (general task, public data) | Prompt engineering | Customer support chatbot |
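The matrix reduces to a tiny decision function. "Behavior" here means you need a specific format, voice, or reasoning style; "knowledge" means you need private or frequently changing data.

```python
# The decision matrix above as code.
def choose_approach(needs_behavior: bool, needs_knowledge: bool) -> str:
    if needs_behavior and needs_knowledge:
        return "fine-tune + RAG"
    if needs_behavior:
        return "fine-tune"
    if needs_knowledge:
        return "RAG"
    return "prompt engineering"

print(choose_approach(False, True))  # -> RAG
```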
The fourth row is the one people skip. Prompt engineering
handles roughly 80% of the use cases I see in production.
Before you reach for fine-tuning or RAG, spend a serious
week on prompt engineering. I mean structured prompts with
examples, chain-of-thought, output schemas. Not "you are
a helpful assistant."
If prompt engineering gets you to 85% quality and your
users are happy, ship it. You can always add fine-tuning
or RAG later when you have real usage data telling you
where quality falls short.
Cost Comparison: Real Numbers
This is where most blog posts wave their hands. Let me
give you actual numbers from 2026 pricing so you can
build a business case.
Fine-Tuning Costs
Training (one-time per model version):
- GPT-4o mini fine-tuning: $3.00 per 1M training tokens
- GPT-4o fine-tuning: $25.00 per 1M training tokens
- A typical training dataset of 500 examples at ~1,000
tokens each = 500K tokens
- Total training cost: $1.50 to $12.50 per run
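The arithmetic, spelled out (note this assumes a single training pass; providers typically bill training tokens multiplied by the number of epochs, so multiply accordingly):

```python
# Back-of-envelope training cost from the figures above (single epoch).
examples, tokens_per_example = 500, 1_000
training_tokens = examples * tokens_per_example  # 500K tokens

price_per_million = {"gpt-4o-mini": 3.00, "gpt-4o": 25.00}
for model, price in price_per_million.items():
    cost = training_tokens / 1_000_000 * price
    print(f"{model}: ${cost:.2f} per training run")
```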
That is surprisingly cheap. The hidden cost is in
dataset curation. I spent 40 hours building a quality
training set for a recent project. At engineering rates,
that is $6,000-$10,000 of labor for the data, and $5 for
the actual training.
Inference (ongoing):
- GPT-4o mini fine-tuned: $0.30 per 1M input tokens,
$1.20 per 1M output tokens
- GPT-4o fine-tuned: $3.75 per 1M input tokens,
$15.00 per 1M output tokens
Fine-tuned inference costs roughly 1.5-2x the base model's
rate, depending on the model.
But if you fine-tuned a small model to match a large
model's quality, you might save 5-10x on inference.
That math works at scale.
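To see why that math works, compare per-query cost for a base GPT-4o call against a fine-tuned GPT-4o mini call using the prices above. The 1,000-token-in / 300-token-out query shape is an assumption for illustration.

```python
# Per-query cost: base GPT-4o vs fine-tuned GPT-4o mini,
# assuming a 1,000-token input and 300-token output.
tokens_in, tokens_out = 1_000, 300

def query_cost(in_price, out_price):
    return tokens_in / 1e6 * in_price + tokens_out / 1e6 * out_price

base_4o = query_cost(2.50, 10.00)  # base GPT-4o pricing
mini_ft = query_cost(0.30, 1.20)   # fine-tuned GPT-4o mini pricing
print(f"GPT-4o: ${base_4o:.4f}  mini fine-tuned: ${mini_ft:.4f}  "
      f"ratio: {base_4o / mini_ft:.1f}x")
```

Under these assumptions the fine-tuned small model is roughly 8x cheaper per query than the large base model -- squarely in the 5-10x range, if it can match quality.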
RAG Costs
RAG has four cost components. All of them are ongoing.
Embedding generation:
- OpenAI text-embedding-3-small: $0.02 per 1M tokens
- At 10,000 queries/day with 50-token queries: ~$0.30/month
  -- effectively negligible
Vector database:
- Pinecone Starter: free up to 100K vectors
- Pinecone Standard: ~$70/month for 1M vectors
- Supabase pgvector: $25/month (shared infra)
- Self-hosted Qdrant: $50-200/month (compute costs)
Retrieval latency:
- Vector search: 20-100ms per query
- Re-ranking (if used): 50-200ms additional
- This adds up. If your base LLM response takes 500ms,
  vector search alone adds 4-20% latency overhead, and
  re-ranking can push that past 50%.
LLM inference with context:
- Retrieved chunks add 500-3,000 tokens per query
- At GPT-4o pricing ($2.50 per 1M input tokens),
that is $0.001-$0.008 per query just for the
retrieved context
- At 10,000 queries/day: $10-$80/day in added context
Total RAG cost at 10,000 queries/day: $300-$2,500/month
depending on your architecture choices. I covered how to
bring this down in my RAG cost optimization post.
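For a back-of-envelope sketch of where that monthly total comes from, here is the arithmetic using the figures above. The $70 vector DB line assumes the managed tier quoted earlier; real totals vary with chunk sizes, re-ranking, and model choice.

```python
# Rough monthly RAG cost model at 10,000 queries/day,
# using the per-component prices quoted above.
queries_per_day = 10_000
days = 30

# Embedding the queries (text-embedding-3-small, $0.02 / 1M tokens).
embed_cost = queries_per_day * 50 / 1_000_000 * 0.02 * days

# Added LLM input cost for retrieved context (GPT-4o, $2.50 / 1M input).
low = queries_per_day * 500 / 1_000_000 * 2.50 * days
high = queries_per_day * 3_000 / 1_000_000 * 2.50 * days

vector_db = 70  # managed tier from above; varies widely by vendor

for label, ctx in (("low", low), ("high", high)):
    total = embed_cost + ctx + vector_db
    print(f"{label}: ${total:,.0f}/month")
```

The retrieved-context tokens dominate: the embedding bill is pennies, the vector DB is a fixed line item, and the context you stuff into every prompt is what scales with traffic.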
The Comparison
For a system handling 10,000 queries per day:
| Component | Fine-Tune Only | RAG Only | Both |
|---|---|---|---|
| Setup cost | $5K-$15K (data + training) | $1K-$5K (pipeline) | $10K-$20K |
| Monthly infra | $0 | $100-$300 | $100-$300 |
| Monthly inference | $400-$1,200 | $600-$2,500 | $500-$1,500 |
| Maintenance | Low (retrain quarterly) | Medium (index updates) | High |
Fine-tuning has higher upfront cost but lower ongoing
cost. RAG has lower upfront cost but ongoing infra and
maintenance overhead. The hybrid is the most expensive
to build and maintain -- only use it when you genuinely
need both behavior change and knowledge access.
The Hybrid Pattern
Sometimes you actually need both. I built a hybrid
system for a healthcare company that needed a model to:
(a) write clinical notes in a specific format with
proper medical terminology, and (b) reference
patient-specific records and treatment protocols.
Fine-tuning handled the behavior: consistent note
format, appropriate medical language, structured
assessment sections. RAG handled the knowledge:
patient history, current medications, relevant
clinical guidelines.
Architecture
The flow looks like this:
```
User Query
     |
     v
[Embedding Model] --> [Vector DB Search]
     |                       |
     |               Retrieved Context
     |                       |
     v                       v
[Fine-Tuned LLM] <-- [Prompt Template]
     |
     v
Formatted Response
```
The fine-tuned model receives the retrieved context
through a prompt template. It already knows how to
format clinical notes. The RAG pipeline gives it the
specific patient data to reference. Each component
does what it is good at.
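The glue code for this flow is short. This is a sketch only: `search_vector_db` and `call_finetuned_model` are hypothetical placeholders for your retrieval layer and model client, and the template is illustrative.

```python
# Hybrid flow sketch: RAG supplies the knowledge, the fine-tuned
# model supplies the behavior.

PROMPT_TEMPLATE = (
    "Patient context:\n{context}\n\n"
    "Write a clinical note for: {query}"
)

def search_vector_db(query):
    # Placeholder: a real system embeds the query and searches an index.
    return ["Allergic to penicillin.", "On 10mg lisinopril daily."]

def call_finetuned_model(prompt):
    # Placeholder: a real system calls the fine-tuned model's API here.
    return f"[formatted clinical note based on prompt of {len(prompt)} chars]"

def answer(query):
    context = "\n".join(search_vector_db(query))
    prompt = PROMPT_TEMPLATE.format(context=context, query=query)
    return call_finetuned_model(prompt)

print(answer("follow-up visit for hypertension"))
```

Note the separation: nothing in the retrieval path knows about note formatting, and the fine-tuned model never touches the index. That is what keeps each half independently testable.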
Implementation order matters
If you are building a hybrid system, build RAG first.
Here is why:
- RAG gives you immediate value -- users can query
their data right away
- RAG usage data tells you where behavior needs to
change, which informs your fine-tuning dataset
- You can evaluate whether fine-tuning is actually
needed before investing in dataset curation
I have seen teams spend months fine-tuning a model only
to discover that a well-structured prompt template with
RAG-retrieved examples solved the same problem at
one-tenth the effort.
When to Skip Both
This is the most useful section of this post, and the
one nobody wants to hear.
Prompt engineering handles the majority of use cases.
Not 50%. Not 60%. I estimate 80% of the production AI
systems I have reviewed would work fine with thoughtful
prompt engineering alone.
Here is what "thoughtful prompt engineering" means:
- System prompts with explicit constraints: Not
"you are a helpful assistant" but a 200-line system
prompt with output format, error handling, edge cases,
and examples
- Few-shot examples: 3-5 examples of ideal
input/output pairs in the prompt
- Output schemas: Structured output with JSON
schema validation, not free-form text
- Chain-of-thought: Explicit reasoning steps for
complex queries
- Guardrails: Input validation, output filtering,
and fallback responses
A well-engineered prompt system costs $0 in
infrastructure, takes days instead of months to build,
and can be iterated on without retraining anything. The
tradeoff is higher per-query token cost (longer prompts)
and less consistency on edge cases.
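To make "thoughtful" concrete, here is a compressed sketch of that kind of prompt system: explicit constraints, a JSON output schema, and few-shot examples, all assembled at request time with no training involved. The returns-assistant scenario and schema are hypothetical.

```python
import json

# A structured prompt: explicit rules, an output schema, and
# few-shot examples -- assembled per request, nothing retrained.
SYSTEM = """You are a returns assistant for an apparel store.
Rules:
- Always reply with JSON matching:
  {"action": "refund" | "exchange" | "escalate", "reason": str}
- If the order is older than 90 days, escalate.
- Never promise a refund amount."""

FEW_SHOT = [
    {"role": "user", "content": "Shirt arrived torn, ordered last week."},
    {"role": "assistant", "content": json.dumps(
        {"action": "refund", "reason": "damaged item within return window"}
    )},
]

def build_messages(user_query):
    return [{"role": "system", "content": SYSTEM}, *FEW_SHOT,
            {"role": "user", "content": user_query}]

msgs = build_messages("Wrong size, bought 2 months ago.")
print(len(msgs), msgs[-1]["content"])
```

Iterating on this is a string edit and a redeploy, which is exactly why it should be exhausted before anyone curates a fine-tuning dataset.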
The decision checklist
Before you start building fine-tuning or RAG:
- Have you spent at least a week on prompt
engineering? Not an afternoon. A week.
- Do you have quantitative evals showing prompt
engineering fails? "It does not feel right" is not
a measurement.
- Do you have at least 200 labeled examples?
You need these for fine-tuning. If you do not have
them, you do not have enough data to fine-tune.
- Is your knowledge base larger than what fits in
a context window? Modern models accept 128K-1M
tokens. Maybe you do not need RAG at all.
- Have you calculated the unit economics? Know your
cost per query before and after, not just whether it
"works."
The Framework in Practice
I recently consulted on a system for an e-commerce
company. They wanted an AI that could:
- Answer questions about 50,000 products
- Respond in the brand's casual, witty tone
- Handle returns, sizing, and shipping queries
- Reference real-time inventory
Their initial plan was to fine-tune GPT-4o on their
brand voice and build a RAG system for product data.
Estimated build time: 3 months. Estimated monthly
cost: $4,000.
What we actually shipped:
- A detailed system prompt with brand voice examples
and 5 few-shot conversations (prompt engineering)
- RAG over their product catalog and FAQ database
(needed for 50,000 products and real-time inventory)
- No fine-tuning
Build time: 3 weeks. Monthly cost: $800. The brand
voice was consistent enough with few-shot examples
that fine-tuning was not worth the maintenance burden.
We saved two months of engineering time and $3,200
per month by asking "do we actually need this?" before
building.
The Bottom Line
The decision is not fine-tuning vs RAG. The decision is:
what problem am I solving, and what is the simplest
architecture that solves it?
Start with prompt engineering. Add RAG if you need
external knowledge. Add fine-tuning if you need
behavioral change that prompts cannot achieve. Build
the hybrid only when you have proven you need both.
The best architecture is the one you do not over-build.
Ship the simplest thing that works, measure it in
production, and add complexity only when the data tells
you to.