Your embedding model determines the ceiling of your retrieval quality. No amount of re-ranking or prompt engineering can fix a system that embedded your documents with the wrong model. I have seen teams spend months tuning their retrieval pipeline when a simple embedding swap would have solved the problem in an afternoon.
This lesson covers how to evaluate embedding models for your specific domain, what the current landscape looks like, and when to consider fine-tuning.
An embedding model converts text into a dense numerical vector (typically 768--3072 dimensions) that captures semantic meaning. Similar texts produce vectors that are close together in this high-dimensional space.
"How do I cancel my subscription?" -> [0.23, -0.41, 0.87, ...]
"Cancel subscription process" -> [0.25, -0.39, 0.85, ...] <- close
"The weather in Miami is warm" -> [-0.71, 0.12, 0.33, ...] <- far
The quality of these vectors determines whether your retrieval system finds the right documents. A model trained primarily on web text may not understand that "EOB" means "Explanation of Benefits" in a healthcare context, or that "P&L" means "Profit and Loss" in finance.
Here is how the major embedding models compare on production-relevant dimensions:
| Model | Dimensions | Max Tokens | MTEB Score | Cost (per 1M tokens) | Best For |
|-------|------------|------------|------------|----------------------|----------|
| Voyage AI voyage-3-large | 1024 | 32,000 | Highest | ~$0.18 | Domain-specific, long docs |
| OpenAI text-embedding-3-large | 3072 | 8,191 | Strong | $0.13 | General purpose, battle-tested |
| Cohere embed-v4 | 1024 | 128,000 | Strong | $0.10 | Multilingual, long-context |
| BGE-M3 (open-source) | 1024 | 8,192 | Strong | Self-hosted | Privacy-sensitive, cost control |
| Nomic Embed v1.5 (open-source) | 768 | 8,192 | Good | Self-hosted | Budget-conscious, on-prem |
Key insight from benchmarks: Voyage AI's voyage-3-large leads on domain-specific retrieval tasks across MTEB. But benchmarks are averages --- your mileage depends on your data. A model that ranks third on public benchmarks may rank first on your domain. Always test with your own eval set.
More dimensions do not automatically mean better retrieval. OpenAI's 3072-dimension model uses 3x the storage of a 1024-dimension model. At scale, this matters:
```
1 million documents x 10 chunks each = 10M vectors

At 3072 dimensions (float32):
  10M x 3072 x 4 bytes = ~115 GB

At 1024 dimensions (float32):
  10M x 1024 x 4 bytes = ~38 GB

Storage cost difference: ~$50-100/month on managed vector DBs
```
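The arithmetic above is easy to reproduce. This sketch counts raw float32 storage only; real vector indexes (HNSW graphs, metadata) add overhead on top:

```python
def vector_storage_gib(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage in GiB (float32 by default); index overhead excluded."""
    return num_vectors * dims * bytes_per_value / 2**30

print(round(vector_storage_gib(10_000_000, 3072)))  # ~114 GiB
print(round(vector_storage_gib(10_000_000, 1024)))  # ~38 GiB
```

Swapping `bytes_per_value` to 1 also lets you estimate the savings from int8 quantization, another common storage lever.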
OpenAI's text-embedding-3 models support Matryoshka embeddings --- you can truncate to fewer dimensions (e.g., 256 or 512) with graceful quality degradation. This is a powerful cost lever.
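The API exposes this via a `dimensions` parameter on the embeddings request, but the same effect can be approximated client-side. A minimal sketch (the `truncate_embedding` helper is hypothetical, not a library function):

```python
from math import sqrt

def truncate_embedding(vec, dims):
    """Matryoshka-style reduction (hypothetical helper): keep the first
    `dims` components, then re-normalize so cosine similarity stays meaningful."""
    head = vec[:dims]
    norm = sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.5, -0.5, 0.5, 0.5]         # stand-in for a 3072-dim embedding
short = truncate_embedding(full, 2)  # keep only the first 2 dimensions
print(short)  # unit length again: [0.707..., -0.707...]
```

Re-normalizing after truncation matters: without it, cosine and dot-product scores drift and comparisons against other truncated vectors become inconsistent.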
Do not trust benchmarks. Build a domain-specific evaluation set and test yourself.
Create 50--100 query-document pairs from your actual data:
```python
eval_pairs = [
    {
        "query": "What is the return policy for electronics?",
        "relevant_doc_ids": ["doc_123", "doc_456"],
        "irrelevant_doc_ids": ["doc_789"],  # hard negatives
    },
    # ... 50-100 more pairs
]
```
Include hard negatives --- documents that look relevant but are not. "Shipping policy for electronics" is a hard negative for a query about return policy. These test whether the model understands nuance, not just topic.
```python
def evaluate_embedding_model(model, vector_db, eval_pairs, k=5):
    """Compute Recall@K and MRR over an eval set.

    `vector_db` must be populated with documents embedded by the same `model`
    being evaluated -- comparing models means re-embedding the corpus per model.
    """
    results = {"recall_at_k": [], "mrr": []}

    for pair in eval_pairs:
        query_vec = model.embed(pair["query"])
        retrieved = vector_db.search(query_vec, top_k=k)
        retrieved_ids = [r.id for r in retrieved]

        # Recall@K: what fraction of the relevant docs did we find?
        hits = len(set(retrieved_ids) & set(pair["relevant_doc_ids"]))
        results["recall_at_k"].append(hits / len(pair["relevant_doc_ids"]))

        # MRR: how high did the first relevant doc rank?
        for rank, doc_id in enumerate(retrieved_ids, 1):
            if doc_id in pair["relevant_doc_ids"]:
                results["mrr"].append(1.0 / rank)
                break
        else:  # no relevant doc in the top K
            results["mrr"].append(0.0)

    return {
        "recall@k": sum(results["recall_at_k"]) / len(results["recall_at_k"]),
        "mrr": sum(results["mrr"]) / len(results["mrr"]),
    }
```
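To make the two metrics concrete, here is a single hand-worked query with made-up document IDs: both relevant docs appear in the top 5, and the first relevant hit sits at rank 2.

```python
# Hand-worked example with made-up IDs
retrieved_ids = ["doc_789", "doc_123", "doc_555", "doc_456", "doc_999"]
relevant_ids = {"doc_123", "doc_456"}

# Recall@5: both relevant docs are in the top 5 -> 2/2 = 1.0
recall_at_5 = len(set(retrieved_ids) & relevant_ids) / len(relevant_ids)

# Reciprocal rank: first relevant doc is at rank 2 -> 1/2 = 0.5
rr = next((1.0 / rank for rank, d in enumerate(retrieved_ids, 1)
           if d in relevant_ids), 0.0)

print(recall_at_5, rr)  # 1.0 0.5
```

Note how the metrics disagree: recall says retrieval was perfect, while MRR penalizes the hard negative (`doc_789`) outranking the relevant docs. Tracking both catches ranking problems that recall alone hides.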
Run your eval set against 2--3 candidate models. A reasonable starting slate:

- OpenAI text-embedding-3-large as the battle-tested baseline
- Voyage AI voyage-3-large if your documents are long or domain-heavy
- An open-source option (BGE-M3 or Nomic Embed) if cost or privacy is a factor
A 5% recall improvement might justify a 2x cost increase if you are in a high-stakes domain (healthcare, legal, finance). For a customer support chatbot, the cheaper model that gets 90% recall may be the right business decision.
Fine-tuning an embedding model on your domain data can yield 5--15% retrieval improvement. But it adds significant engineering complexity.
Fine-tune when:

- Your domain has vocabulary that general models misread (e.g., "EOB" as Explanation of Benefits, not End of Business)
- You can assemble 10K+ labeled training pairs from your own data
- Off-the-shelf models have plateaued on your eval set

Do not fine-tune when:

- You have not yet built a domain-specific eval set to measure the lift
- A stronger off-the-shelf model would close the gap with far less effort
- Your team cannot absorb the added engineering and maintenance cost
```python
# Example: fine-tuning with sentence-transformers
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

train_examples = [
    InputExample(
        texts=["EOB denied claim", "Explanation of Benefits showing claim denial"],
        label=1.0,  # semantically equivalent in this domain
    ),
    InputExample(
        texts=["EOB denied claim", "End of Business hours schedule"],
        label=0.0,  # hard negative: same acronym, different meaning
    ),
    # ... thousands more pairs in practice
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    output_path="./fine-tuned-embeddings",
)
```
Starting a new project: Use OpenAI text-embedding-3-large. It is battle-tested, well-documented, and the Matryoshka dimension reduction gives you a cost lever for later optimization.
Hitting quality limits: Evaluate Voyage AI voyage-3-large, especially if your documents are long (it supports 32K tokens vs. OpenAI's 8K).
Cost-constrained or privacy-sensitive: Deploy BGE-M3 on your own infrastructure. Self-hosting eliminates per-token costs entirely at the expense of infrastructure management.
Multilingual requirements: Cohere embed-v4 is purpose-built for cross-lingual retrieval and supports up to 128K tokens of context.
| Scenario | Recommended Model | Why | Watch Out For |
|----------|-------------------|-----|---------------|
| General-purpose, fast start | OpenAI text-embedding-3-large | Battle-tested, Matryoshka support, good docs | 3072 dims = higher storage cost |
| Long documents (>8K tokens) | Voyage AI voyage-3-large | 32K context window, strong domain performance | Higher per-token cost |
| Multilingual corpus | Cohere embed-v4 | Built for cross-lingual, 128K context | Newer model, less community tooling |
| Privacy or on-prem requirement | BGE-M3 | Self-hosted, no data leaves your infra | You manage the infrastructure |
| Tight budget, acceptable quality | Nomic Embed v1.5 | Open-source, 768 dims = low storage | Lower MTEB scores on specialized tasks |
| Specialized domain (medical, legal) | Fine-tuned BGE or Voyage | 5-15% retrieval lift on domain data | Needs 10K+ training pairs, engineering cost |
Use this checklist to assess your embedding strategy:

- [ ] A domain-specific eval set exists (50+ query-document pairs, with hard negatives)
- [ ] Recall@K and MRR are measured for at least two candidate models
- [ ] Storage and per-token costs are estimated at your expected scale
- [ ] Dimension reduction (e.g., Matryoshka truncation) has been considered as a cost lever
- [ ] Fine-tuning is on the roadmap only if off-the-shelf models plateau on your eval set
If you have not built a domain-specific eval set, stop here and build one. No amount of model comparison is meaningful without it. Fifty query-document pairs is enough to start.
Next, in Lesson 4, Hybrid Search: Combining Dense and Sparse Retrieval, we combine dense embeddings with sparse retrieval. Dense search alone has blind spots that keyword matching covers, and vice versa; the next lesson shows how to get the best of both.