Your embedding model determines the ceiling of your retrieval quality. No amount of re-ranking or prompt engineering can fix a system that embedded your documents with the wrong model. I have seen teams spend months tuning their retrieval pipeline when a simple embedding swap would have solved the problem in an afternoon.
This lesson covers how to evaluate embedding models for your specific domain, what the current landscape looks like, and when to consider fine-tuning.
An embedding model converts text into a dense numerical vector (typically 768--3072 dimensions) that captures semantic meaning. Similar texts produce vectors that are close together in this high-dimensional space.
"How do I cancel my subscription?" -> [0.23, -0.41, 0.87, ...]
"Cancel subscription process" -> [0.25, -0.39, 0.85, ...] <- close
"The weather in Miami is warm" -> [-0.71, 0.12, 0.33, ...] <- far
The quality of these vectors determines whether your retrieval system finds the right documents. A model trained primarily on web text may not understand that "EOB" means "Explanation of Benefits" in a healthcare context, or that "P&L" means "Profit and Loss" in finance.
Here is how the major embedding models compare on production-relevant dimensions:
| Model | Dimensions | Max Tokens | MTEB Score | Cost (per 1M tokens) | Best For |
|-------|------------|------------|------------|----------------------|----------|
| Voyage AI voyage-3-large | 1024 | 32,000 | Highest | ~$0.18 | Domain-specific, long docs |
| OpenAI text-embedding-3-large | 3072 | 8,191 | Strong | $0.13 | General purpose, battle-tested |
| Cohere embed-v4 | 1024 | 128,000 | Strong | $0.10 | Multilingual, long-context |
| BGE-M3 (open-source) | 1024 | 8,192 | Strong | Self-hosted | Privacy-sensitive, cost control |
| Nomic Embed v1.5 (open-source) | 768 | 8,192 | Good | Self-hosted | Budget-conscious, on-prem |
Key insight from benchmarks: Voyage AI's voyage-3-large leads on domain-specific retrieval tasks across MTEB. But benchmarks are averages --- your mileage depends on your data. A model that ranks third on public benchmarks may rank first on your domain. Always test with your own eval set.
More dimensions do not automatically mean better retrieval. OpenAI's 3072-dimension model uses 3x the storage of a 1024-dimension model. At scale, this matters:
```
1 million documents x 10 chunks each = 10M vectors

At 3072 dimensions (float32):
  10M x 3072 x 4 bytes = ~115 GB

At 1024 dimensions (float32):
  10M x 1024 x 4 bytes = ~38 GB

Storage cost difference: ~$50-100/month on managed vector DBs
```
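The arithmetic above is easy to reproduce. This sketch counts raw float32 storage only; real vector indexes (HNSW graphs, metadata) add overhead on top:

```python
def vector_storage_gib(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage in GiB (float32 by default); index overhead excluded."""
    return num_vectors * dims * bytes_per_value / 2**30

print(round(vector_storage_gib(10_000_000, 3072)))  # ~114 GiB
print(round(vector_storage_gib(10_000_000, 1024)))  # ~38 GiB
```

Swapping `bytes_per_value` to 1 also lets you estimate the savings from int8 quantization, another common storage lever.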
OpenAI's text-embedding-3 models support Matryoshka embeddings --- you can truncate to fewer dimensions (e.g., 256 or 512) with graceful quality degradation. This is a powerful cost lever.
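The API exposes this via a `dimensions` parameter on the embeddings request, but the same effect can be approximated client-side. A minimal sketch (the `truncate_embedding` helper is hypothetical, not a library function):

```python
from math import sqrt

def truncate_embedding(vec, dims):
    """Matryoshka-style reduction (hypothetical helper): keep the first
    `dims` components, then re-normalize so cosine similarity stays meaningful."""
    head = vec[:dims]
    norm = sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.5, -0.5, 0.5, 0.5]         # stand-in for a 3072-dim embedding
short = truncate_embedding(full, 2)  # keep only the first 2 dimensions
print(short)  # unit length again: [0.707..., -0.707...]
```

Re-normalizing after truncation matters: without it, cosine and dot-product scores drift and comparisons against other truncated vectors become inconsistent.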
Do not trust benchmarks. Build a domain-specific evaluation set and test yourself.
Create 50--100 query-document pairs from your actual data:
```python
eval_pairs = [
    {
        "query": "What is the return policy for electronics?",
        "relevant_doc_ids": ["doc_123", "doc_456"],
        "irrelevant_doc_ids": ["doc_789"],  # hard negatives
    },
    # ... 50-100 more pairs
]
```
Include hard negatives --- documents that look relevant but are not. "Shipping policy for electronics" is a hard negative for a query about return policy. These test whether the model understands nuance, not just topic.
```python
def evaluate_embedding_model(model, vector_db, eval_pairs, k=5):
    """Compute Recall@K and MRR over an eval set.

    `vector_db` must be populated with documents embedded by the same `model`
    being evaluated -- comparing models means re-embedding the corpus per model.
    """
    results = {"recall_at_k": [], "mrr": []}

    for pair in eval_pairs:
        query_vec = model.embed(pair["query"])
        retrieved = vector_db.search(query_vec, top_k=k)
        retrieved_ids = [r.id for r in retrieved]

        # Recall@K: what fraction of the relevant docs did we find?
        hits = len(set(retrieved_ids) & set(pair["relevant_doc_ids"]))
        results["recall_at_k"].append(hits / len(pair["relevant_doc_ids"]))

        # MRR: how high did the first relevant doc rank?
        for rank, doc_id in enumerate(retrieved_ids, 1):
            if doc_id in pair["relevant_doc_ids"]:
                results["mrr"].append(1.0 / rank)
                break
        else:  # no relevant doc in the top K
            results["mrr"].append(0.0)

    return {
        "recall@k": sum(results["recall_at_k"]) / len(results["recall_at_k"]),
        "mrr": sum(results["mrr"]) / len(results["mrr"]),
    }
```
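To make the two metrics concrete, here is a single hand-worked query with made-up document IDs: both relevant docs appear in the top 5, and the first relevant hit sits at rank 2.

```python
# Hand-worked example with made-up IDs
retrieved_ids = ["doc_789", "doc_123", "doc_555", "doc_456", "doc_999"]
relevant_ids = {"doc_123", "doc_456"}

# Recall@5: both relevant docs are in the top 5 -> 2/2 = 1.0
recall_at_5 = len(set(retrieved_ids) & relevant_ids) / len(relevant_ids)

# Reciprocal rank: first relevant doc is at rank 2 -> 1/2 = 0.5
rr = next((1.0 / rank for rank, d in enumerate(retrieved_ids, 1)
           if d in relevant_ids), 0.0)

print(recall_at_5, rr)  # 1.0 0.5
```

Note how the metrics disagree: recall says retrieval was perfect, while MRR penalizes the hard negative (`doc_789`) outranking the relevant docs. Tracking both catches ranking problems that recall alone hides.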
Run your eval set against 2--3 candidate models. A reasonable starting slate:

- OpenAI text-embedding-3-large as the battle-tested baseline
- Voyage AI voyage-3-large if your documents are long or domain-heavy
- An open-source option (BGE-M3 or Nomic Embed) if cost or privacy is a factor
A 5% recall improvement might justify a 2x cost increase if you are in a high-stakes domain (healthcare, legal, finance). For a customer support chatbot, the cheaper model that gets 90% recall may be the right business decision.
Fine-tuning an embedding model on your domain data can yield 5--15% retrieval improvement. But it adds significant engineering complexity.
Fine-tune when:

- Your domain has vocabulary that general models misread (e.g., "EOB" as Explanation of Benefits, not End of Business)
- You can assemble 10K+ labeled training pairs from your own data
- Off-the-shelf models have plateaued on your eval set

Do not fine-tune when:

- You have not yet built a domain-specific eval set to measure the lift
- A stronger off-the-shelf model would close the gap with far less effort
- Your team cannot absorb the added engineering and maintenance cost
```python
# Example: fine-tuning with sentence-transformers
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

train_examples = [
    InputExample(
        texts=["EOB denied claim", "Explanation of Benefits showing claim denial"],
        label=1.0,  # semantically equivalent in this domain
    ),
    InputExample(
        texts=["EOB denied claim", "End of Business hours schedule"],
        label=0.0,  # hard negative: same acronym, different meaning
    ),
    # ... thousands more pairs in practice
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    output_path="./fine-tuned-embeddings",
)
```
Starting a new project: Use OpenAI text-embedding-3-large. It is battle-tested, well-documented, and the Matryoshka dimension reduction gives you a cost lever for later optimization.
Hitting quality limits: Evaluate Voyage AI voyage-3-large, especially if your documents are long (it supports 32K tokens vs. OpenAI's 8K).
Cost-constrained or privacy-sensitive: Deploy BGE-M3 on your own infrastructure. Self-hosting eliminates per-token costs entirely at the expense of infrastructure management.
Multilingual requirements: Cohere embed-v4 is purpose-built for cross-lingual retrieval and supports up to 128K tokens of context.
| Scenario | Recommended Model | Why | Watch Out For |
|----------|-------------------|-----|---------------|
| General-purpose, fast start | OpenAI text-embedding-3-large | Battle-tested, Matryoshka support, good docs | 3072 dims = higher storage cost |
| Long documents (>8K tokens) | Voyage AI voyage-3-large | 32K context window, strong domain performance | Higher per-token cost |
| Multilingual corpus | Cohere embed-v4 | Built for cross-lingual, 128K context | Newer model, less community tooling |
| Privacy or on-prem requirement | BGE-M3 | Self-hosted, no data leaves your infra | You manage the infrastructure |
| Tight budget, acceptable quality | Nomic Embed v1.5 | Open-source, 768 dims = low storage | Lower MTEB scores on specialized tasks |
| Specialized domain (medical, legal) | Fine-tuned BGE or Voyage | 5-15% retrieval lift on domain data | Needs 10K+ training pairs, engineering cost |
Use this checklist to assess your embedding strategy:

- [ ] A domain-specific eval set exists (50+ query-document pairs, with hard negatives)
- [ ] Recall@K and MRR are measured for at least two candidate models
- [ ] Storage and per-token costs are estimated at your expected scale
- [ ] Dimension reduction (e.g., Matryoshka truncation) has been considered as a cost lever
- [ ] Fine-tuning is on the roadmap only if off-the-shelf models plateau on your eval set
If you have not built a domain-specific eval set, stop here and build one. No amount of model comparison is meaningful without it. Fifty query-document pairs is enough to start.
Next, in Lesson 4, Hybrid Search: Combining Dense and Sparse Retrieval, we combine dense embeddings with sparse retrieval. Dense search alone has blind spots that keyword matching covers, and vice versa; the next lesson shows how to get the best of both.