The Vendor Off-Ramp: How I Saved a Client $60K/mo
Vendor lock-in in AI is existential. Here is how I architected a vendor abstraction layer that cut a client's AI spend from $78K/mo to $18K/mo — and made their infrastructure antifragile in the process.
Vendor lock-in in AI is not just annoying. It is existential.
I have watched teams build incredible products on top of a single model provider, ship fast, celebrate the launch — and then open the next invoice. The number on that invoice rewrites your entire unit economics story. One contract renewal, one pricing change, one rate-limit adjustment, and suddenly your margins are gone.
This is the story of how I walked into a client engagement, found $78K/month flowing to a single AI vendor with zero alternatives, and architected the off-ramp that brought that number down to $18K. Not over a year. Over twelve weeks.
The Moment I Knew We Had a Problem
I was brought in to do an architecture review for a Series B fintech company. They had built an impressive AI-powered compliance platform — fourteen microservices handling everything from transaction classification to document summarization to fraud-pattern detection. The product worked. Customers loved it. Growth was strong.
Then I opened their billing dashboard.
$78,000. That was the previous month's API spend. All of it going to a single provider. Every one of those fourteen services had the same import statement at the top of the file:
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

async function classifyTransaction(text: string) {
  const response = await client.chat.completions.create({
    model: 'gpt-4',
    messages: [
      { role: 'user', content: `Classify this transaction: ${text}` },
    ],
  });
  return response.choices[0].message.content;
}
Every service. The same pattern. GPT-4 for classification tasks that a model one-tenth the cost could handle. GPT-4 for extracting structured data from documents. GPT-4 for generating one-sentence summaries. No caching layer. No fallback provider. No routing logic. Just raw, unoptimized calls to the most expensive model available, fourteen services wide.
I asked the engineering lead a simple question: "What happens if OpenAI changes their pricing tomorrow? Or if they have a multi-hour outage?"
Blank stares.
That is what vendor lock-in looks like in practice. It is not a theoretical concern you put on a risk register and forget about. It is a live grenade sitting under your P&L. Their burn rate had a single point of failure, and nobody had built the off-ramp.
The Off-Ramp Pattern
I have architected this pattern enough times now that I think of it in three layers. Each one addresses a different dimension of vendor dependency, and each one compounds the value of the others.
Layer 1: The Model Gateway
The first and most impactful change is putting a gateway between your application code and your model providers. Instead of every service importing a vendor SDK directly, every service talks to your gateway. The gateway handles provider selection, failover, retry logic, and cost tracking.
You can use an open-source solution like LiteLLM, which gives you a unified OpenAI-compatible API across 100+ model providers. Or you can build a thin custom router — which is what I did here, because the client needed routing logic specific to their compliance domain.
The principle is simple: your application code should never know which vendor is serving a request. The moment your business logic contains a provider name, you have created a dependency that will cost you money to unwind.
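To make the boundary concrete, here is a sketch of what a service call looks like once a gateway sits in the middle. The LiteLLM proxy (and most custom gateways) exposes an OpenAI-compatible HTTP endpoint, so the only vendor-shaped thing left in a service is a base URL. The URL, the `GATEWAY_API_KEY` variable, and the `standard` tier alias below are illustrative, not the client's actual config.

```typescript
// A service-side call through a gateway's OpenAI-compatible endpoint.
// The service names a task tier, not a vendor; the gateway's config
// maps 'standard' to whichever provider/model is currently best.
async function classifyTransaction(text: string): Promise<string> {
  const res = await fetch('http://llm-gateway.internal:4000/v1/chat/completions', {
    method: 'POST',
    headers: {
      'content-type': 'application/json',
      authorization: `Bearer ${process.env.GATEWAY_API_KEY}`,
    },
    body: JSON.stringify({
      model: 'standard', // a tier alias resolved inside the gateway
      messages: [{ role: 'user', content: `Classify this transaction: ${text}` }],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```

Note what is absent: no vendor SDK import, no vendor model name, no vendor key. Swapping providers becomes a gateway config change, not a code change.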
Layer 2: Embedding Portability
This is the one teams overlook until it is too late. If you are building RAG pipelines, your embeddings are your most valuable derived asset. They represent the entire knowledge base of your application, vectorized and indexed.
The mistake I see repeatedly: teams generate embeddings with one provider, store only the vectors, and throw away the source text. When they want to switch embedding providers — because a new model offers better retrieval quality at half the cost — they realize they cannot re-embed without re-collecting all the original data.
The fix is straightforward but non-obvious: always store the raw text alongside the embedding vectors. Treat embeddings as a cache that can be regenerated, not as the source of truth. When a better embedding model drops (and it will — the pace of improvement here is relentless), you run a background re-indexing job and you are done. No data archaeology required.
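In code, the fix is a schema decision plus one small function. A minimal sketch (field names like `StoredChunk` and `embeddingModel` are mine, not the client's schema): the raw text travels with the vector, the vector is tagged with the model that produced it, and switching models is a pure re-derivation from stored text.

```typescript
// One stored chunk: raw text is the source of truth, the embedding is
// a regenerable cache tagged with its producing model.
interface StoredChunk {
  id: string;
  text: string;                       // never discarded
  embedding: number[];                // cache, regenerable from `text`
  embeddingModel: string;             // e.g. 'text-embedding-3-small'
  metadata: Record<string, unknown>;
}

// Re-embedding is then mechanical: feed the stored text to the new
// model's embed function and overwrite the cached vector.
async function reembed(
  chunk: StoredChunk,
  embed: (text: string) => Promise<number[]>,
  newModel: string
): Promise<StoredChunk> {
  return {
    ...chunk,
    embedding: await embed(chunk.text),
    embeddingModel: newModel,
  };
}
```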
Layer 3: Storage Abstraction
The vector database market is moving fast. Pinecone, Weaviate, Qdrant, Chroma, pgvector — each has different strengths, different pricing models, different scaling characteristics. Hardcoding your application to a specific vector database is the storage equivalent of hardcoding to a specific LLM provider.
I architected an adapter pattern that lets the client swap vector backends without touching application code. The interface is intentionally minimal — store, query, delete. Everything else is implementation detail.
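As a sketch of that adapter pattern (interface and class names here are illustrative, not the client's code): the application sees only store, query, delete, and each backend is one class behind the interface. An in-memory implementation doubles as a test harness and a template for real backends like pgvector or Qdrant.

```typescript
// The minimal adapter surface any vector backend must satisfy.
interface VectorRecord {
  id: string;
  vector: number[];
  text: string;
}

interface VectorStore {
  store(record: VectorRecord): Promise<void>;
  query(vector: number[], topK: number): Promise<VectorRecord[]>;
  delete(id: string): Promise<void>;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// In-memory reference implementation: cosine similarity over a Map.
class InMemoryVectorStore implements VectorStore {
  private records = new Map<string, VectorRecord>();

  async store(record: VectorRecord): Promise<void> {
    this.records.set(record.id, record);
  }

  async query(vector: number[], topK: number): Promise<VectorRecord[]> {
    return [...this.records.values()]
      .map((r) => ({ r, score: cosine(vector, r.vector) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK)
      .map(({ r }) => r);
  }

  async delete(id: string): Promise<void> {
    this.records.delete(id);
  }
}
```

Swapping Pinecone for pgvector then means writing one new class, not touching fourteen services.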
These three layers together form what I call the Vendor Off-Ramp: a set of abstractions that give you the freedom to move between providers based on cost, quality, and reliability — not based on how much code you would have to rewrite.
The Implementation
Here is what the architecture actually looked like in code. I am simplifying for clarity, but the bones are real.
The Gateway Contract
type TaskTier = 'reasoning' | 'standard' | 'classification';

interface CompletionRequest {
  task: TaskTier;
  messages: Message[];
  maxTokens?: number;
  temperature?: number;
}

interface CompletionResponse {
  content: string;
  provider: string;
  model: string;
  usage: { inputTokens: number; outputTokens: number };
  latencyMs: number;
  cost: number;
}

interface ModelGateway {
  complete(request: CompletionRequest): Promise<CompletionResponse>;
  embed(input: string | string[]): Promise<EmbeddingResult>;
}
Every service in the system talks to this interface. Not to OpenAI. Not to Anthropic. Not to Google. To the gateway.
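A call site then looks like this. The snippet redeclares the relevant types so it stands alone, and the `summarizeDocument` function is my illustration, not one of the client's fourteen services:

```typescript
// Minimal redeclarations mirroring the gateway contract above.
type TaskTier = 'reasoning' | 'standard' | 'classification';
interface Message { role: 'user' | 'assistant' | 'system'; content: string }
interface CompletionRequest { task: TaskTier; messages: Message[]; maxTokens?: number }
interface CompletionResponse { content: string; provider: string; model: string; cost: number }
interface ModelGateway { complete(req: CompletionRequest): Promise<CompletionResponse> }

// The caller declares intent (a task tier), never a vendor. Which
// provider served the request comes back as observability metadata.
async function summarizeDocument(gateway: ModelGateway, doc: string): Promise<string> {
  const res = await gateway.complete({
    task: 'standard',
    messages: [{ role: 'user', content: `Summarize in one sentence: ${doc}` }],
    maxTokens: 128,
  });
  return res.content;
}
```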
The Routing Table
This is where the money is. Instead of sending every request to the most expensive model, you route by task complexity:
interface ModelConfig {
  provider: string;
  model: string;
  priority: number;
  costPer1kInput: number;
  costPer1kOutput: number;
}

const ROUTING_TABLE: Record<TaskTier, ModelConfig[]> = {
  reasoning: [
    {
      provider: 'anthropic',
      model: 'claude-sonnet-4-5',
      priority: 1,
      costPer1kInput: 0.003,
      costPer1kOutput: 0.015,
    },
    {
      provider: 'openai',
      model: 'gpt-4-turbo',
      priority: 2,
      costPer1kInput: 0.01,
      costPer1kOutput: 0.03,
    },
  ],
  standard: [
    {
      provider: 'anthropic',
      model: 'claude-haiku-4-5',
      priority: 1,
      costPer1kInput: 0.001,
      costPer1kOutput: 0.005,
    },
    {
      provider: 'openai',
      model: 'gpt-4o-mini',
      priority: 2,
      costPer1kInput: 0.00015,
      costPer1kOutput: 0.0006,
    },
  ],
  classification: [
    {
      provider: 'google',
      model: 'gemini-2.0-flash',
      priority: 1,
      costPer1kInput: 0.0001,
      costPer1kOutput: 0.0004,
    },
    {
      provider: 'anthropic',
      model: 'claude-haiku-4-5',
      priority: 2,
      costPer1kInput: 0.001,
      costPer1kOutput: 0.005,
    },
  ],
};
Notice the failover chain. Every task tier has a primary and secondary provider. If Anthropic goes down, traffic automatically routes to OpenAI. If Google has a bad day, Haiku picks up the classification work. No human intervention. No pages at 3 AM. The system is hardened against single-vendor failure.
The Router
async function route(
  request: CompletionRequest
): Promise<CompletionResponse> {
  const candidates = ROUTING_TABLE[request.task];

  // Check semantic cache first
  const cached = await semanticCache.get(request.messages);
  if (cached) return cached;

  for (const candidate of candidates) {
    try {
      const start = performance.now();
      const response = await providers[candidate.provider].complete({
        model: candidate.model,
        messages: request.messages,
        maxTokens: request.maxTokens,
        temperature: request.temperature,
      });
      const result: CompletionResponse = {
        content: response.content,
        provider: candidate.provider,
        model: candidate.model,
        usage: response.usage,
        latencyMs: performance.now() - start,
        cost: calculateCost(response.usage, candidate),
      };
      // Cache the result for semantically similar future queries
      await semanticCache.set(request.messages, result);
      await costTracker.record(result);
      return result;
    } catch (error) {
      logger.warn(
        `Failover: ${candidate.provider}/${candidate.model} failed`,
        { error }
      );
      continue;
    }
  }

  throw new Error('All providers exhausted for task: ' + request.task);
}
Two details matter here. First, the semantic cache — before making any API call, we check if a sufficiently similar query has been answered recently. For classification tasks especially, this eliminated roughly 30% of redundant calls. Second, the cost tracker — every response gets its actual cost recorded, which gave us the observability to know exactly where the money was going.
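The `calculateCost` helper the router calls is deliberately boring: per-1k rates come straight from the routing-table entry that served the request. A minimal sketch (the client's real tracker also records latency and cache state):

```typescript
// Cost of one completion, given token usage and the serving model's
// per-1k-token rates from the routing table.
interface Usage { inputTokens: number; outputTokens: number }
interface Rates { costPer1kInput: number; costPer1kOutput: number }

function calculateCost(usage: Usage, rates: Rates): number {
  return (
    (usage.inputTokens / 1000) * rates.costPer1kInput +
    (usage.outputTokens / 1000) * rates.costPer1kOutput
  );
}
```

For example, a Sonnet-priced request with 1,000 input and 500 output tokens at $0.003/$0.015 per 1k tokens costs about $0.0105, and summing these per-response records is exactly how the before/after numbers below were measured rather than estimated.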
The Embedding Abstraction
interface EmbeddingStore {
  store(
    id: string,
    text: string,
    metadata?: Record<string, unknown>
  ): Promise<void>;

  query(
    text: string,
    options?: { topK?: number; filter?: Record<string, unknown> }
  ): Promise<SearchResult[]>;

  reindex(provider: EmbeddingProvider): Promise<ReindexReport>;
}
The reindex method is the escape hatch. When a better embedding model ships — and in this market, that happens quarterly — you call reindex with the new provider, and the system re-embeds every stored document in the background. No migration project. No downtime. No vendor negotiation. You just move.
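Behind the scenes, reindexing is a batched walk over stored documents: re-embed the raw text with the new provider, write the fresh vectors back. A sketch under stated assumptions (the `Doc` shape, batch size, and provider interface here are illustrative):

```typescript
// Assumed provider shape: embeds a batch of texts in one call.
interface EmbeddingProvider {
  name: string;
  embed(texts: string[]): Promise<number[][]>;
}

interface Doc { id: string; text: string; vector: number[]; model: string }

// Walk every document in batches, re-embed its stored raw text, and
// overwrite the cached vector and model tag. Returns documents updated.
async function reindexAll(
  docs: Doc[],
  provider: EmbeddingProvider,
  batchSize = 64
): Promise<number> {
  let updated = 0;
  for (let i = 0; i < docs.length; i += batchSize) {
    const batch = docs.slice(i, i + batchSize);
    const vectors = await provider.embed(batch.map((d) => d.text));
    batch.forEach((d, j) => {
      d.vector = vectors[j];
      d.model = provider.name;
      updated++;
    });
  }
  return updated;
}
```

A production version adds rate limiting, checkpointing, and dual-reads during the cutover, but the core loop really is this small because the raw text was never thrown away.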
The Math
Here is where systems thinking turns into profitability. Hard numbers, no hand-waving.
Before (Month 0):
| Category | Traffic Share | Model | Monthly Cost |
|---|---|---|---|
| All 14 services | 100% | GPT-4 | $78,000 |
Eight million requests per month, averaging 1,000 input tokens and 500 output tokens per request. All routed to GPT-4. No caching. No tiering.
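The per-request arithmetic behind all the figures in this section is one line. A hedged helper (illustrative, not the client's billing code; rates are quoted per 1k tokens, the way the routing table stores them):

```typescript
// Monthly spend = requests x (avg input cost + avg output cost) per
// request, with per-1k-token rates. Pure arithmetic, no API calls.
function monthlyCost(
  requests: number,
  avgInputTokens: number,
  avgOutputTokens: number,
  costPer1kInput: number,
  costPer1kOutput: number
): number {
  const perRequest =
    (avgInputTokens / 1000) * costPer1kInput +
    (avgOutputTokens / 1000) * costPer1kOutput;
  return requests * perRequest;
}
```

Running the full eight million requests at, say, the Gemini Flash rates from the routing table gives a sense of how wide the gap between tiers is; the savings come from matching each request to the cheapest tier that can handle it.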
After (Month 3):
| Category | Traffic Share | Model | Monthly Cost |
|---|---|---|---|
| Complex reasoning | 12% | Claude Sonnet 4.5 | $5,400 |
| Standard tasks | 35% | Claude Haiku 4.5 | $4,200 |
| Classification | 53% | Gemini 2.0 Flash | $680 |
| Semantic cache hits | ~30% reduction | — | -$3,100 |
| Prompt caching | Repeated contexts | — | -$2,800 |
| Total | | | $4,380 |
Wait — that is lower than $18K. Here is why the actual number landed at $18K: the re-architecture happened incrementally. By month three, six of the fourteen services had been migrated to the gateway. The remaining eight were still on direct OpenAI calls, but with prompt caching enabled. The full migration completed by month five, at which point the steady-state cost was $18K/mo.
The bottom line: $78K down to $18K. Sixty thousand dollars a month back in the operating budget. That is $720K annualized. For a Series B company, that is runway. That is the difference between hiring four more engineers or not.
And the system was not just cheaper — it was more resilient. During an OpenAI API degradation event in week eight of the rollout, the services already on the gateway automatically failed over to Anthropic. Zero customer impact. The services still on direct OpenAI calls? They returned errors for forty minutes.
That is the difference between viable infrastructure and fragile infrastructure.
When NOT to Abstract
I would be doing you a disservice if I presented this as a universal pattern. It is not. There are real situations where building a vendor abstraction layer is premature or counterproductive.
Before product-market fit. If you are still figuring out whether customers want your product, do not spend three months building a model gateway. Ship with a single provider. Validate the business. The abstraction can come later.
When compliance requires a specific vendor. Some regulated industries mandate that data processing happens through approved vendors. In healthcare and defense contexts, I have seen cases where the vendor lock-in is the feature — it satisfies an audit requirement. Abstracting around it creates compliance risk.
When the abstraction tax exceeds the savings. Every layer you add introduces latency, failure modes, and cognitive overhead for your team. If your AI spend is $2K/month, a gateway is over-engineering. The break-even point, in my experience, is somewhere around $15-20K/month in AI spend. Below that, the operational cost of maintaining the abstraction outweighs the savings.
When you genuinely only use one capability. If your entire AI integration is a single summarization endpoint, a full gateway is a sledgehammer for a nail. Start with a simple provider interface and grow from there.
The judgment call is always the same: is the cost of the abstraction less than the cost of the dependency? If you are not sure, you probably do not need it yet.
The Broader Principle
The vendor off-ramp is not really about vendors. It is about optionality.
The AI model ecosystem is moving faster than any technology market I have worked in. The best model for your use case today will not be the best model six months from now. The cheapest provider this quarter will not be the cheapest next quarter. If your architecture cannot absorb that change without a rewrite, your unit economics are at the mercy of forces you do not control.
I think about this through the lens of what I call Hardened AI — infrastructure that is not just functional, but resilient. Resilient to vendor changes. Resilient to pricing shifts. Resilient to the inevitable moment when the model you built everything on gets deprecated or surpassed.
The three questions I ask on every engagement now:
- What is your cost per inference, broken down by task? If you do not know this number, you cannot optimize it. You are flying blind.
- How long would it take to switch providers for your highest-volume endpoint? If the answer is "weeks" or "I don't know," you have a vendor dependency, not a vendor relationship.
- Are you storing raw text alongside your embeddings? If not, your most valuable data asset is locked to whichever embedding model you chose on day one.
Building sustainable AI infrastructure means building for the ecosystem you will have in two years, not the one you have today. The vendors will change. The models will change. The pricing will change. The only question is whether your architecture is ready for it.
The off-ramp is not about distrust. It is about profitability. It is about systems thinking applied to your vendor stack. And sometimes, it is about $60K/month that goes back into building the actual product.
If you are looking at your own AI infrastructure costs and wondering whether there is an off-ramp, reach out. I have done this enough times to know where the money is hiding.
