The Vendor Off-Ramp: How I Cut $60K/Month in AI Spend Without Rewriting the Stack

The Vendor Off-Ramp: How I Cut $60K/Month in AI Spend Without Rewriting the Stack | Celestinosalim.com

The Vendor Off-Ramp: How I Cut $60K/Month in AI Spend

Vendor lock-in in AI is not just annoying. It is existential.

I have watched teams build products on top of a single model provider, ship fast, celebrate the launch, and then open the next invoice. One contract renewal, one pricing change, one rate-limit adjustment, and suddenly your margins are gone.

At Eventbrite, I saw this firsthand. External API costs were running at $15K per day before I built the caching and deduplication layer that brought them down to $40/month. On the Ads platform, improved cost visibility led to sunsetting a third-party ML ranking system that was costing $60K/month. In both cases, the fix was not switching vendors -- it was making vendor choice a routing decision instead of an architecture decision.

This is the pattern: the Vendor Off-Ramp.

The Problem: `import OpenAI` in Every File

The pattern I see most often looks like this:

import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

async function classifyTransaction(text: string) {
  const response = await client.chat.completions.create({
    model: 'gpt-4',
    messages: [
      { role: 'user', content: `Classify this transaction: ${text}` },
    ],
  });
  return response.choices[0].message.content;
}

The same provider SDK imported in every service. GPT-4 for classification tasks that a model one-tenth the cost could handle. GPT-4 for extracting structured data. GPT-4 for generating one-sentence summaries. No caching layer. No fallback provider. No routing logic. Just raw, unoptimized calls to the most expensive model available, across every code path.

Ask the team: "What happens if this provider changes pricing tomorrow? Or has a multi-hour outage?" If the answer is blank stares, you have a live grenade under your P&L.

The Three-Layer Off-Ramp Pattern

I think about vendor off-ramps in three layers. Each addresses a different dimension of vendor dependency, and each compounds the value of the others.

Layer 1: The Model Gateway

The most impactful change: put a gateway between your application code and your model providers. Instead of every service importing a vendor SDK directly, every service talks to your gateway. The gateway handles provider selection, failover, retry logic, and cost tracking.

You can use an open-source solution like LiteLLM, which gives you a unified OpenAI-compatible API across 100+ model providers. Or you can build a thin custom router if you need domain-specific routing logic.

The principle: your application code should never know which vendor is serving a request. The moment your business logic contains a provider name, you have created a dependency that will cost you money to unwind.

Layer 2: Embedding Portability

This is the one teams overlook until it is too late. If you are building RAG pipelines, your embeddings are your most valuable derived asset -- the entire knowledge base of your application, vectorized and indexed.

The mistake I see repeatedly: teams generate embeddings with one provider, store only the vectors, and throw away the source text. When they want to switch embedding providers -- because a new model offers better retrieval quality at half the cost -- they realize they cannot re-embed without re- collecting all the original data.

The fix is straightforward: always store the raw text alongside the embedding vectors. Treat embeddings as a cache that can be regenerated, not as the source of truth. When a better embedding model ships (and it will), you run a background re-indexing job and you are done.

Layer 3: Storage Abstraction

The vector database market is moving fast. Pinecone, Weaviate, Qdrant, Chroma, pgvector -- each has different strengths, pricing, and scaling characteristics. Hardcoding your application to a specific vector database is the storage equivalent of hardcoding to a specific LLM provider.

An adapter pattern that lets you swap vector backends without touching application code keeps the interface intentionally minimal: store, query, delete. Everything else is implementation detail.

These three layers together form the Vendor Off-Ramp: a set of abstractions that give you freedom to move between providers based on cost, quality, and reliability -- not based on how much code you would have to rewrite.

The Implementation

Here is what the architecture looks like in code.

The Gateway Contract

type TaskTier = 'reasoning' | 'standard' | 'classification';

interface CompletionRequest {
  task: TaskTier;
  messages: Message[];
  maxTokens?: number;
  temperature?: number;
}

interface CompletionResponse {
  content: string;
  provider: string;
  model: string;
  usage: { inputTokens: number; outputTokens: number };
  latencyMs: number;
  cost: number;
}

interface ModelGateway {
  complete(request: CompletionRequest): Promise<CompletionResponse>;
  embed(input: string | string[]): Promise<EmbeddingResult>;
}

Every service in the system talks to this interface. Not to OpenAI. Not to Anthropic. Not to Google. To the gateway.

The Routing Table

This is where the money is. Instead of sending every request to the most expensive model, you route by task complexity:

interface ModelConfig {
  provider: string;
  model: string;
  priority: number;
  costPer1kInput: number;
  costPer1kOutput: number;
}

const ROUTING_TABLE: Record<TaskTier, ModelConfig[]> = {
  reasoning: [
    {
      provider: 'anthropic',
      model: 'claude-sonnet-4-5',
      priority: 1,
      costPer1kInput: 0.003,
      costPer1kOutput: 0.015,
    },
    {
      provider: 'openai',
      model: 'gpt-4o',
      priority: 2,
      costPer1kInput: 0.01,
      costPer1kOutput: 0.03,
    },
  ],
  standard: [
    {
      provider: 'anthropic',
      model: 'claude-haiku-4-5',
      priority: 1,
      costPer1kInput: 0.001,
      costPer1kOutput: 0.005,
    },
    {
      provider: 'openai',
      model: 'gpt-4o-mini',
      priority: 2,
      costPer1kInput: 0.00015,
      costPer1kOutput: 0.0006,
    },
  ],
  classification: [
    {
      provider: 'google',
      model: 'gemini-2.0-flash',
      priority: 1,
      costPer1kInput: 0.0001,
      costPer1kOutput: 0.0004,
    },
    {
      provider: 'anthropic',
      model: 'claude-haiku-4-5',
      priority: 2,
      costPer1kInput: 0.001,
      costPer1kOutput: 0.005,
    },
  ],
};

Notice the failover chain. Every task tier has a primary and secondary provider. If Anthropic goes down, traffic automatically routes to OpenAI. If Google has a bad day, Haiku picks up the classification work. No human intervention. The system is resilient against single-vendor failure.

The Router

async function route(
  request: CompletionRequest
): Promise<CompletionResponse> {
  const candidates = ROUTING_TABLE[request.task];

  // Check semantic cache first
  const cached = await semanticCache.get(request.messages);
  if (cached) return cached;

  for (const candidate of candidates) {
    try {
      const start = performance.now();
      const response = await providers[candidate.provider].complete({
        model: candidate.model,
        messages: request.messages,
        maxTokens: request.maxTokens,
        temperature: request.temperature,
      });

      const result: CompletionResponse = {
        content: response.content,
        provider: candidate.provider,
        model: candidate.model,
        usage: response.usage,
        latencyMs: performance.now() - start,
        cost: calculateCost(response.usage, candidate),
      };

      // Cache the result for semantically similar future queries
      await semanticCache.set(request.messages, result);
      await costTracker.record(result);

      return result;
    } catch (error) {
      logger.warn(
        `Failover: ${candidate.provider}/${candidate.model} failed`,
        { error }
      );
      continue;
    }
  }

  throw new Error('All providers exhausted for task: ' + request.task);
}

Two details matter here. First, the semantic cache: before making any API call, we check if a sufficiently similar query has been answered recently. For classification tasks, this eliminates 30%+ of redundant calls. Second, the cost tracker: every response gets its actual cost recorded, giving you the observability to know exactly where the money is going.

The Embedding Abstraction

interface EmbeddingStore {
  store(
    id: string,
    text: string,
    metadata?: Record<string, unknown>
  ): Promise<void>;

  query(
    text: string,
    options?: { topK?: number; filter?: Record<string, unknown> }
  ): Promise<SearchResult[]>;

  reindex(provider: EmbeddingProvider): Promise<ReindexReport>;
}

The reindex method is the escape hatch. When a better embedding model ships (and in this market, that happens quarterly), you call reindex with the new provider, and the system re-embeds every stored document in the background. No migration project. No downtime. You just move.

When NOT to Abstract

This is not a universal pattern. There are real situations where building a vendor abstraction is premature or counterproductive.

Before product-market fit. If you are still figuring out whether customers want your product, do not spend three months building a model gateway. Ship with a single provider. Validate the business. The abstraction comes later.

When compliance requires a specific vendor. Some regulated industries mandate that data processing happens through approved vendors. In healthcare and defense, I have seen cases where the vendor lock-in is the feature -- it satisfies an audit requirement. Abstracting around it creates compliance risk.

When the abstraction tax exceeds the savings. Every layer you add introduces latency, failure modes, and cognitive overhead for your team. If your AI spend is $2K/month, a gateway is over-engineering. The break-even point, in my experience, is around $15-20K/month in AI spend. Below that, the operational cost of maintaining the abstraction outweighs the savings.

When you genuinely only use one capability. If your entire AI integration is a single summarization endpoint, a full gateway is a sledgehammer for a nail. Start with a simple provider interface and grow from there.

The judgment call is always the same: is the cost of the abstraction less than the cost of the dependency? If you are not sure, you probably do not need it yet.

The Broader Principle: Optionality

The vendor off-ramp is not really about vendors. It is about optionality.

The AI model ecosystem is moving faster than any technology market I have worked in. The best model for your use case today will not be the best model six months from now. The cheapest provider this quarter will not be the cheapest next quarter. If your architecture cannot absorb that change without a rewrite, your unit economics are at the mercy of forces you do not control.

The three questions I ask on every system I review:

What is your cost per inference, broken down by task? If you do not know this number, you cannot optimize it.
How long would it take to switch providers for your highest-volume endpoint? If the answer is "weeks" or "I don't know," you have a vendor dependency, not a vendor relationship.
Are you storing raw text alongside your embeddings? If not, your most valuable data asset is locked to whichever embedding model you chose on day one.

Building sustainable AI infrastructure means building for the ecosystem you will have in two years, not the one you have today. The vendors will change. The models will change. The pricing will change. The only question is whether your architecture is ready for it.

If you want to see this pattern running in production, talk to my AI. It runs on the exact gateway architecture described here -- model routing, failover chains, cost tracking, all of it. Or if you are looking at your own AI infrastructure costs and wondering whether there is an off-ramp, reach out.

The Vendor Off-Ramp: How I Cut $60K/Month in AI Spend Without Rewriting the Stack

Discussion

Fine-Tuning vs RAG: The Decision Framework Nobody Talks About

Why Your RAG System Is Bleeding Money (And How to Fix It)

Guardrails Are Not Optional: A Production Safety Implementation Guide