In this lesson you will build a production-ready version of your AI chat route with rate limiting, cost monitoring, and proper error handling. By the end, you will have a pre-launch checklist you can use for every AI feature you ship.
Here is the complete production chat route that ties together everything from the course:
```ts
// app/api/chat/route.ts
import { streamText } from 'ai'
import { checkRateLimit, recordUsage } from '@/lib/rate-limit'
import { retrieveContext } from '@/lib/rag/retrieve'
import { getUserId } from '@/lib/auth' // your auth helper from earlier lessons

export const runtime = 'edge'

export async function POST(request: Request) {
  const userId = await getUserId(request)

  // Rate limiting
  const { allowed, remaining, limit } = await checkRateLimit(userId)
  if (!allowed) {
    return Response.json(
      { error: 'Daily limit reached. Resets at midnight UTC.' },
      {
        status: 429,
        headers: {
          'X-RateLimit-Limit': String(limit),
          'X-RateLimit-Remaining': '0'
        }
      }
    )
  }

  const { messages } = await request.json()
  const latestMessage = messages[messages.length - 1].content

  // RAG retrieval
  const context = await retrieveContext(latestMessage)
  const contextText = context.map((c) => c.content).join('\n\n---\n\n')

  const startTime = Date.now()
  const result = streamText({
    model: 'openai/gpt-4o-mini',
    system: `You are a helpful assistant. Answer based on the following context. If the context does not contain the answer, say so.

Context:
${contextText}`,
    messages,
    maxTokens: 500,
    onFinish: async ({ usage }) => {
      const latencyMs = Date.now() - startTime
      await recordUsage(userId, usage.totalTokens, latencyMs)
    }
  })

  return result.toUIMessageStreamResponse()
}
```
This route combines streaming (lesson 2), RAG retrieval (lesson 4), rate limiting, cost tracking, and edge runtime --- every pattern you have learned. Let us walk through the production concerns one at a time.
Your API keys live in .env.local during development. On Vercel, they go in the dashboard:
Settings > Environment Variables
Add each key:
- `OPENAI_API_KEY`
- `ANTHROPIC_API_KEY` (if using multiple providers)
- `NEXT_PUBLIC_SUPABASE_URL`
- `NEXT_PUBLIC_SUPABASE_ANON_KEY`
- `SUPABASE_SERVICE_ROLE_KEY`

Two rules:
1. Never prefix a secret with `NEXT_PUBLIC_`. That prefix exposes the variable to the browser. Your LLM API keys must only be accessible on the server.
2. Never commit `.env.local` to version control; secrets belong on your machine and in the Vercel dashboard only.

Without rate limiting, a single user (or bot) can make hundreds of API calls in minutes and rack up a significant bill. This is the number one operational risk for AI products.
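To make that risk concrete, here is a back-of-the-envelope sketch. The request rate and token count are assumptions; the $10 per 1M output tokens matches gpt-4o-class pricing from the table later in this lesson:

```typescript
// Assumed numbers: a bot making 500 requests/hour, ~3,000 output tokens each,
// billed at $10 per 1M output tokens (gpt-4o-class pricing).
const requestsPerHour = 500
const tokensPerRequest = 3_000
const pricePerMillionTokens = 10

const costPerHour =
  ((requestsPerHour * tokensPerRequest) / 1_000_000) * pricePerMillionTokens
const costPerDay = costPerHour * 24

console.log(costPerHour) // 15 (dollars per hour)
console.log(costPerDay)  // 360 (dollars per day if nobody notices)
```

That is $360 in a single day from one unthrottled client, which is why rate limiting ships before launch, not after.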
The simplest approach: count requests per user per time window using your database.
```ts
// lib/rate-limit.ts
import { createClient } from '@supabase/supabase-js'

const supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
)

const DAILY_LIMIT = 50 // requests per user per day

export async function checkRateLimit(userId: string) {
  const today = new Date().toISOString().split('T')[0]

  const { count } = await supabase
    .from('api_usage')
    .select('*', { count: 'exact', head: true })
    .eq('user_id', userId)
    .gte('created_at', `${today}T00:00:00Z`)

  const remaining = DAILY_LIMIT - (count ?? 0)

  return {
    allowed: remaining > 0,
    remaining,
    limit: DAILY_LIMIT
  }
}

export async function recordUsage(
  userId: string,
  tokens: number,
  latencyMs: number
) {
  await supabase.from('api_usage').insert({
    user_id: userId,
    tokens_used: tokens,
    latency_ms: latencyMs,
    created_at: new Date().toISOString()
  })
}
```
Start strict. You can always increase limits. You cannot claw back money from a runaway bill.
Rate limiting caps request volume. Cost controls cap spending. They are different problems.
Set maxTokens on every LLM call. Without it, the model can generate an unbounded response. A single request with a long system prompt and no output limit can cost dollars, not cents.
```ts
const result = streamText({
  model: 'openai/gpt-4o-mini',
  messages,
  maxTokens: 500 // Hard ceiling on output tokens
})
```
Use cheaper models for non-critical paths. Not every AI call needs your best model. Classification, simple extraction, and preprocessing tasks work fine with openai/gpt-4o-mini or google/gemini-2.0-flash. Reserve the expensive models for user-facing generation where quality matters.
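One way to wire that in is a small routing helper. The task categories and the exact model assignments here are illustrative, not part of the course code:

```typescript
type TaskKind = 'classification' | 'extraction' | 'chat'

// Route cheap, well-bounded tasks to a small model and reserve the
// expensive model for user-facing generation.
function pickModel(task: TaskKind): string {
  switch (task) {
    case 'classification':
    case 'extraction':
      return 'openai/gpt-4o-mini'
    case 'chat':
      return 'openai/gpt-4o'
  }
}
```

Centralizing the choice in one function also makes it trivial to downgrade a path later when the bill surprises you.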
Set daily spend alerts. OpenAI, Anthropic, and Google all offer usage dashboards and spending limits. Set a hard cap on your provider account --- if the limit is hit, calls fail rather than billing you.
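The provider-side cap is the backstop; you can also refuse calls in your own code once the day's estimated spend crosses a budget. A minimal sketch, where the budget figure and the per-call estimate are assumptions you would tune:

```typescript
// Assumed budget: stop serving AI responses after ~$5/day.
const DAILY_BUDGET_USD = 5

// Returns false once today's estimated spend plus the next call's
// estimated cost would exceed the budget.
function withinBudget(spentTodayUsd: number, nextCallEstimateUsd = 0.01): boolean {
  return spentTodayUsd + nextCallEstimateUsd <= DAILY_BUDGET_USD
}
```

In the chat route, a check like this would sit right after the rate-limit check, returning a 429 or 503 when it fails, so a traffic spike degrades gracefully instead of billing you.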
Vercel offers two runtimes for API routes:
```ts
// Edge Runtime - fast cold starts, runs in 30+ regions
export const runtime = 'edge'

// Node.js Runtime - full Node.js APIs, runs in one region
export const runtime = 'nodejs'
```
For AI routes, the choice is straightforward: most chat routes should run on Edge, where fast cold starts and global distribution suit streaming responses. Note that Edge functions cannot use Node.js-only APIs like fs or path, so your ingestion scripts from lesson 4 need the Node.js runtime.
You cannot optimize what you do not measure. Log four things on every LLM call: the model used, input and output token counts, latency, and estimated cost. A small pricing table turns token counts into dollars:
```ts
const MODEL_PRICING: Record<string, { inputPerMillion: number; outputPerMillion: number }> = {
  'openai/gpt-4o-mini': { inputPerMillion: 0.15, outputPerMillion: 0.60 },
  'openai/gpt-4o': { inputPerMillion: 2.50, outputPerMillion: 10.00 },
  'anthropic/claude-sonnet-4-20250514': { inputPerMillion: 3.00, outputPerMillion: 15.00 },
}

function calculateCost(
  model: string,
  usage: { promptTokens: number; completionTokens: number }
) {
  const pricing = MODEL_PRICING[model]
  if (!pricing) return 0

  const inputCost = (usage.promptTokens / 1_000_000) * pricing.inputPerMillion
  const outputCost = (usage.completionTokens / 1_000_000) * pricing.outputPerMillion
  return inputCost + outputCost
}
```
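As a sanity check, a gpt-4o-mini call with 1,000 prompt tokens and 500 completion tokens works out like this:

```typescript
// gpt-4o-mini: $0.15 per 1M input tokens, $0.60 per 1M output tokens
const inputCost = (1_000 / 1_000_000) * 0.15  // ~$0.00015
const outputCost = (500 / 1_000_000) * 0.60   // ~$0.00030
const total = inputCost + outputCost          // ~$0.00045, i.e. 0.045 cents
```

Individual calls are cheap; it is the multiplication by thousands of requests that makes logging worth the effort.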
After a week of real traffic, this data tells you which users and features drive your cost, what an average request actually costs, and whether latency is acceptable.
Before you make your AI feature public, verify every item:

- API keys are server-only environment variables (no NEXT_PUBLIC_ prefix on secrets).
- Rate limiting is in place and returns a 429 with rate-limit headers when exceeded.
- maxTokens is set on every streamText and generateText call.
- A hard spending cap is set on your provider account.
- Every LLM call logs model, tokens, latency, and estimated cost.

Build a /api/usage route that queries your api_usage table and returns a summary: total requests today, total tokens, estimated cost, and remaining rate limit. Then build a simple dashboard page that displays this data. This is the minimum viable observability for any AI product --- if you cannot answer "how much did AI cost me today?" you are not ready for production.
```ts
// app/api/usage/route.ts
import { createClient } from '@supabase/supabase-js'
import { getUserId } from '@/lib/auth' // your auth helper from earlier lessons

const supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
)

export async function GET(request: Request) {
  const userId = await getUserId(request)
  const today = new Date().toISOString().split('T')[0]

  const { data } = await supabase
    .from('api_usage')
    .select('tokens_used, latency_ms')
    .eq('user_id', userId)
    .gte('created_at', `${today}T00:00:00Z`)

  const totalRequests = data?.length ?? 0
  const totalTokens = data?.reduce((sum, r) => sum + r.tokens_used, 0) ?? 0
  const avgLatency = totalRequests > 0
    ? Math.round(data!.reduce((sum, r) => sum + r.latency_ms, 0) / totalRequests)
    : 0

  return Response.json({
    today,
    totalRequests,
    totalTokens,
    estimatedCost: (totalTokens / 1_000_000) * 0.75, // blended rate estimate
    avgLatencyMs: avgLatency
  })
}
```
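If you want to unit-test the aggregation without a database, you can factor it into a pure function. This is a sketch, with UsageRow mirroring the api_usage columns the route reads:

```typescript
type UsageRow = { tokens_used: number; latency_ms: number }

// Same math as the route: request count, token sum, rounded average latency.
function summarize(rows: UsageRow[]) {
  const totalRequests = rows.length
  const totalTokens = rows.reduce((sum, r) => sum + r.tokens_used, 0)
  const avgLatencyMs = totalRequests > 0
    ? Math.round(rows.reduce((sum, r) => sum + r.latency_ms, 0) / totalRequests)
    : 0
  return { totalRequests, totalTokens, avgLatencyMs }
}
```

The route handler then shrinks to fetch-rows, call summarize, return JSON, and the tricky part (the empty-day division guard) gets covered by tests.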
You have built and deployed an AI product. You can make API calls, stream responses, get structured data, retrieve context from your documents, build multi-step agents, and ship it all to production with proper safeguards.
That is a significant milestone. You have crossed from "AI user" to "AI builder."
The Level 4 courses take everything you have built here and harden it for scale.
You have the foundation. Now go build.