In this lesson you will build a production-ready version of your AI chat route with rate limiting, cost monitoring, and proper error handling. By the end, you will have a pre-launch checklist you can use for every AI feature you ship.
Here is the complete production chat route that ties together everything from the course:
```ts
// app/api/chat/route.ts
import { streamText } from 'ai'
import { checkRateLimit, recordUsage } from '@/lib/rate-limit'
import { retrieveContext } from '@/lib/rag/retrieve'
import { getUserId } from '@/lib/auth' // your auth helper from earlier lessons

export const runtime = 'edge'

export async function POST(request: Request) {
  const userId = await getUserId(request)

  // Rate limiting
  const { allowed, remaining, limit } = await checkRateLimit(userId)
  if (!allowed) {
    return Response.json(
      { error: 'Daily limit reached. Resets at midnight UTC.' },
      {
        status: 429,
        headers: {
          'X-RateLimit-Limit': String(limit),
          'X-RateLimit-Remaining': '0'
        }
      }
    )
  }

  const { messages } = await request.json()
  const latestMessage = messages[messages.length - 1].content

  // RAG retrieval
  const context = await retrieveContext(latestMessage)
  const contextText = context.map((c) => c.content).join('\n\n---\n\n')

  const startTime = Date.now()
  const result = streamText({
    model: 'openai/gpt-4o-mini',
    system: `You are a helpful assistant. Answer based on the following context. If the context does not contain the answer, say so.

Context:
${contextText}`,
    messages,
    maxTokens: 500,
    onFinish: async ({ usage }) => {
      const latencyMs = Date.now() - startTime
      await recordUsage(userId, usage.totalTokens, latencyMs)
    }
  })

  return result.toUIMessageStreamResponse()
}
```
This route combines streaming (lesson 2), RAG retrieval (lesson 4), rate limiting, cost tracking, and edge runtime --- every pattern you have learned. Let us walk through the production concerns one at a time.
Your API keys live in .env.local during development. On Vercel, they go in the dashboard:
Settings > Environment Variables
Add each key:
- `OPENAI_API_KEY`
- `ANTHROPIC_API_KEY` (if using multiple providers)
- `NEXT_PUBLIC_SUPABASE_URL`
- `NEXT_PUBLIC_SUPABASE_ANON_KEY`
- `SUPABASE_SERVICE_ROLE_KEY`

Two rules:
1. Never prefix a secret with `NEXT_PUBLIC_`. That prefix exposes the variable to the browser. Your LLM API keys must only be accessible on the server.
2. Never commit `.env.local` to version control; secrets belong on your machine and in the Vercel dashboard only.

Without rate limiting, a single user (or bot) can make hundreds of API calls in minutes and rack up a significant bill. This is the number one operational risk for AI products.
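To make that risk concrete, here is a back-of-the-envelope sketch. The request rate and token count are assumptions; the $10 per 1M output tokens matches gpt-4o-class pricing from the table later in this lesson:

```typescript
// Assumed numbers: a bot making 500 requests/hour, ~3,000 output tokens each,
// billed at $10 per 1M output tokens (gpt-4o-class pricing).
const requestsPerHour = 500
const tokensPerRequest = 3_000
const pricePerMillionTokens = 10

const costPerHour =
  ((requestsPerHour * tokensPerRequest) / 1_000_000) * pricePerMillionTokens
const costPerDay = costPerHour * 24

console.log(costPerHour) // 15 (dollars per hour)
console.log(costPerDay)  // 360 (dollars per day if nobody notices)
```

That is $360 in a single day from one unthrottled client, which is why rate limiting ships before launch, not after.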
The simplest approach: count requests per user per time window using your database.
```ts
// lib/rate-limit.ts
import { createClient } from '@supabase/supabase-js'

const supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
)

const DAILY_LIMIT = 50 // requests per user per day

export async function checkRateLimit(userId: string) {
  const today = new Date().toISOString().split('T')[0]

  const { count } = await supabase
    .from('api_usage')
    .select('*', { count: 'exact', head: true })
    .eq('user_id', userId)
    .gte('created_at', `${today}T00:00:00Z`)

  const remaining = DAILY_LIMIT - (count ?? 0)

  return {
    allowed: remaining > 0,
    remaining,
    limit: DAILY_LIMIT
  }
}

export async function recordUsage(
  userId: string,
  tokens: number,
  latencyMs: number
) {
  await supabase.from('api_usage').insert({
    user_id: userId,
    tokens_used: tokens,
    latency_ms: latencyMs,
    created_at: new Date().toISOString()
  })
}
```
Start strict. You can always increase limits. You cannot claw back money from a runaway bill.
Rate limiting caps request volume. Cost controls cap spending. They are different problems.
Set maxTokens on every LLM call. Without it, the model can generate an unbounded response. A single request with a long system prompt and no output limit can cost dollars, not cents.
```ts
const result = streamText({
  model: 'openai/gpt-4o-mini',
  messages,
  maxTokens: 500 // Hard ceiling on output tokens
})
```
Use cheaper models for non-critical paths. Not every AI call needs your best model. Classification, simple extraction, and preprocessing tasks work fine with openai/gpt-4o-mini or google/gemini-2.0-flash. Reserve the expensive models for user-facing generation where quality matters.
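One way to wire that in is a small routing helper. The task categories and the exact model assignments here are illustrative, not part of the course code:

```typescript
type TaskKind = 'classification' | 'extraction' | 'chat'

// Route cheap, well-bounded tasks to a small model and reserve the
// expensive model for user-facing generation.
function pickModel(task: TaskKind): string {
  switch (task) {
    case 'classification':
    case 'extraction':
      return 'openai/gpt-4o-mini'
    case 'chat':
      return 'openai/gpt-4o'
  }
}
```

Centralizing the choice in one function also makes it trivial to downgrade a path later when the bill surprises you.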
Set daily spend alerts. OpenAI, Anthropic, and Google all offer usage dashboards and spending limits. Set a hard cap on your provider account --- if the limit is hit, calls fail rather than billing you.
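The provider-side cap is the backstop; you can also refuse calls in your own code once the day's estimated spend crosses a budget. A minimal sketch, where the budget figure and the per-call estimate are assumptions you would tune:

```typescript
// Assumed budget: stop serving AI responses after ~$5/day.
const DAILY_BUDGET_USD = 5

// Returns false once today's estimated spend plus the next call's
// estimated cost would exceed the budget.
function withinBudget(spentTodayUsd: number, nextCallEstimateUsd = 0.01): boolean {
  return spentTodayUsd + nextCallEstimateUsd <= DAILY_BUDGET_USD
}
```

In the chat route, a check like this would sit right after the rate-limit check, returning a 429 or 503 when it fails, so a traffic spike degrades gracefully instead of billing you.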
Vercel offers two runtimes for API routes:
```ts
// Edge Runtime - fast cold starts, runs in 30+ regions
export const runtime = 'edge'

// Node.js Runtime - full Node.js APIs, runs in one region
export const runtime = 'nodejs'
```
For AI routes, the choice is straightforward: most chat routes should run on Edge, where fast cold starts and global distribution suit streaming responses. Note that Edge functions cannot use Node.js-only APIs like fs or path, so your ingestion scripts from lesson 4 need the Node.js runtime.
You cannot optimize what you do not measure. Log four things on every LLM call: the model used, input and output token counts, latency, and estimated cost. A small pricing table turns token counts into dollars:
```ts
const MODEL_PRICING: Record<string, { inputPerMillion: number; outputPerMillion: number }> = {
  'openai/gpt-4o-mini': { inputPerMillion: 0.15, outputPerMillion: 0.60 },
  'openai/gpt-4o': { inputPerMillion: 2.50, outputPerMillion: 10.00 },
  'anthropic/claude-sonnet-4-20250514': { inputPerMillion: 3.00, outputPerMillion: 15.00 },
}

function calculateCost(
  model: string,
  usage: { promptTokens: number; completionTokens: number }
) {
  const pricing = MODEL_PRICING[model]
  if (!pricing) return 0

  const inputCost = (usage.promptTokens / 1_000_000) * pricing.inputPerMillion
  const outputCost = (usage.completionTokens / 1_000_000) * pricing.outputPerMillion
  return inputCost + outputCost
}
```
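As a sanity check, a gpt-4o-mini call with 1,000 prompt tokens and 500 completion tokens works out like this:

```typescript
// gpt-4o-mini: $0.15 per 1M input tokens, $0.60 per 1M output tokens
const inputCost = (1_000 / 1_000_000) * 0.15  // ~$0.00015
const outputCost = (500 / 1_000_000) * 0.60   // ~$0.00030
const total = inputCost + outputCost          // ~$0.00045, i.e. 0.045 cents
```

Individual calls are cheap; it is the multiplication by thousands of requests that makes logging worth the effort.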
After a week of real traffic, this data tells you which users and features drive your cost, what an average request actually costs, and whether latency is acceptable.
Before you make your AI feature public, verify every item:

- API keys are server-only environment variables (no NEXT_PUBLIC_ prefix on secrets).
- Rate limiting is in place and returns a 429 with rate-limit headers when exceeded.
- maxTokens is set on every streamText and generateText call.
- A hard spending cap is set on your provider account.
- Every LLM call logs model, tokens, latency, and estimated cost.

Build a /api/usage route that queries your api_usage table and returns a summary: total requests today, total tokens, estimated cost, and remaining rate limit. Then build a simple dashboard page that displays this data. This is the minimum viable observability for any AI product --- if you cannot answer "how much did AI cost me today?" you are not ready for production.
```ts
// app/api/usage/route.ts
import { createClient } from '@supabase/supabase-js'
import { getUserId } from '@/lib/auth' // your auth helper from earlier lessons

const supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
)

export async function GET(request: Request) {
  const userId = await getUserId(request)
  const today = new Date().toISOString().split('T')[0]

  const { data } = await supabase
    .from('api_usage')
    .select('tokens_used, latency_ms')
    .eq('user_id', userId)
    .gte('created_at', `${today}T00:00:00Z`)

  const totalRequests = data?.length ?? 0
  const totalTokens = data?.reduce((sum, r) => sum + r.tokens_used, 0) ?? 0
  const avgLatency = totalRequests > 0
    ? Math.round(data!.reduce((sum, r) => sum + r.latency_ms, 0) / totalRequests)
    : 0

  return Response.json({
    today,
    totalRequests,
    totalTokens,
    estimatedCost: (totalTokens / 1_000_000) * 0.75, // blended rate estimate
    avgLatencyMs: avgLatency
  })
}
```
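If you want to unit-test the aggregation without a database, you can factor it into a pure function. This is a sketch, with UsageRow mirroring the api_usage columns the route reads:

```typescript
type UsageRow = { tokens_used: number; latency_ms: number }

// Same math as the route: request count, token sum, rounded average latency.
function summarize(rows: UsageRow[]) {
  const totalRequests = rows.length
  const totalTokens = rows.reduce((sum, r) => sum + r.tokens_used, 0)
  const avgLatencyMs = totalRequests > 0
    ? Math.round(rows.reduce((sum, r) => sum + r.latency_ms, 0) / totalRequests)
    : 0
  return { totalRequests, totalTokens, avgLatencyMs }
}
```

The route handler then shrinks to fetch-rows, call summarize, return JSON, and the tricky part (the empty-day division guard) gets covered by tests.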
You have built and deployed an AI product. You can make API calls, stream responses, get structured data, retrieve context from your documents, build multi-step agents, and ship it all to production with proper safeguards.
That is a significant milestone. You have crossed from "AI user" to "AI builder."
The Level 4 courses take everything you have built here and harden it for scale.
You have the foundation. Now go build.