This lesson builds a chat route that answers questions about your own data. When the user asks a question, your app finds the most relevant chunks from your documents, injects them into the prompt, and the model answers from your actual content instead of guessing.
Here is the finished chat route with retrieval:
// app/api/chat/route.ts
import { convertToModelMessages, streamText, type UIMessage } from 'ai'
import { retrieveContext } from '@/lib/rag/retrieve'

export async function POST(request: Request) {
  const { messages }: { messages: UIMessage[] } = await request.json()

  // Pull the text of the latest message to use as the search query
  const latestMessage = messages[messages.length - 1].parts
    .filter((part) => part.type === 'text')
    .map((part) => part.text)
    .join('')

  // Retrieve relevant chunks from your data
  const context = await retrieveContext(latestMessage)
  const contextText = context.map((c) => c.content).join('\n\n---\n\n')

  const result = streamText({
    model: 'openai/gpt-4o-mini',
    system: `You are a helpful assistant. Answer questions based on the following context. If the context does not contain the answer, say so honestly.

Context:
${contextText}`,
    messages: convertToModelMessages(messages),
    maxOutputTokens: 500
  })

  return result.toUIMessageStreamResponse()
}
That is a working RAG endpoint. The client-side useChat from lesson 2 works with it unchanged. The user asks a question, your app finds relevant documents, and the model answers from your data. Let us build the four pieces that make retrieveContext work.
1. CHUNK - Split your documents into small pieces
2. EMBED - Convert each piece into a vector (array of numbers)
3. STORE - Save the vectors in a database
4. RETRIEVE - When a user asks a question, find the most relevant pieces and stuff them into the prompt as context
That is the entire architecture.
Your documents are too long to fit in a single prompt. You need to break them into pieces that are small enough to be precisely retrievable but large enough to contain a complete thought.
// lib/rag/chunk.ts
export function chunkText(
  text: string,
  chunkSize = 500,
  overlap = 50
): string[] {
  const words = text.split(/\s+/)
  const chunks: string[] = []

  // Step forward by (chunkSize - overlap) so consecutive chunks share words
  for (let i = 0; i < words.length; i += chunkSize - overlap) {
    const chunk = words.slice(i, i + chunkSize).join(' ')
    if (chunk.trim()) chunks.push(chunk)
  }

  return chunks
}
// Usage
const document = `Your long document text here...`
const chunks = chunkText(document)
// Returns an array of ~500-word chunks with 50-word overlap
Why ~500 words? Smaller chunks are more precisely retrievable --- when the user asks a specific question, a 500-word chunk about that exact topic will match better than a 2000-word chunk that mentions it in passing. (The chunker counts words, which lands in the same ballpark as the commonly recommended ~500 tokens.)
Why overlap? Without it, a critical sentence that falls on a boundary gets split between two chunks, and neither chunk contains the complete thought. A 10-15% overlap ensures continuity.
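To see chunking and overlap concretely, here is a toy run with tiny parameters (the chunkText body is repeated inline so the snippet runs standalone):

```typescript
// Same logic as chunkText in lib/rag/chunk.ts, inlined for a standalone demo
function chunkText(text: string, chunkSize = 500, overlap = 50): string[] {
  const words = text.split(/\s+/)
  const chunks: string[] = []
  for (let i = 0; i < words.length; i += chunkSize - overlap) {
    const chunk = words.slice(i, i + chunkSize).join(' ')
    if (chunk.trim()) chunks.push(chunk)
  }
  return chunks
}

// Ten words, chunkSize 4, overlap 1: each chunk restarts one word early
const text = 'one two three four five six seven eight nine ten'
console.log(chunkText(text, 4, 1))
// → ['one two three four', 'four five six seven', 'seven eight nine ten', 'ten']
```

Each chunk begins with the last word of the previous one, so a sentence that spans a boundary appears whole in at least one chunk. Note the trailing 'ten' chunk: leftover words at the end still land in a chunk of their own.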
An embedding converts text into a vector --- an array of numbers that represents the meaning of that text. Similar meanings produce similar vectors. This is what makes retrieval possible: when the user asks a question, you embed the question and find the stored chunks with the most similar vectors.
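"Most similar" here means cosine similarity, the same measure the pgvector query later in this lesson uses. A minimal implementation makes the idea concrete (the AI SDK also exports a cosineSimilarity helper, so you rarely write this yourself):

```typescript
// Cosine similarity: dot product of the vectors divided by the product
// of their magnitudes. 1 = same direction, 0 = unrelated (orthogonal).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

console.log(cosineSimilarity([1, 0], [1, 0])) // → 1 (identical direction)
console.log(cosineSimilarity([1, 0], [0, 1])) // → 0 (orthogonal)
```

Real embeddings have 1536 dimensions instead of 2, but the arithmetic is identical.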
The AI SDK provides embed and embedMany functions so you do not need raw fetch calls:
// lib/rag/embed.ts
import { embed, embedMany } from 'ai'

export async function embedText(text: string): Promise<number[]> {
  const { embedding } = await embed({
    model: 'openai/text-embedding-3-small',
    value: text
  })
  return embedding // 1536-dimension vector
}

export async function embedChunks(chunks: string[]) {
  const { embeddings } = await embedMany({
    model: 'openai/text-embedding-3-small',
    values: chunks
  })
  return chunks.map((chunk, index) => ({
    content: chunk,
    embedding: embeddings[index],
    metadata: { chunkIndex: index }
  }))
}
embed handles a single input. embedMany handles a batch --- it is more efficient than calling embed in a loop because the SDK batches the API call.
text-embedding-3-small returns a 1536-dimension vector for each input. It costs $0.02 per million tokens --- embedding a 100-page document costs about one cent. The cost is negligible compared to the generation step.
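To make "negligible" concrete, here is a rough back-of-envelope sketch. The 4-characters-per-token ratio is a common heuristic for English, not an exact tokenizer, and the ~3,000 characters per page is an assumption:

```typescript
// Rough embedding-cost estimate for text-embedding-3-small.
const PRICE_PER_MILLION_TOKENS = 0.02 // USD
const CHARS_PER_TOKEN = 4 // heuristic, not an exact tokenizer

function estimateEmbeddingCost(charCount: number): number {
  const tokens = charCount / CHARS_PER_TOKEN
  return (tokens / 1_000_000) * PRICE_PER_MILLION_TOKENS
}

// A 100-page document at roughly 3,000 characters per page:
console.log(estimateEmbeddingCost(100 * 3000)) // → about 0.0015 (USD)
```

Even generous estimates stay well under a cent, which is why re-embedding a document after edits is usually not worth optimizing away.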
You need a database that can store vectors and search them efficiently. Supabase with the pgvector extension does this with a single SQL table.
First, enable the extension and create the table:
-- Run this in your Supabase SQL editor
create extension if not exists vector;

create table documents (
  id bigserial primary key,
  content text not null,
  embedding vector(1536) not null,
  metadata jsonb default '{}'::jsonb,
  created_at timestamptz default now()
);

-- Create an index for fast similarity search
create index on documents
  using ivfflat (embedding vector_cosine_ops)
  with (lists = 100);
Then write a function to insert chunks:
// lib/rag/store.ts
import { createClient } from '@supabase/supabase-js'

const supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
)

export async function storeChunks(
  chunks: { content: string; embedding: number[]; metadata: object }[]
) {
  const { error } = await supabase.from('documents').insert(
    chunks.map((chunk) => ({
      content: chunk.content,
      embedding: JSON.stringify(chunk.embedding),
      metadata: chunk.metadata
    }))
  )
  if (error) throw new Error(`Failed to store chunks: ${error.message}`)
}
When a user asks a question, embed the question and find the most similar chunks using cosine similarity. Create a Supabase RPC function for this:
-- Supabase SQL editor
create or replace function match_documents(
  query_embedding vector(1536),
  match_count int default 5,
  match_threshold float default 0.7
)
returns table (
  id bigint,
  content text,
  metadata jsonb,
  similarity float
)
language plpgsql
as $$
begin
  return query
  select
    documents.id,
    documents.content,
    documents.metadata,
    -- <=> is cosine distance, so 1 - distance gives similarity
    1 - (documents.embedding <=> query_embedding) as similarity
  from documents
  where 1 - (documents.embedding <=> query_embedding) > match_threshold
  order by documents.embedding <=> query_embedding
  limit match_count;
end;
$$;
Now call it from TypeScript:
// lib/rag/retrieve.ts
import { createClient } from '@supabase/supabase-js'
import { embedText } from './embed'

const supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
)

export async function retrieveContext(query: string) {
  const queryEmbedding = await embedText(query)

  const { data, error } = await supabase.rpc('match_documents', {
    query_embedding: JSON.stringify(queryEmbedding),
    match_count: 5,
    match_threshold: 0.7
  })

  if (error) throw new Error(`Retrieval failed: ${error.message}`)
  return data as { content: string; similarity: number }[]
}
You need to run the chunk-embed-store pipeline once for your documents. Here is a complete ingestion script:
// scripts/ingest.ts
import { chunkText } from '@/lib/rag/chunk'
import { embedChunks } from '@/lib/rag/embed'
import { storeChunks } from '@/lib/rag/store'
import { readFileSync } from 'fs'

async function ingest(filePath: string) {
  const text = readFileSync(filePath, 'utf-8')
  console.log(`Read ${text.length} characters from ${filePath}`)

  const chunks = chunkText(text)
  console.log(`Created ${chunks.length} chunks`)

  const embedded = await embedChunks(chunks)
  console.log(`Generated ${embedded.length} embeddings`)

  await storeChunks(embedded)
  console.log('Stored in Supabase. Done.')
}

ingest('./content/your-document.txt')
Run it with npx tsx scripts/ingest.ts. Your documents are now searchable.
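For the script and routes to run, a few environment variables need to be set. The two Supabase names below come straight from the code in this lesson; the model-provider key is shown as OPENAI_API_KEY as an assumption, since its exact name depends on how your AI SDK provider is configured:

```bash
# .env.local -- values are placeholders
NEXT_PUBLIC_SUPABASE_URL=https://your-project.supabase.co
SUPABASE_SERVICE_ROLE_KEY=your-service-role-key
OPENAI_API_KEY=your-openai-key   # name depends on your provider setup
```

The service role key bypasses row-level security, so it must only ever be used server-side, never in client code.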
Chunks too large. A 2000-word chunk embeds the average meaning of a long passage. When the user asks a specific question, the embedding match is weak because the relevant sentence is diluted by everything around it. Keep chunks around 500 words.
No overlap between chunks. A key sentence split across two chunks means neither chunk contains the full thought. Use 10-15% overlap.
Too many chunks in the prompt. Retrieving 20 chunks and stuffing them all into the system prompt wastes tokens and confuses the model. The most relevant information gets buried. Five chunks is a good default --- increase only if you measure that recall improves.
Not setting a similarity threshold. Without a minimum similarity score, you retrieve the "least irrelevant" chunks even when none are actually relevant. A threshold of 0.7 filters out noise and lets the model say "I don't have information about that" when appropriate.
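With the threshold in place, retrieval can legitimately come back empty, and the route should handle that case rather than sending an empty context block. One way to do it, sketched as a pure helper (buildContext is a hypothetical name; the input shape matches what retrieveContext returns):

```typescript
type RetrievedChunk = { content: string; similarity: number }

// Join retrieved chunks into a context block, or return null when
// nothing cleared the threshold so the route can respond accordingly.
function buildContext(chunks: RetrievedChunk[]): string | null {
  if (chunks.length === 0) return null
  return chunks.map((c) => c.content).join('\n\n---\n\n')
}

console.log(buildContext([])) // → null
console.log(
  buildContext([
    { content: 'Chunk A', similarity: 0.82 },
    { content: 'Chunk B', similarity: 0.74 }
  ])
)
// prints the two chunks joined by the '---' separator
```

In the chat route, a null result can become a system prompt that explicitly tells the model no relevant documents were found, which makes "I don't have information about that" the natural answer.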
Add a /api/ingest route that accepts a POST with a text field, runs the chunk-embed-store pipeline, and returns the number of chunks created. Then build a simple form that lets you paste a document and ingest it through the browser. This gives you a self-service way to add new content to your RAG pipeline without running scripts.
// app/api/ingest/route.ts
import { chunkText } from '@/lib/rag/chunk'
import { embedChunks } from '@/lib/rag/embed'
import { storeChunks } from '@/lib/rag/store'

export async function POST(request: Request) {
  const { text } = await request.json()
  if (typeof text !== 'string' || !text.trim()) {
    return Response.json({ error: 'text is required' }, { status: 400 })
  }

  const chunks = chunkText(text)
  const embedded = await embedChunks(chunks)
  await storeChunks(embedded)

  return Response.json({ chunksCreated: chunks.length })
}
You have a pipeline that grounds the model's answers in your data. But it is still one question, one answer. In the next lesson, you build agents that plan, decide, and act across multiple steps --- combining tool use and chained LLM calls to complete tasks, not just answer questions.