This lesson builds a chat route that answers questions about your own data. When the user asks a question, your app finds the most relevant chunks from your documents, injects them into the prompt, and the model answers from your actual content instead of guessing.
Here is the finished chat route with retrieval:
// app/api/chat/route.ts
import { convertToModelMessages, streamText, type UIMessage } from 'ai'
import { retrieveContext } from '@/lib/rag/retrieve'

export async function POST(request: Request) {
  const { messages }: { messages: UIMessage[] } = await request.json()

  // Pull the text of the latest message to use as the search query
  const latestMessage = messages[messages.length - 1].parts
    .filter((part) => part.type === 'text')
    .map((part) => part.text)
    .join('')

  // Retrieve relevant chunks from your data
  const context = await retrieveContext(latestMessage)
  const contextText = context.map((c) => c.content).join('\n\n---\n\n')

  const result = streamText({
    model: 'openai/gpt-4o-mini',
    system: `You are a helpful assistant. Answer questions based on the following context. If the context does not contain the answer, say so honestly.

Context:
${contextText}`,
    messages: convertToModelMessages(messages),
    maxOutputTokens: 500
  })

  return result.toUIMessageStreamResponse()
}
That is a working RAG endpoint. The client-side useChat from lesson 2 works with it unchanged. The user asks a question, your app finds relevant documents, and the model answers from your data. Let us build the four pieces that make retrieveContext work.
1. CHUNK - Split your documents into small pieces
2. EMBED - Convert each piece into a vector (array of numbers)
3. STORE - Save the vectors in a database
4. RETRIEVE - When a user asks a question, find the most relevant pieces and stuff them into the prompt as context
That is the entire architecture.
Your documents are too long to fit in a single prompt. You need to break them into pieces that are small enough to be precisely retrievable but large enough to contain a complete thought.
// lib/rag/chunk.ts
export function chunkText(
  text: string,
  chunkSize = 500,
  overlap = 50
): string[] {
  const words = text.split(/\s+/)
  const chunks: string[] = []

  // Step forward by (chunkSize - overlap) so consecutive chunks share words
  for (let i = 0; i < words.length; i += chunkSize - overlap) {
    const chunk = words.slice(i, i + chunkSize).join(' ')
    if (chunk.trim()) chunks.push(chunk)
  }

  return chunks
}
// Usage
const document = `Your long document text here...`
const chunks = chunkText(document)
// Returns an array of ~500-word chunks with 50-word overlap
Why ~500 words? Smaller chunks are more precisely retrievable --- when the user asks a specific question, a 500-word chunk about that exact topic will match better than a 2000-word chunk that mentions it in passing. (The chunker counts words, which lands in the same ballpark as the commonly recommended ~500 tokens.)
Why overlap? Without it, a critical sentence that falls on a boundary gets split between two chunks, and neither chunk contains the complete thought. A 10-15% overlap ensures continuity.
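To see chunking and overlap concretely, here is a toy run with tiny parameters (the chunkText body is repeated inline so the snippet runs standalone):

```typescript
// Same logic as chunkText in lib/rag/chunk.ts, inlined for a standalone demo
function chunkText(text: string, chunkSize = 500, overlap = 50): string[] {
  const words = text.split(/\s+/)
  const chunks: string[] = []
  for (let i = 0; i < words.length; i += chunkSize - overlap) {
    const chunk = words.slice(i, i + chunkSize).join(' ')
    if (chunk.trim()) chunks.push(chunk)
  }
  return chunks
}

// Ten words, chunkSize 4, overlap 1: each chunk restarts one word early
const text = 'one two three four five six seven eight nine ten'
console.log(chunkText(text, 4, 1))
// → ['one two three four', 'four five six seven', 'seven eight nine ten', 'ten']
```

Each chunk begins with the last word of the previous one, so a sentence that spans a boundary appears whole in at least one chunk. Note the trailing 'ten' chunk: leftover words at the end still land in a chunk of their own.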
An embedding converts text into a vector --- an array of numbers that represents the meaning of that text. Similar meanings produce similar vectors. This is what makes retrieval possible: when the user asks a question, you embed the question and find the stored chunks with the most similar vectors.
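"Most similar" here means cosine similarity, the same measure the pgvector query later in this lesson uses. A minimal implementation makes the idea concrete (the AI SDK also exports a cosineSimilarity helper, so you rarely write this yourself):

```typescript
// Cosine similarity: dot product of the vectors divided by the product
// of their magnitudes. 1 = same direction, 0 = unrelated (orthogonal).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

console.log(cosineSimilarity([1, 0], [1, 0])) // → 1 (identical direction)
console.log(cosineSimilarity([1, 0], [0, 1])) // → 0 (orthogonal)
```

Real embeddings have 1536 dimensions instead of 2, but the arithmetic is identical.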
The AI SDK provides embed and embedMany functions so you do not need raw fetch calls:
// lib/rag/embed.ts
import { embed, embedMany } from 'ai'

export async function embedText(text: string): Promise<number[]> {
  const { embedding } = await embed({
    model: 'openai/text-embedding-3-small',
    value: text
  })
  return embedding // 1536-dimension vector
}

export async function embedChunks(chunks: string[]) {
  const { embeddings } = await embedMany({
    model: 'openai/text-embedding-3-small',
    values: chunks
  })
  return chunks.map((chunk, index) => ({
    content: chunk,
    embedding: embeddings[index],
    metadata: { chunkIndex: index }
  }))
}
embed handles a single input. embedMany handles a batch --- it is more efficient than calling embed in a loop because the SDK batches the API call.
text-embedding-3-small returns a 1536-dimension vector for each input. It costs $0.02 per million tokens --- embedding a 100-page document costs about one cent. The cost is negligible compared to the generation step.
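To make "negligible" concrete, here is a rough back-of-envelope sketch. The 4-characters-per-token ratio is a common heuristic for English, not an exact tokenizer, and the ~3,000 characters per page is an assumption:

```typescript
// Rough embedding-cost estimate for text-embedding-3-small.
const PRICE_PER_MILLION_TOKENS = 0.02 // USD
const CHARS_PER_TOKEN = 4 // heuristic, not an exact tokenizer

function estimateEmbeddingCost(charCount: number): number {
  const tokens = charCount / CHARS_PER_TOKEN
  return (tokens / 1_000_000) * PRICE_PER_MILLION_TOKENS
}

// A 100-page document at roughly 3,000 characters per page:
console.log(estimateEmbeddingCost(100 * 3000)) // → about 0.0015 (USD)
```

Even generous estimates stay well under a cent, which is why re-embedding a document after edits is usually not worth optimizing away.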
You need a database that can store vectors and search them efficiently. Supabase with the pgvector extension does this with a single SQL table.
First, enable the extension and create the table:
-- Run this in your Supabase SQL editor
create extension if not exists vector;

create table documents (
  id bigserial primary key,
  content text not null,
  embedding vector(1536) not null,
  metadata jsonb default '{}'::jsonb,
  created_at timestamptz default now()
);

-- Create an index for fast similarity search
create index on documents
  using ivfflat (embedding vector_cosine_ops)
  with (lists = 100);
Then write a function to insert chunks:
// lib/rag/store.ts
import { createClient } from '@supabase/supabase-js'

const supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
)

export async function storeChunks(
  chunks: { content: string; embedding: number[]; metadata: object }[]
) {
  const { error } = await supabase.from('documents').insert(
    chunks.map((chunk) => ({
      content: chunk.content,
      embedding: JSON.stringify(chunk.embedding),
      metadata: chunk.metadata
    }))
  )
  if (error) throw new Error(`Failed to store chunks: ${error.message}`)
}
When a user asks a question, embed the question and find the most similar chunks using cosine similarity. Create a Supabase RPC function for this:
-- Supabase SQL editor
create or replace function match_documents(
  query_embedding vector(1536),
  match_count int default 5,
  match_threshold float default 0.7
)
returns table (
  id bigint,
  content text,
  metadata jsonb,
  similarity float
)
language plpgsql
as $$
begin
  return query
  select
    documents.id,
    documents.content,
    documents.metadata,
    -- <=> is cosine distance, so 1 - distance gives similarity
    1 - (documents.embedding <=> query_embedding) as similarity
  from documents
  where 1 - (documents.embedding <=> query_embedding) > match_threshold
  order by documents.embedding <=> query_embedding
  limit match_count;
end;
$$;
Now call it from TypeScript:
// lib/rag/retrieve.ts
import { createClient } from '@supabase/supabase-js'
import { embedText } from './embed'

const supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
)

export async function retrieveContext(query: string) {
  const queryEmbedding = await embedText(query)

  const { data, error } = await supabase.rpc('match_documents', {
    query_embedding: JSON.stringify(queryEmbedding),
    match_count: 5,
    match_threshold: 0.7
  })

  if (error) throw new Error(`Retrieval failed: ${error.message}`)
  return data as { content: string; similarity: number }[]
}
You need to run the chunk-embed-store pipeline once for your documents. Here is a complete ingestion script:
// scripts/ingest.ts
import { chunkText } from '@/lib/rag/chunk'
import { embedChunks } from '@/lib/rag/embed'
import { storeChunks } from '@/lib/rag/store'
import { readFileSync } from 'fs'

async function ingest(filePath: string) {
  const text = readFileSync(filePath, 'utf-8')
  console.log(`Read ${text.length} characters from ${filePath}`)

  const chunks = chunkText(text)
  console.log(`Created ${chunks.length} chunks`)

  const embedded = await embedChunks(chunks)
  console.log(`Generated ${embedded.length} embeddings`)

  await storeChunks(embedded)
  console.log('Stored in Supabase. Done.')
}

ingest('./content/your-document.txt')
Run it with npx tsx scripts/ingest.ts. Your documents are now searchable.
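For the script and routes to run, a few environment variables need to be set. The two Supabase names below come straight from the code in this lesson; the model-provider key is shown as OPENAI_API_KEY as an assumption, since its exact name depends on how your AI SDK provider is configured:

```bash
# .env.local -- values are placeholders
NEXT_PUBLIC_SUPABASE_URL=https://your-project.supabase.co
SUPABASE_SERVICE_ROLE_KEY=your-service-role-key
OPENAI_API_KEY=your-openai-key   # name depends on your provider setup
```

The service role key bypasses row-level security, so it must only ever be used server-side, never in client code.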
Chunks too large. A 2000-word chunk embeds the average meaning of a long passage. When the user asks a specific question, the embedding match is weak because the relevant sentence is diluted by everything around it. Keep chunks around 500 words.
No overlap between chunks. A key sentence split across two chunks means neither chunk contains the full thought. Use 10-15% overlap.
Too many chunks in the prompt. Retrieving 20 chunks and stuffing them all into the system prompt wastes tokens and confuses the model. The most relevant information gets buried. Five chunks is a good default --- increase only if you measure that recall improves.
Not setting a similarity threshold. Without a minimum similarity score, you retrieve the "least irrelevant" chunks even when none are actually relevant. A threshold of 0.7 filters out noise and lets the model say "I don't have information about that" when appropriate.
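With the threshold in place, retrieval can legitimately come back empty, and the route should handle that case rather than sending an empty context block. One way to do it, sketched as a pure helper (buildContext is a hypothetical name; the input shape matches what retrieveContext returns):

```typescript
type RetrievedChunk = { content: string; similarity: number }

// Join retrieved chunks into a context block, or return null when
// nothing cleared the threshold so the route can respond accordingly.
function buildContext(chunks: RetrievedChunk[]): string | null {
  if (chunks.length === 0) return null
  return chunks.map((c) => c.content).join('\n\n---\n\n')
}

console.log(buildContext([])) // → null
console.log(
  buildContext([
    { content: 'Chunk A', similarity: 0.82 },
    { content: 'Chunk B', similarity: 0.74 }
  ])
)
// prints the two chunks joined by the '---' separator
```

In the chat route, a null result can become a system prompt that explicitly tells the model no relevant documents were found, which makes "I don't have information about that" the natural answer.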
Add a /api/ingest route that accepts a POST with a text field, runs the chunk-embed-store pipeline, and returns the number of chunks created. Then build a simple form that lets you paste a document and ingest it through the browser. This gives you a self-service way to add new content to your RAG pipeline without running scripts.
// app/api/ingest/route.ts
import { chunkText } from '@/lib/rag/chunk'
import { embedChunks } from '@/lib/rag/embed'
import { storeChunks } from '@/lib/rag/store'

export async function POST(request: Request) {
  const { text } = await request.json()
  if (typeof text !== 'string' || !text.trim()) {
    return Response.json({ error: 'text is required' }, { status: 400 })
  }

  const chunks = chunkText(text)
  const embedded = await embedChunks(chunks)
  await storeChunks(embedded)

  return Response.json({ chunksCreated: chunks.length })
}
You have a pipeline that grounds the model's answers in your data. But it is still one question, one answer. In the next lesson, you build agents that plan, decide, and act across multiple steps --- combining tool use and chained LLM calls to complete tasks, not just answer questions.