celestino.ai — A Voice Agent That Speaks for Me
I replaced my static portfolio with a production voice agent. LiveKit for WebRTC audio, RAG for factual grounding, and tiered rate limiting to keep the unit economics viable. This is the primary CTA across my entire brand.
Most portfolios are PDFs. A recruiter skims one for six seconds, forms an opinion, and moves on. A hiring manager might spend two minutes. Neither of them gets the full picture, and I have no way to respond to their specific questions in the moment.
I wanted something fundamentally different: an AI agent that can hold a real conversation about my work. Not a chatbot with canned responses. A voice-first agent grounded on my actual experience, deployed at a URL anyone can visit, running 24/7 in production. That agent is celestino.ai, and it is the primary CTA ("Talk to my AI") across my entire brand.
Building it forced me to solve the same problems I advise clients on: latency budgets, RAG grounding, unit economics, and reliability under real traffic. This case study walks through the engineering decisions and why I made them.
Architecture — Two Pipelines, One Agent
The system serves two interaction modes from a single codebase: voice and text chat. Both share the same RAG retrieval layer, system prompt, session management, and Supabase backend. The difference is the I/O pipeline.
Voice pipeline:
Browser Mic -> WebRTC -> LiveKit Room
-> ElevenLabs Scribe v2 (STT)
-> Gemini 2.5 Flash (LLM)
-> ElevenLabs Flash v2.5 (TTS)
-> WebRTC -> Browser Speaker
Chat pipeline:
Browser Input -> POST /api/chat
-> RAG Retrieval (Supabase pgvector)
-> Gemini 2.5 Flash (AI SDK streamText)
-> SSE Stream -> Browser UI
I chose LiveKit Agents over a raw WebSocket approach because LiveKit handles the hard parts of real-time audio: room management, participant lifecycle, track subscriptions, and data channels for side-band messaging. The agent runs as a separate Node.js process that connects to a LiveKit room alongside the browser participant -- it can crash and restart without killing the user's session.
Both pipelines route to Gemini 2.5 Flash. I built a selectModel() router that can switch providers based on input mode, intent, and message length. The router exists so I can shift traffic to Anthropic or OpenAI without changing application code -- a vendor off-ramp by design.
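A minimal sketch of what that router could look like. The provider names, model IDs, and routing conditions here are illustrative assumptions, not the production values:

```typescript
// Hypothetical sketch of a provider-agnostic model router.
// Model IDs and thresholds are illustrative, not the real config.
type InputMode = "voice" | "chat";

interface RouteInput {
  mode: InputMode;
  intent?: "smalltalk" | "deep_dive";
  messageLength: number;
}

interface ModelChoice {
  provider: "google" | "anthropic" | "openai";
  model: string;
}

function selectModel(input: RouteInput): ModelChoice {
  // Voice needs the lowest latency; keep it on the fastest model.
  if (input.mode === "voice") {
    return { provider: "google", model: "gemini-2.5-flash" };
  }
  // Long analytical chat turns could be shifted to another vendor
  // without touching application code -- the off-ramp by design.
  if (input.intent === "deep_dive" && input.messageLength > 2000) {
    return { provider: "anthropic", model: "claude-sonnet" };
  }
  return { provider: "google", model: "gemini-2.5-flash" };
}
```

Because callers only see a `ModelChoice`, swapping a provider is a one-line change in the router rather than a hunt through the codebase.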
The Latency Budget
Voice interaction has a hard constraint that text chat does not: the user is waiting in silence. Anything above one second feels like the agent is broken. Here is where every millisecond goes:
- WebRTC direct connection eliminates the round-trip penalty of a WebSocket relay. Audio flows peer-to-peer between the browser and LiveKit's edge infrastructure.
- Edge token exchange via a Next.js API route (/api/token) generates a LiveKit access token at the edge, not a cold-started serverless function.
- Silero VAD (Voice Activity Detection) runs locally to detect speech boundaries without a server round-trip.
- Multilingual turn detection provides smarter endpointing than raw VAD silence thresholds -- it distinguishes conversational pauses from mid-sentence hesitation.
- ElevenLabs Flash v2.5 streams audio chunks as they are generated. The user hears the first word within ~300ms of the LLM producing text.
- Preemptive generation (preemptiveGeneration: true) starts producing a response before the endpointing model confirms the user has finished. If the user continues, the draft is discarded.
- Barge-in support with a minInterruptionDuration of 800ms and minInterruptionWords of 2. If the user talks over the agent, the agent stops and listens.
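The last two items above boil down to a small options object. This is a sketch assuming the option names from the text; the surrounding LiveKit Agents session setup is omitted:

```typescript
// Sketch of the voice session tuning described above. Option names
// follow the text; the real session construction is omitted.
const voiceSessionOptions = {
  // Start drafting a reply before endpointing confirms end of turn;
  // the draft is thrown away if the user keeps talking.
  preemptiveGeneration: true,
  // Barge-in: only treat user speech as an interruption if it lasts
  // at least 800ms and contains at least 2 words, so a cough or an
  // "mm-hm" does not cut the agent off mid-sentence.
  minInterruptionDuration: 800, // milliseconds
  minInterruptionWords: 2,
};
```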
RAG Grounding — Making the Agent Factual
The agent needs to speak accurately about my work history, projects, and expertise. Without grounding, it would hallucinate plausible-sounding nonsense. RAG is the guardrail.
Ingestion: Content is pulled from celestinosalim.com via a sync API endpoint. Posts, projects, and service descriptions are chunked at 500 tokens with 100-token overlap -- small enough for precise retrieval, overlapping enough to preserve context at boundaries.
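The chunking logic is simple enough to sketch. Here "tokens" are approximated by whitespace-separated words for illustration; the real pipeline would count tokens with the embedding model's tokenizer:

```typescript
// Illustrative chunker: ~500-token windows with 100 tokens of overlap.
// Tokens are approximated by words here; a real implementation would
// use the embedding model's tokenizer.
function chunkText(text: string, chunkSize = 500, overlap = 100): string[] {
  const tokens = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  const step = chunkSize - overlap; // advance 400 tokens per window
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= tokens.length) break; // final window reached
  }
  return chunks;
}
```

The 100-token overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk.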
Embedding: Each chunk is embedded using Google's gemini-embedding-001 model at 1536 dimensions and stored in Supabase with pgvector. I chose Google embeddings over OpenAI's text-embedding-3-small because the cost per token is lower and the quality is comparable for my corpus size.
Retrieval: At query time, the user's question is embedded and matched against the document store using Supabase's match_documents RPC -- a cosine similarity search with a 0.7 threshold and top-5 results. The threshold is intentionally conservative. I would rather return fewer, highly relevant chunks than flood the context window with marginally related content.
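The filtering semantics are easy to show outside the database. This toy version applies the same threshold-then-top-K logic in TypeScript; in production it runs inside Postgres via the pgvector-backed match_documents RPC:

```typescript
// Toy version of the match_documents filtering: keep only candidates
// above the similarity threshold, best-first, capped at topK.
// (In production this runs inside Postgres via pgvector.)
interface Scored {
  content: string;
  similarity: number; // cosine similarity in [0, 1]
}

function filterMatches(candidates: Scored[], threshold = 0.7, topK = 5): Scored[] {
  return candidates
    .filter((c) => c.similarity >= threshold)
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, topK);
}
```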
Tool use in voice: The voice agent has a search tool registered via LiveKit's llm.tool() API. When the LLM determines it needs specific information, it calls the search tool, which runs retrieveContext() under the hood. This means the agent does not blindly stuff every response with RAG context -- it retrieves on demand, keeping token usage lean.
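The shape of that tool can be sketched generically. The exact llm.tool() signature belongs to LiveKit Agents and is not reproduced here; the tool is modeled as a plain object, and retrieveContext is a stand-in for the real pgvector-backed retrieval:

```typescript
// Hedged sketch of the on-demand search tool. The actual LiveKit
// llm.tool() registration is not shown; retrieveContext is a
// placeholder for the real RAG retrieval call.
type Retriever = (query: string) => Promise<string[]>;

function makeSearchTool(retrieveContext: Retriever) {
  return {
    name: "search",
    description: "Look up facts about work history, projects, and expertise.",
    async execute(query: string): Promise<string> {
      const chunks = await retrieveContext(query);
      // An empty result is fine: the agent just answers from its base prompt.
      return chunks.join("\n---\n");
    },
  };
}
```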
Cost and Rate Limiting
Running a public-facing AI agent means every visitor costs money. The unit economics have to work or the project is not viable.
Model costs: Gemini 2.5 Flash is roughly 10x cheaper per token than GPT-4. For a conversational agent where most exchanges are 2-3 sentences, this is the dominant cost lever. Voice adds ElevenLabs STT/TTS costs, but those are per-audio-second -- predictable and bounded by conversation length.
Tiered rate limiting: I implemented a three-tier system using Supabase RPC functions:
| Tier | Limit | Use Case |
|------|-------|----------|
| Anonymous | 3/day | Casual visitors get a taste |
| Free (authenticated) | 15/day | Enough for a real conversation |
| Pro (subscriber) | 500/day | Power users via Stripe subscription |
The rate limiter fails open on database errors. If Supabase is down, I would rather serve a few unmetered requests than show every visitor an error page. This is a deliberate reliability-over-precision trade-off.
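The fail-open behavior fits in a few lines. This sketch assumes a hypothetical RPC wrapper; the key part is the catch branch:

```typescript
// Sketch of a fail-open rate limit check. `rpc` stands in for the
// Supabase RPC call; the catch branch is the deliberate trade-off.
type RpcCall = (userId: string) => Promise<{ allowed: boolean }>;

async function checkRateLimit(rpc: RpcCall, userId: string): Promise<boolean> {
  try {
    const { allowed } = await rpc(userId);
    return allowed;
  } catch {
    // Fail open: a few unmetered requests beat an error page for everyone.
    return true;
  }
}
```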
Batch ingestion: Embeddings are generated in batches of 10 with Promise.all to stay within API rate limits without serializing every single chunk. A full re-index of the knowledge base runs in under a minute.
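The batching pattern is a small generic helper. Here embed-style work is abstracted as a worker function; parallelism happens within a batch via Promise.all, and batches run sequentially to stay under the rate limit:

```typescript
// Illustrative batch runner: parallel within a batch (Promise.all),
// sequential between batches to respect API rate limits.
async function inBatches<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  batchSize = 10,
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    // All 10 (or fewer) calls in this batch fly concurrently.
    results.push(...(await Promise.all(batch.map(worker))));
  }
  return results;
}
```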
Reliability — What Happens When Things Break
Production systems fail. The question is whether users notice.
- RAG failure is graceful. If retrieveContext() hits an error, it catches it and returns an empty array. The agent continues with its base prompt. Experience degrades from "grounded expert" to "informed generalist" -- not ideal, but far better than a crash.
- Transcript noise filtering. shouldIgnoreTranscript() rejects audio that produces fewer than 2 alphanumeric characters or is entirely non-ASCII. Background noise and coughs do not trigger expensive LLM calls.
- Background noise cancellation. LiveKit's BackgroundVoiceCancellation filters ambient sound before it reaches STT, improving accuracy in coffee shops and open offices.
- Session persistence. Every message is saved to Supabase chat_logs with session and user IDs. Refreshes and return visits restore full history. The voice agent syncs messages to the frontend via LiveKit data channels in real time.
- User memory. For authenticated users, the system maintains short-term memory (recent messages), long-term memory (extracted facts), and periodic summarization. The agent remembers you across sessions.
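The transcript noise filter above is a cheap pure function. A sketch of the rules as described, using an ASCII-only definition of "alphanumeric" for simplicity:

```typescript
// Sketch of the transcript noise filter: reject transcripts with
// fewer than 2 alphanumeric characters, or with no ASCII content at
// all, so coughs and misheard noise never trigger an LLM call.
function shouldIgnoreTranscript(text: string): boolean {
  const alphanumeric = text.replace(/[^a-zA-Z0-9]/g, "");
  if (alphanumeric.length < 2) return true;
  // Entirely non-ASCII output usually means the STT misheard noise.
  if (!/[\x00-\x7F]/.test(text)) return true;
  return false;
}
```

Running this check before the LLM call costs microseconds and saves a full inference round-trip on every cough.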
Results
celestino.ai is live in production, deployed on Vercel with the LiveKit agent running as a separate process. It is the primary call-to-action across every page of celestinosalim.com, every social profile, and every bio.
This is not a demo. It is a production system with auth, rate limiting, session persistence, memory, and graceful degradation. It runs the same infrastructure patterns I advocate for in client work -- because the most convincing portfolio is one that practices what it preaches.
What I Learned
Voice is harder than chat, and the gap is wider than you expect. Text chat is forgiving -- a 2-second delay feels normal. In voice, 2 seconds of silence feels like the system crashed. Every architectural decision in the voice pipeline exists to shave milliseconds. Preemptive generation, streaming TTS, and WebRTC direct connections are not optimizations; they are requirements.
RAG retrieval thresholds matter more than chunk size. I spent time tuning chunk sizes (300, 500, 800 tokens) and the quality differences were marginal. But moving the similarity threshold from 0.5 to 0.7 dramatically reduced irrelevant context bleeding into responses. A tight threshold with fewer results beats a loose threshold with more.
Rate limiting is a product decision, not just a cost decision. The three-tier system (anonymous, free, pro) is not just about controlling spend. It creates a natural funnel: try 3 free messages, sign up for 15, subscribe for 500. The rate limiter is doing marketing work.
Fail open, not closed. When Supabase is slow or unreachable, the rate limiter allows requests through. When RAG retrieval fails, the agent responds without grounding. When noise cancellation modules are unavailable, audio passes through unfiltered. Every failure mode defaults to "serve the user, accept the risk" rather than "protect the system, block the user." For a portfolio agent, this is the correct trade-off. For a banking app, it would not be.
