A RAG system without citations is a chatbot. A RAG system with precise, verifiable citations is a research tool. The difference is trust, and trust is what makes users come back.
I have found that citation quality is the single strongest predictor of user retention in enterprise RAG products. Users do not just want answers --- they want to verify those answers against the source material. Research from the Allen Institute for AI shows that citation accuracy in RAG systems averages only 65--70% without explicit attribution mechanisms. That means roughly one in three citations is wrong or unsupported. In high-stakes domains, this destroys credibility.
This lesson covers how to build citation systems that are accurate, granular, and useful.
The simplest form of attribution is document-level citation: cite which document the answer came from.
Answer: "The refund window is 30 days for physical products."
Source: billing-policy.pdf
Pros: Easy to implement. Better than no citation. Cons: The user still has to search the entire document to verify the claim. For a 50-page PDF, this is barely better than no citation at all.
The next level is chunk-level citation: cite the specific chunk that was retrieved.
Answer: "The refund window is 30 days for physical products." [1]
[1] billing-policy.pdf, Section 4.2: "Refund Policy"
"Customers may request a full refund within 30 days of purchase
for all physical products. Digital products are eligible for
store credit only."
Pros: Users can verify the claim against a small, specific passage. This is where most production systems should aim. Cons: Requires clean chunk boundaries and good section metadata.
The most granular level is sentence-level attribution: cite the exact sentence or phrase supporting each claim.
Answer: "The refund window is 30 days [1] for physical products [1],
but digital products only qualify for store credit [2]."
[1] billing-policy.pdf, Section 4.2, Paragraph 1:
"Customers may request a full refund within 30 days of purchase
for all physical products."
[2] billing-policy.pdf, Section 4.2, Paragraph 2:
"Digital products are eligible for store credit only."
Pros: Maximum verifiability. Essential for legal, medical, and financial applications. Cons: Requires the LLM to perform fine-grained attribution, which adds complexity and latency.
Chunk-level citation is the practical sweet spot for most production systems. Here is the architecture.
Every chunk must carry enough metadata to generate a meaningful citation:
import hashlib
from datetime import datetime, timezone

def create_citable_chunk(text, source_doc, section_path, page_num=None):
    return {
        "text": text,
        "metadata": {
            "source_id": source_doc.id,
            "source_title": source_doc.title,
            "source_url": source_doc.url,  # For linking back
            "section_path": section_path,  # e.g., "Billing > Refunds > Exceptions"
            "page_number": page_num,
            "chunk_hash": hashlib.sha256(text.encode()).hexdigest()[:12],
            "indexed_at": datetime.now(timezone.utc).isoformat(),
        },
    }
The key insight: Citation quality is determined at index time, not generation time. If you lose source information during chunking, no prompt engineering can recover it.
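As a sanity check that the index-time metadata is sufficient, a citation line can be rendered straight from a chunk. A minimal sketch; `render_citation` and the truncated example chunk are illustrative, not part of the pipeline above:

```python
# Sketch: render a human-readable citation line from chunk metadata.
# `render_citation` is a hypothetical helper, not part of the lesson's code.
def render_citation(chunk: dict, number: int) -> str:
    meta = chunk["metadata"]
    parts = [f"[{number}] {meta['source_title']} > {meta['section_path']}"]
    if meta.get("page_number") is not None:
        parts.append(f"(p. {meta['page_number']})")
    return " ".join(parts)

chunk = {
    "text": "Customers may request a full refund within 30 days...",
    "metadata": {
        "source_title": "billing-policy.pdf",
        "section_path": "Billing > Refunds",
        "page_number": 12,
    },
}

print(render_citation(chunk, 1))
# [1] billing-policy.pdf > Billing > Refunds (p. 12)
```

If any field needed here is missing at index time, that is the moment to fix the chunker, not the prompt.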
Include explicit citation instructions in your system prompt:
SYSTEM_PROMPT = """You are a helpful assistant that answers questions
based on the provided context documents.
CITATION RULES:
1. Only use information from the provided context documents.
2. Cite every factual claim using [N] notation, where N corresponds
to the source number.
3. If the context does not contain enough information to answer,
say "I don't have enough information to answer this" rather
than guessing.
4. Never combine information from different sources without
citing each source separately.
5. If two sources conflict, present both views with their citations.
CONTEXT DOCUMENTS:
{formatted_context}
"""
def format_context_with_citations(chunks):
    formatted = []
    for i, chunk in enumerate(chunks, 1):
        source = chunk["metadata"]
        formatted.append(
            f"[Source {i}] {source['source_title']} > {source['section_path']}\n"
            f"{chunk['text']}\n"
        )
    return "\n---\n".join(formatted)
The LLM will sometimes hallucinate citation numbers or cite the wrong source. Always validate:
import re

def validate_citations(response: str, num_sources: int):
    """Extract and validate citation references in the response."""
    citations = re.findall(r'\[(\d+)\]', response)
    citations = [int(c) for c in citations]

    issues = []
    for c in citations:
        if c < 1 or c > num_sources:
            issues.append(f"Citation [{c}] references non-existent source")

    # Check for uncited claims (sentences without any citation)
    sentences = re.split(r'[.!?]+', response)
    uncited = [s.strip() for s in sentences
               if s.strip() and not re.search(r'\[\d+\]', s)
               and len(s.strip().split()) > 5]
    if uncited:
        issues.append(f"{len(uncited)} sentences lack citations")

    return {
        "valid": len(issues) == 0,
        "citations_found": citations,
        "issues": issues,
    }
When retrieved chunks disagree, the system must present both perspectives:
Answer: "The standard processing time is 5-7 business days [1],
though the updated 2025 policy indicates 3-5 business
days for premium members [2]."
Implementation: Add a conflict detection step that checks for contradictory information across retrieved chunks before sending to the LLM. When conflicts are detected, modify the prompt to explicitly instruct the LLM to present both views.
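One lightweight way to approximate conflict detection is to flag chunk pairs that share most of their vocabulary but disagree on the numbers they contain. A sketch only; the function name, the word-overlap heuristic, and the 0.5 threshold are assumptions, not a prescribed method:

```python
import re
from itertools import combinations

def detect_numeric_conflicts(chunks: list) -> list:
    """Flag chunk pairs that share vocabulary but cite different numbers.

    Crude heuristic (an assumption, not the lesson's method): if two chunks
    overlap heavily in their non-numeric words yet contain different sets
    of numbers, they may conflict, and the prompt should explicitly ask
    the LLM to present both views with citations.
    """
    def features(text):
        words = set(re.findall(r"[a-z]+", text.lower()))
        numbers = set(re.findall(r"\d+(?:\.\d+)?", text))
        return words, numbers

    flagged = []
    for i, j in combinations(range(len(chunks)), 2):
        words_i, nums_i = features(chunks[i])
        words_j, nums_j = features(chunks[j])
        overlap = len(words_i & words_j) / max(1, min(len(words_i), len(words_j)))
        if overlap > 0.5 and nums_i and nums_j and nums_i != nums_j:
            flagged.append((i, j))
    return flagged

chunks = [
    "Standard processing time is 5-7 business days for all orders.",
    "Processing time is 3-5 business days for premium members.",
]
print(detect_numeric_conflicts(chunks))  # [(0, 1)]
```

A real implementation might use an NLI model or an LLM judge instead; the point is that the check runs before generation, so the prompt can be adjusted when a conflict is found.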
Some answers require synthesizing information from multiple sources:
Answer: "The annual revenue was $12M [1] with operating costs of
$8M [2], resulting in a net margin of approximately 33%
[calculated from sources 1 and 2]."
Implementation: Allow the LLM to indicate when a claim is derived from multiple sources. The notation [calculated from sources 1 and 2] signals to the user that this is a synthesis, not a direct quote.
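If you adopt this notation, the citation validator should recognize it so that derived claims are not flagged as uncited. A hedged sketch; the regex is an assumption matching the exact phrasing shown above, and `extract_all_citations` is a hypothetical helper:

```python
import re

# Assumption: derived claims use the notation shown above,
# e.g. "[calculated from sources 1 and 2]".
DERIVED = re.compile(r"\[calculated from sources? ((?:\d+(?:,?\s*(?:and\s+)?)?)+)\]")

def extract_all_citations(response: str) -> set:
    """Return every source number cited, directly or via a derivation."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", response)}
    for match in DERIVED.finditer(response):
        cited.update(int(n) for n in re.findall(r"\d+", match.group(1)))
    return cited

answer = ("The annual revenue was $12M [1] with operating costs of $8M [2], "
          "resulting in a net margin of approximately 33% "
          "[calculated from sources 1 and 2].")
print(sorted(extract_all_citations(answer)))  # [1, 2]
```

Each derived source number still gets validated against the retrieved chunks, exactly like a direct citation.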
The most important citation is the absence of one. When the system cannot find supporting evidence, it should say so:
NO_ANSWER_PROMPT = """If you cannot find sufficient evidence in the
provided context to answer the question, respond with:
"I don't have enough information in the available documents to
answer this question. The closest related information I found is:
[brief summary of what IS available, with citations]."
Do NOT attempt to answer from your general knowledge."""
This is a guardrail against hallucination. Users trust a system that admits its limits far more than one that confidently fabricates answers.
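Operationally, it also helps to detect these refusals so you can track how often the guardrail fires. A minimal sketch assuming the refusal phrasing from NO_ANSWER_PROMPT above; `is_refusal` is a hypothetical helper name:

```python
# Assumption: refusals use the phrasing mandated by NO_ANSWER_PROMPT.
REFUSAL_MARKER = "i don't have enough information"

def is_refusal(response: str) -> bool:
    """Flag refusal responses so the no-answer rate can be logged as a metric."""
    return REFUSAL_MARKER in response.lower()

print(is_refusal("I don't have enough information in the available documents."))
print(is_refusal("The refund window is 30 days [1]."))
```

A rising refusal rate can signal a retrieval regression rather than a model problem, which is exactly why it is worth logging.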
Track these metrics in production:
citation_metrics = {
    # What percentage of responses include at least one citation?
    "citation_coverage": cited_responses / total_responses,

    # What percentage of citations point to valid source chunks?
    "citation_validity": valid_citations / total_citations,

    # What percentage of cited claims are actually supported by the source?
    "citation_faithfulness": supported_claims / cited_claims,

    # What percentage of factual sentences have citations?
    "sentence_citation_rate": cited_sentences / factual_sentences,
}
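The first three ratios can be aggregated from per-response logs; the fourth additionally needs a claim extractor. A sketch under an assumed logging schema (each citation marked valid by the validator and supported by some grader); the schema and function name are illustrative, not prescribed:

```python
def compute_citation_metrics(responses: list) -> dict:
    """Aggregate logged responses into the first three metrics above.

    Assumed schema: each response dict carries a "citations" list, each
    entry marked "valid" (points at a real source) and "supported"
    (claim actually backed by that source, per a human or LLM grader).
    """
    total_responses = len(responses)
    cited = [r for r in responses if r["citations"]]
    all_cites = [c for r in responses for c in r["citations"]]
    return {
        "citation_coverage": len(cited) / total_responses,
        "citation_validity": sum(c["valid"] for c in all_cites) / len(all_cites),
        "citation_faithfulness": sum(c["supported"] for c in all_cites) / len(all_cites),
    }

log = [
    {"citations": [{"valid": True, "supported": True},
                   {"valid": True, "supported": False}]},
    {"citations": []},
]
m = compute_citation_metrics(log)
print(m["citation_coverage"])      # 0.5
print(m["citation_faithfulness"])  # 0.5
```

Running this over a daily sample of production traffic is usually enough to catch regressions in citation behavior.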
Targets I aim for:
Use this checklist to assess your citation implementation:
If your citation validity is below 99%, treat invalid citations as bugs. A citation pointing to the wrong source is worse than no citation at all -- it actively misleads the user and destroys the trust you are trying to build.
Next, we build the measurement system: eval harnesses for retrieval quality. You cannot improve citation quality, retrieval recall, or answer faithfulness without a systematic way to measure them. Lesson 7 shows how to build eval suites that run in CI and catch regressions before deployment.