Start Lesson
An LLM without guardrails is like a power supply without a fuse. It will work perfectly -- until it does not, and then it will damage everything downstream.
I use the term "guardrails" deliberately. In hardware engineering, engineers design protection circuits: voltage regulators, current limiters, thermal shutoffs, and surge protectors. These components exist not because the system is expected to fail, but because the consequences of unprotected failure are unacceptable. A $0.50 fuse protects a $5,000 circuit board.
AI guardrails follow the same principle. They are cheap to implement relative to the cost of a single unguarded failure -- a hallucinated legal claim, a leaked customer record, a prompt injection that exposes your system prompt, or a runaway token generation that burns through your monthly budget in an afternoon.
This lesson covers the guardrail architecture I use in every production system. It is layered, configurable, and designed to fail safe.
Drawing from NVIDIA's NeMo Guardrails framework and my own production experience, I architect guardrails in five layers. Each layer catches a different class of problem:
┌──────────────────┐
User Input ──────►│ INPUT RAILS │ Block injection, validate format
├──────────────────┤
│ DIALOG RAILS │ Enforce topic boundaries
├──────────────────┤
│ RETRIEVAL RAILS │ Filter RAG context quality
├──────────────────┤
│ EXECUTION RAILS │ Validate tool/action calls
├──────────────────┤
│ OUTPUT RAILS │ Final content/quality check
└───────┬──────────┘
│
Safe Response ──────► User
Input rails process user messages before they reach the LLM. This is your first line of defense:
Prompt injection detection. Users -- sometimes intentionally, sometimes through copied text -- can include instructions that override your system prompt. Input rails detect patterns like "ignore previous instructions," role-play manipulation, and encoding-based injection attempts.
Format validation. Reject inputs that exceed token limits, contain malformed data, or include binary content that should not be in a text prompt.
PII detection. If your system should not process personal data, catch it at the input layer. Do not send social security numbers, credit card numbers, or health records to an external LLM API.
class InputRails:
def __init__(self, config: RailsConfig):
self.injection_detector = InjectionDetector()
self.pii_scanner = PIIScanner()
self.token_limit = config.max_input_tokens
def validate(self, user_input: str) -> RailResult:
# Check token limit
token_count = count_tokens(user_input)
if token_count > self.token_limit:
return RailResult.blocked(
reason="input_too_long",
user_message="Your message is too long. "
"Please keep it under "
f"{self.token_limit} tokens."
)
# Check for injection attempts
injection_score = self.injection_detector.score(user_input)
if injection_score > 0.85:
log_security_event("injection_attempt", user_input)
return RailResult.blocked(
reason="injection_detected",
user_message="I cannot process that request."
)
# Check for PII
pii_findings = self.pii_scanner.scan(user_input)
if pii_findings:
return RailResult.blocked(
reason="pii_detected",
user_message="Please remove personal information "
"before submitting."
)
return RailResult.passed()
Dialog rails control the conversation boundaries. They enforce what topics the AI can and cannot discuss:
Topic boundaries. A customer support AI should not provide medical advice, legal opinions, or political commentary. Dialog rails maintain an allow-list of topics and redirect off-topic requests.
Conversation flow enforcement. For structured interactions (onboarding, troubleshooting), dialog rails ensure the conversation follows the designed path and does not wander.
I implement dialog rails primarily through system prompt engineering combined with a lightweight classifier that routes off-topic messages to a polite refusal:
TOPIC_CLASSIFIER_PROMPT = """
Classify this user message into one of these categories:
- ON_TOPIC: Related to {allowed_topics}
- OFF_TOPIC: Not related to the above
- AMBIGUOUS: Could be related, needs clarification
Message: {user_message}
Category:
"""
The key design decision: dialog rails should be configurable per deployment, not hardcoded. A system deployed for customer support has different topic boundaries than the same system deployed for internal knowledge management.
In RAG (Retrieval-Augmented Generation) systems, the retrieved context is as dangerous as user input. Retrieval rails filter the context before it reaches the LLM:
Relevance filtering. Reject retrieved chunks below a similarity threshold. Irrelevant context increases hallucination risk and token costs.
Staleness detection. Flag or exclude documents that are past their review date. An AI citing a two-year-old pricing document is a liability.
Source authority. Weight or filter context based on source reliability. Internal documentation outranks forum posts.
When your AI system can take actions -- calling APIs, writing to databases, sending emails -- execution rails are the safety valves that prevent catastrophic actions:
Action allow-listing. The model can only call explicitly approved functions. No dynamic function generation.
Parameter validation. Even approved actions get their parameters validated before execution. A "send_email" action should verify the recipient is in the approved domain list.
Confirmation gates. High-impact actions (deleting data, sending to external systems, financial transactions) require explicit confirmation before execution.
class ExecutionRails:
REQUIRES_CONFIRMATION = {
"delete_record", "send_external_email",
"process_refund", "modify_subscription"
}
def validate_action(self, action: Action) -> RailResult:
if action.name not in self.allowed_actions:
return RailResult.blocked(
reason="unauthorized_action"
)
if not self.validate_params(action):
return RailResult.blocked(
reason="invalid_parameters"
)
if action.name in self.REQUIRES_CONFIRMATION:
return RailResult.needs_confirmation(
action=action,
message=f"I'd like to {action.description}. "
"Should I proceed?"
)
return RailResult.passed()
Output rails are the final quality gate before the response reaches the user:
Content safety. Check for harmful, biased, or inappropriate content. NVIDIA's content safety models and Llama Guard provide classifier-based checking that runs in milliseconds.
Factuality checks. For systems that should only state verifiable facts, output rails can compare claims against the retrieved context (grounding check) or flag confident-sounding statements that lack source support.
Format compliance. Ensure structured outputs (JSON, specific templates) conform to the expected schema. Reject and retry malformed responses.
Beyond content guardrails, I implement financial safety valves in every production system. These are the circuit breakers for your budget:
class CostSafetyValve:
def __init__(self, config: CostConfig):
self.hourly_limit = config.hourly_limit
self.daily_limit = config.daily_limit
self.per_request_limit = config.per_request_limit
self.current_hour_spend = 0.0
self.current_day_spend = 0.0
def check(self, estimated_cost: float) -> bool:
if estimated_cost > self.per_request_limit:
alert("per_request_cost_exceeded", estimated_cost)
return False
if self.current_hour_spend + estimated_cost > self.hourly_limit:
alert("hourly_budget_exceeded", self.current_hour_spend)
return False
if self.current_day_spend + estimated_cost > self.daily_limit:
alert("daily_budget_exceeded", self.current_day_spend)
return False
return True
I set these limits at three levels:
When a safety valve trips, the system does not crash. It degrades to a cached response or a polite "service is temporarily limited" message. The user gets a response. Your budget stays intact.
NeMo Guardrails by NVIDIA is the most comprehensive open-source option. It supports all five rail types, integrates with most LLM providers, and uses a domain-specific language called Colang for defining conversation flows. Its latest release supports streaming content through output rails and multilingual content safety.
Guardrails AI takes a different approach, focusing on structured output validation using a RAIL (Robust AI Language) specification. It excels at ensuring outputs conform to specific schemas and data types.
Custom implementation is what I recommend for production systems with specific requirements. Use the frameworks as inspiration, but build the rails that match your actual risk profile. A B2B analytics tool needs different guardrails than a consumer-facing chatbot.
Guardrails are only as good as their testing. I maintain an adversarial test suite for every guardrail layer:
| Guardrail Layer | Always Needed | Needed for Customer-Facing | Skip for Internal Tools | |----------------|--------------|---------------------------|------------------------| | Input rails (token limits, basic validation) | Yes | Yes | Simplified version | | Input rails (injection detection) | No -- internal tools with trusted users can skip | Yes | Usually skip | | Dialog rails (topic boundaries) | No -- only for scoped assistants | Yes | Usually skip | | Retrieval rails | Only if using RAG | Only if using RAG | Only if using RAG | | Execution rails | Only if AI can take actions | Yes -- non-negotiable for actions | Yes -- actions are actions regardless of audience | | Output rails (content safety) | No -- depends on risk profile | Yes | Usually skip | | Output rails (format compliance) | Yes -- malformed output breaks downstream systems regardless | Yes | Yes | | Financial safety valves | Yes -- always | Yes | Yes -- a runaway cost spike does not care who the user is |
The principle: input/output format validation and financial safety valves are always justified. Content guardrails scale with the risk profile of your deployment. An internal analytics tool used by five engineers does not need the same injection detection as a consumer chatbot serving millions of users.
Before deploying any AI system with user-facing interactions:
A guardrail that has never been tested is not a guardrail. It is a hope.
Guardrails protect against bad outputs. But what happens when your AI provider goes down entirely? The next lesson covers graceful degradation -- the four-tier hierarchy that keeps your system useful even when the LLM is unavailable. We build circuit breakers, retry strategies, cached response layers, and the chaos engineering practices that prove your fallbacks actually work.