Guardrail Model
A separate, specialized AI model that monitors the inputs and outputs of a primary LLM to detect and block harmful, off-topic, or policy-violating content.
Why It Matters
Guardrail models add a safety layer that is independent of the primary model. Even if the primary model is tricked, the guardrail can still catch and block the policy violation.
Example
A guardrail model scans every user input for prompt injection attempts and every model output for harmful content, blocking anything that violates policy.
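The input-and-output screening described above can be sketched as a wrapper around the primary model call. This is a minimal illustration: the keyword lists and the `primary_model` stub are placeholders, and a real deployment would call a trained guardrail classifier rather than matching strings.

```python
# Hypothetical pattern lists standing in for a trained guardrail model.
BLOCKED_INPUT_PATTERNS = ["ignore previous instructions", "reveal your system prompt"]
BLOCKED_OUTPUT_PATTERNS = ["here is how to make a weapon"]

def guardrail_check(text: str, patterns: list[str]) -> bool:
    """Return True if the text violates policy (here: naive keyword match)."""
    lowered = text.lower()
    return any(p in lowered for p in patterns)

def primary_model(prompt: str) -> str:
    """Stand-in for the primary LLM call."""
    return f"Answer to: {prompt}"

def guarded_generate(prompt: str) -> str:
    # Screen the input before it ever reaches the primary model.
    if guardrail_check(prompt, BLOCKED_INPUT_PATTERNS):
        return "[blocked: input violates policy]"
    output = primary_model(prompt)
    # Screen the output before it reaches the user.
    if guardrail_check(output, BLOCKED_OUTPUT_PATTERNS):
        return "[blocked: output violates policy]"
    return output
```

Note that the guardrail logic never shares state with `primary_model` — that separation is what lets it catch violations even when the primary model has been manipulated.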
Think of it like...
Like a security guard at a building entrance — separate from the building staff, specifically trained to spot threats and prevent unauthorized access.
Related Terms
Guardrails
Safety mechanisms and constraints built into AI systems to prevent harmful, inappropriate, or off-topic outputs. Guardrails can operate at the prompt, model, or output level.
AI Safety
The research field focused on ensuring AI systems operate reliably, predictably, and without causing unintended harm. It spans from technical robustness to long-term existential risk concerns.
Content Moderation
The process of monitoring and filtering user-generated or AI-generated content to ensure it meets platform guidelines and legal requirements. AI is increasingly used to automate content moderation.
Prompt Injection Defense
Techniques and strategies for protecting LLM applications from prompt injection attacks, including input sanitization, output filtering, and architectural defenses.
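Of the defenses listed, input sanitization is the simplest to illustrate. The sketch below uses assumed heuristics (fake role markers, delimiter floods) purely as an example; it is one layer of defense in depth, not a complete protection on its own.

```python
import re

def sanitize_input(user_text: str) -> str:
    """Strip common injection scaffolding before passing text to the LLM.

    The specific patterns here are illustrative assumptions, not a
    standard or exhaustive defense.
    """
    # Remove fake role markers an attacker might embed to impersonate the system.
    cleaned = re.sub(r"(?im)^\s*(system|assistant)\s*:", "", user_text)
    # Collapse long delimiter runs sometimes used to "break out" of a prompt template.
    cleaned = re.sub(r"[#`\-]{4,}", " ", cleaned)
    return cleaned.strip()
```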
Classification
A type of supervised learning task where the model predicts which category or class an input belongs to. The output is a discrete label rather than a continuous value.
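Guardrail models are themselves classifiers: they map text to a discrete label such as "safe" or "unsafe". A toy example of discrete-label output (the word lists and labels are invented for illustration):

```python
def classify_sentiment(text: str) -> str:
    """Toy classifier: returns a discrete class label, not a continuous score."""
    positive = {"good", "great", "love"}
    negative = {"bad", "awful", "hate"}
    words = set(text.lower().split())
    pos, neg = len(words & positive), len(words & negative)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"
```

A trained model would replace the word-count heuristic, but the interface is the same: input in, one label out of a fixed set.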