Jailbreak
Techniques used to bypass an AI model's safety constraints and content policies, tricking it into generating outputs it was designed to refuse.
Why It Matters
Jailbreaking exposes AI safety vulnerabilities and drives the adversarial testing that makes models more robust. It is a cat-and-mouse game between attackers and defenders.
Example
A prompt that frames a harmful request as a fictional scenario, roleplay, or hypothetical to circumvent the model's refusal to answer such questions directly.
Think of it like...
Like finding loopholes in rules — the rules say you cannot do X directly, so you find an indirect way to achieve the same result.
Related Terms
Prompt Injection
A security vulnerability where malicious input is crafted to override or manipulate an LLM's system prompt or instructions, causing it to behave in unintended ways.
Red Teaming
The practice of systematically testing AI systems by attempting to find failures, vulnerabilities, and harmful behaviors before deployment. Red teamers actively try to break the system.
AI Safety
The research field focused on ensuring AI systems operate reliably, predictably, and without causing unintended harm. It spans from technical robustness to long-term existential risk concerns.
Guardrails
Safety mechanisms and constraints built into AI systems to prevent harmful, inappropriate, or off-topic outputs. Guardrails can operate at the prompt, model, or output level.
Adversarial Attack
An input deliberately crafted to fool an AI model into making incorrect predictions. Adversarial examples often look normal to humans but cause models to fail spectacularly.