Adversarial Attack
An input deliberately crafted to fool an AI model into making an incorrect prediction. Adversarial examples often look completely normal to humans but cause models to fail spectacularly.
Why It Matters
Adversarial attacks expose fundamental weaknesses in AI systems. A self-driving car that can be fooled by a sticker on a stop sign is a safety-critical vulnerability.
Example
Adding an imperceptible noise pattern to a photo of a panda that causes a classifier to confidently identify it as a gibbon — the image looks unchanged to humans.
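The panda example is typically produced with a gradient-based method such as the fast gradient sign method (FGSM): nudge every input dimension slightly in the direction that increases the model's loss. Here is a minimal sketch of that idea on a toy logistic-regression "classifier" — the weights and input are random placeholders, not a real image model:

```python
import numpy as np

# Toy stand-in for a trained classifier: logistic regression
# p = sigmoid(w . x + b). All values here are illustrative.
rng = np.random.default_rng(0)
w = rng.normal(size=100)   # hypothetical trained weights
b = 0.0
x = rng.normal(size=100)   # a "clean" input
y = 1.0                    # its true label

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_grad_wrt_input(x, y):
    # Gradient of binary cross-entropy w.r.t. the input: (p - y) * w
    p = sigmoid(w @ x + b)
    return (p - y) * w

# FGSM: take one small step (bounded by eps per dimension) in the
# direction that increases the loss.
eps = 0.1
x_adv = x + eps * np.sign(loss_grad_wrt_input(x, y))

p_clean = sigmoid(w @ x + b)
p_adv = sigmoid(w @ x_adv + b)
# Each dimension changes by at most eps, yet the model's confidence
# in the true class drops.
print(f"clean confidence: {p_clean:.3f}, adversarial: {p_adv:.3f}")
```

The key point the sketch illustrates: because the perturbation is bounded by a small eps in every dimension, it can be imperceptible while still moving the prediction, since the tiny per-dimension changes all push the loss in the same direction.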
Think of it like...
Like an optical illusion that tricks the human eye — adversarial examples exploit the model's 'perception' weaknesses in ways that are invisible to us.
Related Terms
Robustness
The ability of an AI model to maintain reliable performance when faced with unexpected inputs, adversarial attacks, data distribution changes, or edge cases.
Adversarial Training
A defense technique where adversarial examples are included in the training data to make the model more robust against attacks. The model learns to handle both normal and adversarial inputs.
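A hedged sketch of that training loop, using the same toy logistic-regression setup (synthetic data and labels, FGSM as the attack — all names here are illustrative): at each step, adversarial versions of the batch are generated against the current model and mixed into the gradient update.

```python
import numpy as np

# Synthetic linearly separable data; not a real dataset.
rng = np.random.default_rng(1)
n, d = 200, 20
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = (X @ true_w > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(d)          # model weights, trained from scratch
eps, lr = 0.05, 0.1

for step in range(200):
    # 1. Craft FGSM examples against the *current* model
    #    (gradient of cross-entropy w.r.t. the input is (p - y) * w).
    p = sigmoid(X @ w)
    X_adv = X + eps * np.sign((p - y)[:, None] * w)
    # 2. Train on clean and adversarial inputs together.
    X_mix = np.vstack([X, X_adv])
    y_mix = np.concatenate([y, y])
    p_mix = sigmoid(X_mix @ w)
    grad_w = X_mix.T @ (p_mix - y_mix) / len(y_mix)
    w -= lr * grad_w

acc_clean = np.mean((sigmoid(X @ w) > 0.5) == (y == 1))
print(f"clean accuracy after adversarial training: {acc_clean:.2f}")
```

The design choice to regenerate the adversarial examples every step matters: they must target the model as it currently is, or the model only learns to resist attacks on an outdated version of itself.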
Jailbreak
Techniques used to bypass an AI model's safety constraints and content policies, tricking it into generating outputs it was designed to refuse.
AI Safety
The research field focused on ensuring AI systems operate reliably, predictably, and without causing unintended harm. It spans from technical robustness to long-term existential risk concerns.