Benchmark Contamination
When a model's training data inadvertently includes test data from benchmarks, leading to inflated performance scores that do not reflect true capability.
Why It Matters
Benchmark contamination undermines the entire evaluation ecosystem. Models may appear to improve when they have simply memorized the test answers.
Example
A model scores 95% on a coding benchmark because the exact solutions appeared in its training data, but only 70% on genuinely novel problems; the 25-point gap reflects contamination, not capability.
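A common screening technique for this kind of leakage is checking for verbatim n-gram overlap between benchmark items and the training corpus. The sketch below is illustrative, not a production pipeline; the function names and the choice of n are assumptions, and real contamination audits typically also handle tokenization, normalization, and near-duplicate matches.

```python
# Minimal sketch of n-gram overlap contamination screening.
# All names and parameters here are illustrative assumptions.

def ngrams(text, n=8):
    """Return the set of word-level n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(benchmark_item, training_docs, n=8):
    """Flag a benchmark item if any of its n-grams appears
    verbatim in any training document."""
    item_grams = ngrams(benchmark_item, n)
    return any(item_grams & ngrams(doc, n) for doc in training_docs)

# A training document that happens to contain the benchmark solution:
train = ["def add(a, b): return a + b  # worked example from a tutorial"]
print(is_contaminated("def add(a, b): return a + b", train, n=5))   # True
print(is_contaminated("print('hello world')", train, n=5))          # False
```

Flagged items are usually removed from the benchmark (or the model is re-scored on the clean subset) so that reported numbers measure generalization rather than recall.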
Think of it like...
Like a student who somehow obtained the exam questions in advance: the high score reflects memorization, not actual knowledge.
Related Terms
Benchmark
A standardized test or dataset used to evaluate and compare the performance of AI models. Benchmarks provide consistent metrics that allow fair comparisons between different approaches.
Evaluation
The systematic process of measuring an AI model's performance, safety, and reliability using various metrics, benchmarks, and testing methodologies.
Training Data
The dataset used to teach a machine learning model. It contains examples (and often labels) that the model learns patterns from during the training process. The quality and quantity of training data directly impact model performance.