Benchmark
A standardized test or dataset used to evaluate and compare the performance of AI models. Benchmarks provide consistent metrics that allow fair comparisons between different approaches.
Why It Matters
Benchmarks drive AI progress by creating measurable targets and enabling objective comparisons. However, over-optimizing for benchmarks can lead to models that game the test.
Example
MMLU (Massive Multitask Language Understanding), which tests LLMs across 57 subjects, or ImageNet, which tests computer vision models on 1,000 object categories.
Think of it like...
Like standardized college entrance exams — they provide a consistent way to compare applicants, though they do not capture everything about a student's abilities.
Related Terms
Evaluation
The systematic process of measuring an AI model's performance, safety, and reliability using various metrics, benchmarks, and testing methodologies.
Leaderboard
A ranking of AI models by performance on specific benchmarks. Leaderboards drive competition and provide quick comparisons but can encourage gaming and narrow optimization.
Perplexity
A metric that measures how well a language model predicts text. Lower perplexity indicates the model is less 'surprised' by the text, meaning it can predict the next token more accurately.
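The definition above can be made concrete: perplexity is the exponential of the average negative log-probability the model assigned to each actual next token. A minimal sketch, using hypothetical per-token probabilities rather than a real language model:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability
    assigned to each observed token. Lower = less 'surprised'."""
    n = len(token_probs)
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log_prob)

# A model that spreads probability evenly over 4 choices at every step
# has perplexity ≈ 4 — as if it were guessing among 4 equally likely tokens.
uniform_over_4 = [0.25, 0.25, 0.25, 0.25]
print(perplexity(uniform_over_4))

# A more confident model is less surprised, so perplexity drops toward 1.
confident = [0.9, 0.8, 0.95, 0.85]
print(perplexity(confident))
```

This is why perplexity is often read as the model's effective "branching factor": the number of tokens it behaves as if it is choosing among at each step.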
Accuracy
The percentage of correct predictions out of all predictions made by a model. While intuitive, accuracy can be misleading for imbalanced datasets.
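The imbalanced-dataset caveat can be shown in a few lines. A sketch with made-up data: on a set that is 95% negative, a "model" that always predicts the majority class scores 95% accuracy while detecting nothing:

```python
# Hypothetical imbalanced dataset: 95 negatives (0), 5 positives (1).
y_true = [0] * 95 + [1] * 5

# A degenerate model that always predicts the majority class.
y_pred = [0] * 100

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

print(accuracy(y_true, y_pred))  # 0.95 — yet it never finds a single positive
```

This is exactly the situation where metrics such as precision, recall, or F1 (below) are more informative than raw accuracy.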
F1 Score
The harmonic mean of precision and recall, providing a single metric that balances both. F1 scores range from 0 to 1, with 1 indicating perfect precision and recall.
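The harmonic-mean formula is F1 = 2 · (precision · recall) / (precision + recall). A minimal sketch over hypothetical binary labels, computing all three quantities from true/false positive and false negative counts:

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Example: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
p, r, f1 = precision_recall_f1(y_true, y_pred)
print(p, r, f1)  # precision and recall are both 2/3, so F1 is also 2/3
```

Because the harmonic mean is dominated by the smaller of the two values, F1 stays low whenever either precision or recall is poor, which is why it is preferred over a simple average.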