Leaderboard
A ranking of AI models by their scores on specific benchmarks. Leaderboards drive competition and provide quick comparisons, but they can encourage gaming and narrow optimization toward the measured tasks.
Why It Matters
Leaderboards shape the AI narrative and influence purchasing decisions. Understanding their limitations helps prevent over-reliance on rankings that may be misleading.
Example
The LMSYS Chatbot Arena leaderboard ranks LLMs by human preference through blind head-to-head comparisons, while the MMLU leaderboard ranks them by multitask accuracy.
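A head-to-head leaderboard has to turn individual votes into a single ordering. One common way to do that is an Elo-style rating update, sketched below. This is an illustration only, not necessarily the method any particular leaderboard uses; the function names, the `k` and `base` values, and the model names are all assumptions made for the example.

```python
from collections import defaultdict


def elo_ratings(votes, k=32.0, base=1000.0):
    """Compute Elo-style ratings from (winner, loser) pairs, processed in order.

    k and base are illustrative defaults, not values taken from any real leaderboard.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        # Probability the winner was expected to win, given the current rating gap.
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # An upset (low expected_win) moves ratings more than an expected result.
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


def leaderboard(ratings):
    """Sort models from highest to lowest rating."""
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical blind-comparison outcomes: (preferred model, other model).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_c", "model_a"),
]

for rank, (name, rating) in enumerate(leaderboard(elo_ratings(votes)), start=1):
    print(rank, name, round(rating, 1))
```

Because each update depends on the ratings at that moment, the result depends on the order in which votes are processed. That is one reason a small gap between two models on such a leaderboard should be read with caution.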
Think of it like...
Like a sports league table: it shows relative standing but does not capture everything about a team's quality, strategy, or potential.
Related Terms
Benchmark
A standardized test or dataset used to evaluate and compare the performance of AI models. Benchmarks provide consistent metrics that allow fair comparisons between different approaches.
Evaluation
The systematic process of measuring an AI model's performance, safety, and reliability using various metrics, benchmarks, and testing methodologies.
Benchmark Contamination
When a model's training data inadvertently includes test data from benchmarks, leading to inflated performance scores that do not reflect true capability.
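One simple way to look for contamination is to measure how many word n-grams from a test item also appear in the training text. The sketch below assumes whitespace tokenization and an arbitrary n-gram length; the function names and the default `n` are hypothetical choices for illustration, and real contamination checks are usually more involved.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item, training_text, n=8):
    """Fraction of the test item's n-grams that also occur in the training text.

    Returns 0.0 when the test item is shorter than n tokens.
    """
    test_grams = ngrams(test_item, n)
    if not test_grams:
        return 0.0
    train_grams = ngrams(training_text, n)
    return len(test_grams & train_grams) / len(test_grams)


# Hypothetical example: the test question appears verbatim inside the training text.
question = "the capital of france is paris"
corpus = "quiz answer the capital of france is paris obviously"
print(overlap_fraction(question, corpus, n=3))
```

A high overlap suggests the item may have been seen during training. A low overlap does not prove the opposite, since paraphrased or translated copies would not be caught by exact n-gram matching.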