Benchmark
A standardized test or dataset used to evaluate and compare the performance of AI models. Benchmarks provide consistent metrics that allow fair comparisons between different approaches.
Why It Matters
Benchmarks drive AI progress by creating measurable targets and enabling objective comparisons. However, over-optimizing for benchmarks can lead to models that game the test.
Example
MMLU (Massive Multitask Language Understanding), which tests LLMs across 57 subjects, or ImageNet, which tests computer vision models on 1,000 object categories.
Think of it like...
Like standardized college entrance exams — they provide a consistent way to compare applicants, though they do not capture everything about a student's abilities.
Related Terms
Evaluation
The systematic process of measuring an AI model's performance, safety, and reliability using various metrics, benchmarks, and testing methodologies.
Leaderboard
A ranking of AI models by performance on specific benchmarks. Leaderboards drive competition and provide quick comparisons but can encourage gaming and narrow optimization.
Perplexity
A metric that measures how well a language model predicts text. Lower perplexity indicates the model is less 'surprised' by the text, meaning it can predict the next token more accurately.
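The definition above can be made concrete: perplexity is the exponential of the average negative log-probability the model assigned to each actual next token. A minimal sketch, using hypothetical per-token probabilities rather than a real language model:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability
    assigned to each observed token. Lower = less 'surprised'."""
    n = len(token_probs)
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log_prob)

# A model that spreads probability evenly over 4 choices at every step
# has perplexity ≈ 4 — as if it were guessing among 4 equally likely tokens.
uniform_over_4 = [0.25, 0.25, 0.25, 0.25]
print(perplexity(uniform_over_4))

# A more confident model is less surprised, so perplexity drops toward 1.
confident = [0.9, 0.8, 0.95, 0.85]
print(perplexity(confident))
```

This is why perplexity is often read as the model's effective "branching factor": the number of tokens it behaves as if it is choosing among at each step.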
Accuracy
The percentage of correct predictions out of all predictions made by a model. While intuitive, accuracy can be misleading for imbalanced datasets.
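The imbalanced-dataset caveat can be shown in a few lines. A sketch with made-up data: on a set that is 95% negative, a "model" that always predicts the majority class scores 95% accuracy while detecting nothing:

```python
# Hypothetical imbalanced dataset: 95 negatives (0), 5 positives (1).
y_true = [0] * 95 + [1] * 5

# A degenerate model that always predicts the majority class.
y_pred = [0] * 100

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

print(accuracy(y_true, y_pred))  # 0.95 — yet it never finds a single positive
```

This is exactly the situation where metrics such as precision, recall, or F1 (below) are more informative than raw accuracy.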
F1 Score
The harmonic mean of precision and recall, providing a single metric that balances both. F1 scores range from 0 to 1, with 1 indicating perfect precision and recall.
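The harmonic-mean formula is F1 = 2 · (precision · recall) / (precision + recall). A minimal sketch over hypothetical binary labels, computing all three quantities from true/false positive and false negative counts:

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Example: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
p, r, f1 = precision_recall_f1(y_true, y_pred)
print(p, r, f1)  # precision and recall are both 2/3, so F1 is also 2/3
```

Because the harmonic mean is dominated by the smaller of the two values, F1 stays low whenever either precision or recall is poor, which is why it is preferred over a simple average.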