Human Evaluation
Using human judges to assess AI model quality along subjective dimensions that automated metrics cannot fully capture, such as helpfulness, coherence, creativity, and safety.
Why It Matters
Human evaluation remains the gold standard for assessing LLM quality: a model can score well on automated benchmarks yet still feel unhelpful or unsafe to actual users.
Example
Having 500 raters compare responses from Model A and Model B across 1,000 questions, rating each response for helpfulness, accuracy, and safety on a 5-point scale.
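A minimal sketch in Python of how such ratings might be aggregated; the rating tuples, dimension names, and the mean/win-rate summary here are illustrative assumptions, not a prescribed methodology:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical ratings: (rater, question, model, dimension, 1-5 score).
# In a real study these would come from a rating platform's export.
ratings = [
    ("r1", "q1", "A", "helpfulness", 4),
    ("r1", "q1", "B", "helpfulness", 3),
    ("r2", "q1", "A", "helpfulness", 5),
    ("r2", "q1", "B", "helpfulness", 4),
    ("r1", "q1", "A", "safety", 5),
    ("r1", "q1", "B", "safety", 5),
]

# Mean score per (model, dimension).
by_key = defaultdict(list)
for rater, question, model, dim, score in ratings:
    by_key[(model, dim)].append(score)

for (model, dim), scores in sorted(by_key.items()):
    print(f"Model {model} / {dim}: mean {mean(scores):.2f} (n={len(scores)})")

# Head-to-head win rate: wherever the same rater scored both models on the
# same question and dimension, count which response was rated higher.
paired = defaultdict(dict)
for rater, question, model, dim, score in ratings:
    paired[(rater, question, dim)][model] = score

wins = {"A": 0, "B": 0, "tie": 0}
for scores in paired.values():
    if {"A", "B"} <= scores.keys():
        if scores["A"] > scores["B"]:
            wins["A"] += 1
        elif scores["B"] > scores["A"]:
            wins["B"] += 1
        else:
            wins["tie"] += 1

print("Head-to-head:", wins)
```

A real study would also report inter-rater agreement and confidence intervals; this sketch shows only the basic aggregation.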
Think of it like...
Like restaurant reviews from actual diners versus food safety inspection scores — the numbers tell one story, but real user experience tells another.
Related Terms
Evaluation
The systematic process of measuring an AI model's performance, safety, and reliability using various metrics, benchmarks, and testing methodologies.
Benchmark
A standardized test or dataset used to evaluate and compare the performance of AI models. Benchmarks provide consistent metrics that allow fair comparisons between different approaches.
RLHF
Reinforcement Learning from Human Feedback — a technique used to align language models with human preferences. Human raters rank model outputs, and this feedback trains a reward model that guides further training.
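To make the reward-model step concrete, here is a minimal sketch of the pairwise (Bradley-Terry style) loss commonly used for this purpose: the reward model is trained so that the response human raters preferred scores higher than the one they rejected. The toy RewardModel and the random embedding inputs are illustrative assumptions, not any specific system's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy stand-in for a reward model: maps a response embedding to a scalar.
    In practice this head sits on top of a pretrained language model."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)  # one scalar reward per response

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Hypothetical batch: embeddings of the response raters preferred ("chosen")
# and the one they ranked lower ("rejected") for the same prompt.
chosen = torch.randn(8, 16)
rejected = torch.randn(8, 16)

# Pairwise loss: maximize log sigmoid(r_chosen - r_rejected), i.e. push the
# preferred response's reward above the rejected response's reward.
loss = -F.logsigmoid(model(chosen) - model(rejected)).mean()
loss.backward()
opt.step()
print(f"pairwise loss: {loss.item():.4f}")
```

The trained reward model then supplies the reward signal that guides the further reinforcement-learning fine-tuning of the language model.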