Machine Learning

Reward Modeling

Training a separate model to predict human preferences, which then serves as the reward signal for reinforcement learning. The reward model learns to score what humans consider "good" responses, so the policy can be optimized against that score.

Why It Matters

Reward modeling is the critical bridge between human judgment and AI optimization. A flawed reward model leads to an AI that optimizes for the wrong thing, a failure mode often called reward hacking.

Example

Training a reward model on 100,000 human comparisons of response pairs, then using it to score millions of model outputs during reinforcement learning from human feedback (RLHF).
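The comparison data above is typically turned into a training signal with a Bradley-Terry-style pairwise loss: the reward model should assign a higher score to the response the human preferred, and the loss is the negative log-probability of that preference. A minimal sketch of the loss computation (the scores and function names here are illustrative, not from any specific library):

```python
import math

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one human comparison.

    r_chosen / r_rejected are the reward model's scalar scores for the
    preferred and rejected responses. The loss is
    -log(sigmoid(r_chosen - r_rejected)): near zero when the model
    ranks the pair the way the human did, large when it is reversed.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Correct ranking with a wide margin -> small loss
print(round(pairwise_loss(4.0, 1.0), 4))  # 0.0486

# Reversed ranking -> large loss, pushing scores apart during training
print(round(pairwise_loss(1.0, 4.0), 4))  # 3.0486
```

Averaging this loss over all comparison pairs and backpropagating through the reward model is what "training on human comparisons" amounts to; the resulting scalar scores are then used directly as rewards during RLHF.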

Think of it like...

Like training a wine judge by having them learn from master sommeliers — they internalize the standards and can then evaluate wines independently.

Related Terms