Reward Modeling
Training a separate model to predict human preferences, which then serves as the reward signal for reinforcement learning. The reward model learns what humans consider "good" responses.
Why It Matters
Reward modeling is the critical bridge between human judgment and AI optimization. A flawed reward model leads to AI that optimizes for the wrong thing.
Example
Training a reward model on 100,000 human comparisons of response pairs, then using it to score millions of model outputs during RLHF training.
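The pairwise training described above can be sketched with a toy Bradley-Terry objective: push the model to score the human-chosen response above the rejected one. The linear scorer, synthetic feature vectors, dataset sizes, and learning rate below are illustrative assumptions; real reward models are fine-tuned language models trained on far larger preference datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a reward model: a linear scorer over fixed response
# features. The feature dimensionality and planted preference direction
# exist only so the sketch has something to learn.
dim = 8
true_w = rng.normal(size=dim)  # hidden "true" preference direction

def make_pairs(n):
    # Each pair: features of the human-chosen and human-rejected response.
    chosen = rng.normal(size=(n, dim)) + 0.5 * true_w
    rejected = rng.normal(size=(n, dim)) - 0.5 * true_w
    return chosen, rejected

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

chosen, rejected = make_pairs(500)
w = np.zeros(dim)
lr = 0.1

# Bradley-Terry objective: maximize log sigmoid(r(chosen) - r(rejected)),
# i.e. the scorer should rank the human-preferred response higher.
for _ in range(200):
    margin = (chosen - rejected) @ w
    grad = ((sigmoid(margin) - 1.0)[:, None] * (chosen - rejected)).mean(axis=0)
    w -= lr * grad

# On fresh pairs, the trained scorer ranks the preferred response higher
# most of the time.
test_chosen, test_rejected = make_pairs(200)
accuracy = ((test_chosen - test_rejected) @ w > 0).mean()
print(f"held-out pairwise accuracy: {accuracy:.2f}")
```

Once trained, such a scorer can assign a reward to any new output, which is what lets it score millions of responses during RLHF.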
Think of it like...
Like training a wine judge by having them learn from master sommeliers — they internalize the standards and can then evaluate wines independently.
Related Terms
Reward Model
A model trained to predict how good a response is based on human preferences. In RLHF, the reward model scores outputs to guide the language model toward responses humans prefer.
RLHF
Reinforcement Learning from Human Feedback — a technique used to align language models with human preferences. Human raters rank model outputs, and this feedback trains a reward model that guides further training.
Alignment
The challenge of ensuring AI systems behave in ways that match human values, intentions, and expectations. Alignment aims to make AI helpful, honest, and harmless.
Preference Optimization
Training techniques that directly optimize models based on human preference data, where humans indicate which of two model outputs they prefer.
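One widely used technique in this family is Direct Preference Optimization (DPO), whose per-pair loss can be computed straight from log-probabilities without a separate reward model. The log-probability values and beta below are made-up numbers for illustration only.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair.

    Inputs are sequence log-probabilities of the chosen and rejected
    responses under the policy being trained and under a frozen
    reference model; beta controls how far the policy may drift from
    the reference. All values here are illustrative.
    """
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: small when the policy puts
    # relatively more probability on the human-chosen response.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A policy that already favors the chosen response incurs a lower loss
# than one that favors the rejected response.
favoring_chosen = dpo_loss(-10.0, -14.0, -12.0, -12.0)
favoring_rejected = dpo_loss(-14.0, -10.0, -12.0, -12.0)
print(favoring_chosen < favoring_rejected)  # True
```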
Reinforcement Learning
A type of machine learning where an agent learns to make decisions by taking actions in an environment and receiving rewards or penalties. The agent aims to maximize cumulative reward over time through trial and error.
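The trial-and-error loop can be illustrated with tabular Q-learning on a tiny toy environment. The environment, state count, and hyperparameters below are invented for the sketch and have no connection to language-model training.

```python
import random

random.seed(0)

# Toy environment invented for this sketch: states 0..4 on a line.
# Action 1 moves right, action 0 moves left; reaching state 4 pays
# reward 1 and ends the episode. Everything else pays 0.
N_STATES, GOAL = 5, 4

def step(state, action):
    nxt = min(max(state + (1 if action == 1 else -1), 0), GOAL)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

# Tabular Q-learning: estimate the value of each (state, action) pair
# from trial and error, then act greedily on the estimates.
q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # illustrative hyperparameters

for _ in range(500):
    s = 0
    for _t in range(100):  # cap episode length
        if random.random() < epsilon:
            a = random.randrange(2)          # explore
        else:
            a = int(q[s][1] >= q[s][0])      # exploit current estimates
        s2, r, done = step(s, a)
        # Update toward the immediate reward plus discounted future value.
        q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
        s = s2
        if done:
            break

# The learned greedy policy moves right from every non-terminal state,
# maximizing cumulative discounted reward.
policy = [int(q[s][1] >= q[s][0]) for s in range(N_STATES - 1)]
print(policy)  # [1, 1, 1, 1]
```

In RLHF, the same reward-maximization principle applies, except the "environment" is text generation and the reward comes from the trained reward model rather than a hand-coded rule.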