Preference Optimization
Training techniques that directly optimize models based on human preference data, where humans indicate which of two model outputs they prefer.
Why It Matters
Preference optimization is how models learn to be helpful, honest, and harmless. It translates subjective human judgment into mathematical optimization.
Example
Showing raters two model responses to the same question, recording which one they prefer, then training the model to produce outputs more like the preferred ones.
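The collection step above can be sketched as a small data record. This is a minimal illustration, not a real annotation pipeline: the `PreferencePair` class and `collect_pair` helper are hypothetical names, and in practice the rater's judgment would come from a human-facing labeling tool.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    """One training record produced by a single rating decision."""
    prompt: str
    chosen: str    # the response the rater preferred
    rejected: str  # the response the rater did not prefer

def collect_pair(prompt: str, response_a: str, response_b: str,
                 rater_prefers_a: bool) -> PreferencePair:
    """Turn one pairwise judgment into a (chosen, rejected) record.

    rater_prefers_a stands in for the human decision; everything
    downstream (reward modeling, DPO) trains on records like this.
    """
    if rater_prefers_a:
        return PreferencePair(prompt, response_a, response_b)
    return PreferencePair(prompt, response_b, response_a)
```

Datasets of such pairs are the common input format shared by RLHF's reward-model stage and by DPO.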
Think of it like...
Like a cooking competition where judges taste two dishes and pick the better one — over many rounds, the chef learns to cook what judges prefer.
Related Terms
RLHF
Reinforcement Learning from Human Feedback — a technique for aligning language models with human preferences. Human raters rank or compare model outputs; this feedback trains a reward model, which then scores responses to guide reinforcement learning of the language model.
DPO
Direct Preference Optimization — a simpler alternative to RLHF that optimizes a language model directly on human preference data, without training a separate reward model. In practice it is often more stable and easier to implement than RLHF's two-stage pipeline.
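A minimal sketch of the DPO loss for a single preference pair, assuming the summed log-probabilities of each full response under the current policy and the frozen reference model have been computed elsewhere; the variable names and the default beta value are illustrative.

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each margin is the policy's log-probability of a response minus the
    reference model's, so the loss rewards moving probability mass toward
    the chosen response relative to a frozen reference.
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    diff = beta * (chosen_margin - rejected_margin)
    # -log sigmoid(diff): near zero when the policy strongly
    # favors the chosen response, large when it favors the rejected one
    return -math.log(1.0 / (1.0 + math.exp(-diff)))
```

Note the loss needs no reward model: the preference signal is expressed entirely through the two models' log-probabilities.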
Reward Model
A model trained to predict how good a response is based on human preferences. In RLHF, the reward model scores outputs to guide the language model toward responses humans prefer.
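A toy sketch of how such a model is fit from preference pairs, assuming a linear scorer over hand-made response features and the standard Bradley–Terry pairwise objective, -log sigmoid(s_preferred - s_rejected); real reward models use a neural network over tokens, so the linear form, features, and hyperparameters here are illustrative assumptions.

```python
import math

def score(w: list, feats: list) -> float:
    # linear "reward model": dot product of weights and response features
    return sum(wi * fi for wi, fi in zip(w, feats))

def train_reward_model(pairs, dims: int = 2, lr: float = 0.5,
                       epochs: int = 200) -> list:
    """Fit weights so preferred responses score above rejected ones.

    pairs: list of (preferred_features, rejected_features) tuples.
    Minimizes -log sigmoid(score(pref) - score(rej)) by gradient descent.
    """
    w = [0.0] * dims
    for _ in range(epochs):
        for pref, rej in pairs:
            # Bradley-Terry probability that the preferred response wins
            p = 1.0 / (1.0 + math.exp(-(score(w, pref) - score(w, rej))))
            # gradient step: push w toward (pref - rej), scaled by (1 - p)
            for i in range(dims):
                w[i] += lr * (1.0 - p) * (pref[i] - rej[i])
    return w

# toy data: the first feature correlates with being preferred
pairs = [([1.0, 0.2], [0.0, 0.9]),
         ([0.8, 0.1], [0.1, 0.5])]
w = train_reward_model(pairs)
```

After training, `score` can rank unseen responses, which is exactly the role the reward model plays inside RLHF.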
Alignment
The challenge of ensuring AI systems behave in ways that match human values, intentions, and expectations. Alignment aims to make AI helpful, honest, and harmless.
Human Evaluation
Using human judges to assess AI model quality on subjective dimensions like helpfulness, coherence, creativity, and safety that automated metrics cannot fully capture.