RLHF (reinforcement learning from human feedback) is powerful but complex: it requires training a separate reward model, then running PPO (proximal policy optimization), an RL algorithm that is notoriously sensitive to hyperparameters, to update the LLM policy. DPO (Direct Preference Optimization) targets the same goal, making the model prefer 'chosen' responses over 'rejected' ones, without an explicit reward model and without an RL loop. It reformulates preference learning as a supervised classification loss computed directly on the policy's log-probabilities. In practice, DPO is simpler to implement and generally more stable to train, and the original paper (Rafailov et al. 2023) reports results that match or exceed PPO-based RLHF on sentiment control, summarization, and single-turn dialogue.
DPO sidesteps the reward model by observing that the optimal policy under a KL-constrained RL objective has a closed form: it is proportional to the reference policy scaled by exp(reward/β). Substituting that back into the preference probability yields a loss you can compute directly on (prompt, chosen, rejected) triples. The formula -log σ(β · [(log π(c) − log π(r)) − (log π_ref(c) − log π_ref(r))]), where c and r are the chosen and rejected responses for the same prompt, says: push the policy to prefer chosen over rejected by a margin larger than the reference already does, with β controlling how strongly the policy is held to the reference.
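The substitution described above can be written out in a few lines. This is a sketch of the standard argument rather than a full proof. It assumes the Bradley-Terry preference model used in Rafailov et al. 2023; the symbols x (prompt), y_c and y_r (chosen and rejected responses, written c and r in the formula above), r(x, y) (reward), and Z(x) (normalizer) are notation introduced here for illustration.

```latex
% Closed-form optimum of the KL-constrained objective
% max_pi  E[r(x,y)] - beta * KL(pi || pi_ref):
\begin{align*}
\pi^*(y \mid x) &= \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,
                   \exp\!\big(r(x,y)/\beta\big) \\[4pt]
% Solve for the reward in terms of the policy:
r(x,y) &= \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
          + \beta \log Z(x) \\[4pt]
% Bradley-Terry preference probability; Z(x) cancels in the difference:
p(y_c \succ y_r \mid x) &= \sigma\big(r(x,y_c) - r(x,y_r)\big) \\
 &= \sigma\!\Big(\beta\Big[
      \log \frac{\pi^*(y_c \mid x)}{\pi_{\mathrm{ref}}(y_c \mid x)}
    - \log \frac{\pi^*(y_r \mid x)}{\pi_{\mathrm{ref}}(y_r \mid x)}
    \Big]\Big) \\[4pt]
% Negative log-likelihood with the trainable policy pi_theta in place of pi^*:
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta) &= -\,\mathbb{E}_{(x,\,y_c,\,y_r)}
    \Big[\log \sigma\Big(\beta\Big[
      \log \frac{\pi_\theta(y_c \mid x)}{\pi_{\mathrm{ref}}(y_c \mid x)}
    - \log \frac{\pi_\theta(y_r \mid x)}{\pi_{\mathrm{ref}}(y_r \mid x)}
    \Big]\Big)\Big]
\end{align*}
```

Regrouping the four log terms inside the bracket gives (log π(c) − log π(r)) − (log π_ref(c) − log π_ref(r)), the form used in the formula above. Because Z(x) depends only on the prompt, it drops out of the reward difference, which is why no reward model or normalizer has to be estimated.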
import torch
import torch.nn.functional as F
def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp,
             beta=0.1):
    """
    DPO loss (Rafailov et al. 2023).
    policy_*_logp : log-prob of chosen/rejected response under the model being trained
    ref_*_logp : log-prob under the frozen reference model (the SFT checkpoint)
    beta : KL penalty coefficient; higher = stay closer to reference
    """
    pi_logratios = policy_chosen_logp - policy_rejected_logp
    ref_logratios = ref_chosen_logp - ref_rejected_logp
    logits = beta * (pi_logratios - ref_logratios)
    loss = -F.logsigmoid(logits).mean()
    return loss
# Toy example: model prefers chosen (as it should after training)
pol_chosen = torch.tensor([-2.0, -1.5, -1.8]) # log-probs per example
pol_rejected = torch.tensor([-3.5, -3.0, -3.2])
ref_chosen = torch.tensor([-2.5, -2.0, -2.3]) # reference model
ref_rejected = torch.tensor([-3.0, -2.8, -3.0])
loss_good = dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected)
print(f"Loss (policy prefers chosen): {loss_good:.4f}") # low
# Bad model: prefers rejected
pol_chosen_bad = torch.tensor([-3.5, -3.0, -3.2])
pol_rejected_bad = torch.tensor([-2.0, -1.5, -1.8])
loss_bad = dpo_loss(pol_chosen_bad, pol_rejected_bad, ref_chosen, ref_rejected)
print(f"Loss (policy prefers rejected): {loss_bad:.4f}") # high
Run it: python3 main.py
Change beta=0.1 to beta=1.0. How does the loss change for the same log-probs? High beta means the model is heavily penalized for diverging from the reference, so it is more conservative. Low beta gives more freedom to change behavior.
Set pol_chosen and pol_rejected to the same values as ref_chosen and ref_rejected. What is the loss? (It should be log(2) ≈ 0.693: the policy's log-ratio equals the reference's, so the logit is 0 and σ(0) = 0.5. This is the starting loss when the policy is initialized from the reference.)
Find HuggingFaceH4/ultrafeedback_binarized on HuggingFace and print one example. What does chosen vs rejected look like in practice?
Use these three prompts in order. Each builds on the one before.
In one paragraph, explain what DPO optimizes: given a preferred response and a rejected response for the same prompt, what does the loss push the model to do? Why is no reward model needed?
Walk me through the DPO loss derivation intuitively: what is the `pi_logratios - ref_logratios` term measuring? Why does subtracting the reference log-ratio prevent the model from collapsing to always assigning high probability to short responses?
I have a dataset of 50K preference pairs (chosen/rejected) collected from human raters. Walk me through the full DPO pipeline: (1) which checkpoint to use as the reference model, (2) whether to run SFT before DPO or not, (3) key hyperparameters (beta, learning rate, batch size), (4) how to evaluate whether DPO improved the model, and (5) a failure mode to watch for (reward hacking, length bias, overfitting to annotator preferences).