DPO — direct preference optimization without a reward model

medium

Why this matters

RLHF (reinforcement learning from human feedback) is powerful but complex: it requires training a separate reward model, then running PPO (proximal policy optimization), an RL algorithm that is sensitive to hyperparameters and often unstable in practice, to update the LLM policy. DPO (Direct Preference Optimization) targets the same goal, making the model prefer 'chosen' responses over 'rejected' ones, without an explicit reward model and without an RL loop. It reformulates preference learning as a binary classification loss computed directly on the policy's log-probabilities, with a frozen reference model as the baseline. In practice, DPO is simpler to implement and usually more stable to train, and it is often reported to match PPO-based RLHF on instruction-following benchmarks, though results vary with the data and the setup.

Demo

DPO sidesteps the reward model by observing that the optimal policy under a KL-constrained RL objective has a closed form: it is proportional to the reference policy scaled by exp(r/β), where r is the reward. Solving that expression for r and substituting it into the Bradley-Terry preference probability yields a loss you can compute directly on (prompt, chosen, rejected) triples. The formula -log σ(β · [(log π(y_c) − log π(y_r)) − (log π_ref(y_c) − log π_ref(y_r))]), with y_c the chosen response and y_r the rejected one, says: push the policy to prefer chosen over rejected by a wider margin than the reference already does. β is the KL coefficient from the original objective: larger values keep the optimum closer to the reference.
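
The substitution described above can be written out in a few lines. This is a sketch of the standard derivation from Rafailov et al. (2023), using notation that the paragraph above does not define: x is the prompt, y_c and y_r are the chosen and rejected responses, and Z(x) is a normalizing constant. The final line is the same formula as in the text, with the log terms regrouped per response.

```latex
% KL-constrained objective (r is the reward, \beta the KL coefficient)
\max_{\pi}\; \mathbb{E}_{x}\Big[ \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \Big]

% Closed-form optimum; Z(x) depends only on the prompt
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big( r(x, y) / \beta \big)

% Solve for the reward
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Bradley-Terry preference model: only reward differences matter, so Z(x) cancels
p(y_c \succ y_r \mid x) = \sigma\big( r(x, y_c) - r(x, y_r) \big)

% DPO loss: negative log-likelihood of the observed preferences
\mathcal{L}_{\mathrm{DPO}}(\pi) = -\,\mathbb{E}_{(x,\, y_c,\, y_r)}\Big[ \log \sigma\Big( \beta \Big[
  \log \frac{\pi(y_c \mid x)}{\pi_{\mathrm{ref}}(y_c \mid x)}
  - \log \frac{\pi(y_r \mid x)}{\pi_{\mathrm{ref}}(y_r \mid x)} \Big] \Big) \Big]
```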

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp,    ref_rejected_logp,
             beta=0.1):
    """
    DPO loss (Rafailov et al. 2023).
    policy_*_logp : log-prob of chosen/rejected response under the model being trained
    ref_*_logp    : log-prob under the frozen reference model (the SFT checkpoint)
    beta          : KL penalty coefficient — higher = stay closer to reference
    """
    pi_logratios  = policy_chosen_logp  - policy_rejected_logp
    ref_logratios = ref_chosen_logp     - ref_rejected_logp
    logits = beta * (pi_logratios - ref_logratios)
    loss   = -F.logsigmoid(logits).mean()
    return loss

# Toy example: model prefers chosen (as it should after training)
pol_chosen   = torch.tensor([-2.0, -1.5, -1.8])   # log-probs per example
pol_rejected = torch.tensor([-3.5, -3.0, -3.2])
ref_chosen   = torch.tensor([-2.5, -2.0, -2.3])   # reference model
ref_rejected = torch.tensor([-3.0, -2.8, -3.0])

loss_good = dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected)
print(f"Loss (policy prefers chosen): {loss_good:.4f}")   # below log(2) ≈ 0.693, the loss when policy == reference

# Bad model: prefers rejected
pol_chosen_bad   = torch.tensor([-3.5, -3.0, -3.2])
pol_rejected_bad = torch.tensor([-2.0, -1.5, -1.8])

loss_bad = dpo_loss(pol_chosen_bad, pol_rejected_bad, ref_chosen, ref_rejected)
print(f"Loss (policy prefers rejected): {loss_bad:.4f}")  # above log(2) ≈ 0.693

Run: python3 main.py
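
The demo feeds dpo_loss hand-written log-probs. In a real training loop those numbers would typically be the sum of per-token log-probabilities over the response tokens only. The sketch below shows one way to compute that. The helper name sequence_logp, the tensor shapes, and the assumption that the logits are already aligned with the labels (position t scores labels[t]) are illustrative choices, not part of the original demo.

```python
import math
import torch
import torch.nn.functional as F

def sequence_logp(logits, labels, response_mask):
    """
    Hypothetical helper: log-prob of each response, summed over its tokens.
    logits        : (batch, seq_len, vocab), assumed already shifted so that
                    logits[:, t] scores the token labels[:, t]
    labels        : (batch, seq_len) token ids (int64)
    response_mask : (batch, seq_len) 1.0 on response tokens, 0.0 on prompt/padding
    """
    logp = F.log_softmax(logits, dim=-1)                             # normalize over the vocabulary
    token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)   # log-prob of each label token
    return (token_logp * response_mask).sum(dim=-1)                  # shape (batch,)

# Sanity check with uniform logits: every token has log-prob -log(vocab)
vocab = 8
logits = torch.zeros(2, 5, vocab)
labels = torch.randint(0, vocab, (2, 5))
mask = torch.tensor([[0., 0., 1., 1., 1.],    # 3 response tokens
                     [0., 1., 1., 0., 0.]])   # 2 response tokens
seq_logp = sequence_logp(logits, labels, mask)
expected = torch.tensor([-3.0, -2.0]) * math.log(vocab)
print(seq_logp, expected)
```

Running the same helper on the policy and on the frozen reference, for both the chosen and the rejected response, would give the four inputs that dpo_loss expects.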

Try it yourself

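Change beta=0.1 to beta=1.0. How does the loss change for the same logprobs? Note that in this toy example the log-probs are fixed, so beta only rescales the margin inside the sigmoid. During training, beta plays the role of the KL penalty: with high beta the model is heavily penalized for diverging from the reference, so it stays more conservative. Low beta gives more freedom to change behavior.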
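
A small sweep makes the effect of beta visible. This sketch restates the demo's loss in compact form so it runs on its own, and reuses the demo's toy log-probs. The extra value 0.5 and the results dictionary are additions for illustration.

```python
import math
import torch
import torch.nn.functional as F

def dpo_loss(pc, pr, rc, rr, beta):
    # Same formula as the demo, restated so this snippet is self-contained
    return -F.logsigmoid(beta * ((pc - pr) - (rc - rr))).mean()

ref_chosen   = torch.tensor([-2.5, -2.0, -2.3])
ref_rejected = torch.tensor([-3.0, -2.8, -3.0])
good = (torch.tensor([-2.0, -1.5, -1.8]), torch.tensor([-3.5, -3.0, -3.2]))
bad  = (good[1], good[0])   # swap the chosen and rejected log-probs

results = {}
for beta in (0.1, 0.5, 1.0):
    loss_good = dpo_loss(*good, ref_chosen, ref_rejected, beta).item()
    loss_bad  = dpo_loss(*bad,  ref_chosen, ref_rejected, beta).item()
    results[beta] = (loss_good, loss_bad)
    print(f"beta={beta}: good={loss_good:.4f}  bad={loss_bad:.4f}")

print(f"log(2) = {math.log(2):.4f}")   # loss when the margin is exactly zero
```

As beta grows, the "good" loss should move further below log(2) and the "bad" loss further above it.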

Set pol_chosen and pol_rejected to the same values as ref_chosen and ref_rejected. What is the loss? (It should be log(2) ≈ 0.693: the policy's margin equals the reference's, so the sigmoid argument is 0 and σ(0) = 0.5. This is the starting loss when the policy is initialized from the reference.)
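
The log(2) starting point can be checked in a few lines. This is a minimal sketch that inlines the demo's formula rather than calling dpo_loss; the variable names margin and loss_at_init are introduced here for illustration.

```python
import math
import torch
import torch.nn.functional as F

ref_chosen   = torch.tensor([-2.5, -2.0, -2.3])
ref_rejected = torch.tensor([-3.0, -2.8, -3.0])

# Policy identical to the reference, as at the start of DPO training
pol_chosen, pol_rejected = ref_chosen.clone(), ref_rejected.clone()

beta = 0.1
margin = (pol_chosen - pol_rejected) - (ref_chosen - ref_rejected)   # all zeros
loss_at_init = -F.logsigmoid(beta * margin).mean().item()
print(f"{loss_at_init:.4f} vs log(2) = {math.log(2):.4f}")
```

Because the margin is zero, the result does not depend on beta.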

Look up the DPO dataset format: each record pairs a preferred (chosen) completion with a rejected completion for the same prompt. Find HuggingFaceH4/ultrafeedback_binarized on the Hugging Face Hub and print one example. What do chosen and rejected look like in practice?

Research SimPO (Simple Preference Optimization) and its key difference from DPO: it removes the reference model entirely. What does it use instead? When would you choose SimPO over DPO?

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain what DPO optimizes: given a preferred response and a rejected response for the same prompt, what does the loss push the model to do? Why is no reward model needed?

2. Why it works (the mechanism)

Walk me through the DPO loss derivation intuitively: what is the `pi_logratios - ref_logratios` term measuring? Why does subtracting the reference log-ratio keep the loss from simply rewarding whichever response already has the higher raw log-probability (for example, the shorter one)?

3. Advanced — application & what's next

I have a dataset of 50K preference pairs (chosen/rejected) collected from human raters. Walk me through the full DPO pipeline: (1) which checkpoint to use as the reference model, (2) whether to run SFT before DPO or not, (3) key hyperparameters (beta, learning rate, batch size), (4) how to evaluate whether DPO improved the model, and (5) a failure mode to watch for (reward hacking, length bias, overfitting to annotator preferences).