Alignment Methods: RLHF vs DPO
How do we teach LLMs to be helpful, harmless, and honest? Two approaches dominate: RLHF and DPO.
Post-training alignment is what transforms a base LLM into a useful assistant. This post explores the two main techniques and their trade-offs.
The Alignment Problem
Base LLM (after pre-training):
- Completes text based on internet training data
- No notion of "helpful" vs "harmful"
- Might continue toxic prompts or refuse reasonable requests

Aligned LLM (after post-training):
- Follows instructions reliably
- Refuses harmful requests
- Provides helpful, safe responses
How do we get there? Enter post-training alignment.
Three-Stage Pipeline
Modern LLMs follow a three-stage process:
1. Pre-training (CLM/MLM)
↓
2. Supervised Fine-Tuning (SFT)
↓
3. Preference Alignment (RLHF or DPO)
Let's explore each stage.
Stage 1: Supervised Fine-Tuning (SFT)
Goal: Teach the model to follow instructions.
Data: Human-written (prompt, response) pairs:
{
"prompt": "What is the capital of France?",
"response": "The capital of France is Paris."
}
Training: Standard next-token prediction, but only on the response:
import torch.nn.functional as F

class SupervisedFineTuner:
    def compute_loss(self, model, input_ids, labels):
        """
        Only compute loss on response tokens (ignore the prompt).
        Input:  [PROMPT] What is the capital of France? [RESPONSE] Paris.
        Labels: every prompt position is set to -100; response positions
                keep their token IDs.
        """
        logits = model(input_ids)
        # Shift so each position predicts the NEXT token
        shift_logits = logits[:, :-1, :]
        shift_labels = labels[:, 1:]
        # Prompt tokens are masked with -100 (ignore_index), so they
        # contribute nothing to the loss
        loss = F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_labels.reshape(-1),
            ignore_index=-100,
        )
        return loss
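The label masking itself is a one-liner. Here is a minimal sketch; the token IDs and the prompt length of 4 are made up for illustration:

```python
import torch

def build_sft_labels(input_ids, prompt_len, ignore_index=-100):
    """Clone input_ids as labels and mask the prompt positions so
    cross_entropy skips them via ignore_index."""
    labels = input_ids.clone()
    labels[:prompt_len] = ignore_index
    return labels

# Hypothetical token IDs: 4 prompt tokens followed by 3 response tokens
input_ids = torch.tensor([101, 2054, 2003, 102, 3000, 1012, 102])
labels = build_sft_labels(input_ids, prompt_len=4)
# labels -> [-100, -100, -100, -100, 3000, 1012, 102]
```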
Why mask the prompt?
- We already know the prompt (it's given by the user)
- We want the model to learn responses, not repeat prompts
Limitation: SFT teaches what to say, but not how to rank responses by quality.
Stage 2: Preference Alignment
After SFT, we have a model that can follow instructions. But how do we make it better?
Key Insight: It's easier to compare two responses than write a perfect one.
Human Preference Data:
{
"prompt": "Explain photosynthesis",
"chosen": "Photosynthesis is the process where plants convert light...",
"rejected": "idk google it"
}
Two approaches dominate: RLHF and DPO.
RLHF: Reinforcement Learning from Human Feedback
Used by: GPT-4, Claude, early ChatGPT
RLHF has three steps:
Step 1: Train a Reward Model
Build a model that predicts human preferences:
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, base_model, d_model):
        super().__init__()
        self.base_model = base_model          # Frozen SFT model
        self.reward_head = nn.Linear(d_model, 1)

    def forward(self, input_ids, attention_mask):
        """
        Returns: scalar reward for the entire sequence
        """
        # Get hidden states from the base model
        hidden_states = self.base_model(input_ids, attention_mask)
        # Take the hidden state of the last non-padding token
        last_token_idx = attention_mask.sum(dim=1) - 1
        last_hidden = hidden_states[torch.arange(input_ids.size(0)), last_token_idx]
        # Project to a scalar reward
        reward = self.reward_head(last_hidden).squeeze(-1)
        return reward
Training Objective (Bradley-Terry Model):
def compute_reward_loss(self, chosen_ids, rejected_ids):
    """
    Train the reward model to prefer chosen over rejected.
    Loss = -log σ(r_chosen - r_rejected)
    """
    r_chosen = self.reward_model(chosen_ids)
    r_rejected = self.reward_model(rejected_ids)
    # The reward model should predict r_chosen > r_rejected
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    return loss
Bradley-Terry Model: P(chosen > rejected) = σ(r_chosen - r_rejected)
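As a quick numeric check of the Bradley-Terry formula: equal rewards give a coin flip, and a 2-point reward gap means the model predicts the chosen response wins about 88% of the time.

```python
import math

def bradley_terry_prob(r_chosen, r_rejected):
    """P(chosen > rejected) = sigmoid(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))

p_close = bradley_terry_prob(1.0, 1.0)  # equal rewards -> 0.5
p_clear = bradley_terry_prob(2.0, 0.0)  # 2-point gap   -> ~0.88
```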
Step 2: Optimize Policy with PPO
Use the reward model to fine-tune the SFT model:
class RLHFTrainer:
    def __init__(self, policy_model, reward_model, ref_model, beta=0.01):
        self.policy = policy_model        # Model being trained
        self.reward_model = reward_model  # Frozen reward predictor
        self.ref_model = ref_model        # Frozen SFT model (reference)
        self.beta = beta                  # KL penalty coefficient

    def compute_rl_loss(self, prompts):
        """
        Simplified PPO-style objective with KL penalty (full PPO adds
        clipped importance ratios and a value function):
        reward = r(x, y) - β * KL(π_θ || π_ref)
        """
        # Generate responses from the current policy
        responses = self.policy.generate(prompts)
        # Score them with the frozen reward model
        rewards = self.reward_model(prompts + responses)
        # Estimate KL divergence from the reference model
        policy_logprobs = self.policy.log_prob(prompts, responses)
        ref_logprobs = self.ref_model.log_prob(prompts, responses)
        kl_penalty = (policy_logprobs - ref_logprobs).mean()
        # Combined objective (maximize reward, penalize drift)
        rl_loss = -(rewards.mean() - self.beta * kl_penalty)
        return rl_loss
Why a KL penalty?
- Prevents the policy from drifting too far from the SFT model
- Without it, the policy can exploit reward model flaws and generate nonsense
- β trades off reward maximization against staying close to the reference
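To make the trade-off concrete, here is a toy per-sample version of the shaped reward; the numbers are invented, and real implementations typically apply the penalty per token rather than per sequence:

```python
def shaped_reward(reward, policy_logprob, ref_logprob, beta=0.05):
    """RLHF training signal: raw reward minus a KL penalty that grows
    as the policy assigns its samples much higher log-probability
    than the reference model does."""
    kl_estimate = policy_logprob - ref_logprob
    return reward - beta * kl_estimate

# A raw reward of 2.0 is halved when the policy has drifted by
# 20 nats of log-probability from the reference (with beta = 0.05).
r = shaped_reward(reward=2.0, policy_logprob=-10.0, ref_logprob=-30.0)
```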
Step 3: Iterate
Collect more preference data from the updated policy, retrain reward model, repeat.
RLHF Challenges
❌ Complexity: Three models (policy, reward, reference); RL training is tricky
❌ Instability: PPO is sensitive to hyperparameters
❌ Reward Hacking: The policy can exploit reward model weaknesses
❌ Slow: Requires sampling from the policy during training
❌ Memory: Three models must be kept in memory
DPO: Direct Preference Optimization
Used by: Zephyr, Tulu 2, and many modern open-source models (LLaMA 2, by contrast, was aligned with RLHF)
Key Insight: We can skip the reward model and optimize preferences directly!
The DPO Trick
RLHF objective:
max E[r(x,y)] - β * KL(π_θ || π_ref)
DPO insight: This has a closed-form optimal solution:
π*(y|x) ∝ π_ref(y|x) * exp(r(x,y) / β)
Rearranging (the prompt-dependent constant β * log Z(x) from the normalizer can be dropped, because it cancels whenever two responses to the same prompt are compared):

r(x,y) = β * log(π*(y|x) / π_ref(y|x)) + β * log Z(x)

Implication: The reward is implicitly defined by the policy ratio!
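Substituting this implicit reward into the Bradley-Terry model is what yields the DPO loss; the normalizing constant hidden in the ∝ above is the same for both responses to a prompt, so it cancels in the difference:

```latex
P(y_w \succ y_l \mid x)
  = \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr)
  = \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
```

Here y_w is the chosen (winning) response and y_l the rejected one.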
DPO Loss
import torch
import torch.nn.functional as F

class DirectPreferenceOptimization:
    def __init__(self, policy_model, ref_model, beta=0.1):
        self.policy = policy_model
        self.ref_model = ref_model  # Frozen SFT model
        self.beta = beta

    def compute_loss(self, prompts, chosen, rejected):
        """
        DPO loss: push the policy's likelihood ratio up for chosen
        and down for rejected responses.
        loss = -E[log σ(β * (log(π_θ(chosen)/π_ref(chosen))
                           - log(π_θ(rejected)/π_ref(rejected))))]
        """
        # Log probs under the policy (gradients flow here)
        policy_chosen_logps = self.policy.log_prob(prompts, chosen)
        policy_rejected_logps = self.policy.log_prob(prompts, rejected)
        # Log probs under the frozen reference
        with torch.no_grad():
            ref_chosen_logps = self.ref_model.log_prob(prompts, chosen)
            ref_rejected_logps = self.ref_model.log_prob(prompts, rejected)
        # Log ratios (implicit rewards)
        chosen_log_ratio = policy_chosen_logps - ref_chosen_logps
        rejected_log_ratio = policy_rejected_logps - ref_rejected_logps
        # DPO loss
        logits = self.beta * (chosen_log_ratio - rejected_log_ratio)
        loss = -F.logsigmoid(logits).mean()
        return loss
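A tensor-only version of the loss makes two sanity checks easy: at initialization (policy equals reference) the loss is exactly log 2, and it falls as the policy up-weights chosen responses. This sketch assumes per-sequence log-probabilities are already computed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over batches of per-sequence log-probabilities."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

# Untrained policy == reference: both ratios are zero, loss = log 2
zeros = torch.zeros(4)
init_loss = dpo_loss(zeros, zeros, zeros, zeros)

# Policy that up-weights chosen and down-weights rejected: lower loss
better_loss = dpo_loss(zeros + 1.0, zeros - 1.0, zeros, zeros)
```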
DPO Intuition
What's happening?
1. Increase the probability of chosen responses
2. Decrease the probability of rejected responses
3. Stay close to the reference model (implicit KL penalty via the log ratio)
No reward model needed! The preference signal is baked directly into the loss.
RLHF vs DPO: Head-to-Head
| Aspect | RLHF | DPO |
|---|---|---|
| Models Needed | 3 (policy, reward, ref) | 2 (policy, ref) |
| Training | RL (PPO) | Supervised (standard backprop) |
| Stability | Tricky (RL instability) | Stable (classification-like) |
| Memory | High (3 models) | Medium (2 models) |
| Speed | Slow (sampling required) | Fast (offline data) |
| Reward Hacking | Possible | Less likely |
| Hyperparameters | Many (PPO params) | Few (β only) |
| Implementation | Complex | Simple |
| Performance | Slightly better (when tuned) | Competitive |
Which Method to Use?
Use RLHF if:
- You have extensive RL expertise
- You can afford the compute (3 models)
- You need iterative improvement (online learning)
- You're OpenAI/Anthropic with massive resources
Use DPO if:
- You want simplicity and stability
- You have limited compute
- You're training open-source models
- You prefer standard supervised learning
Modern trend: DPO is winning for open-source models due to simplicity.
Implementation Tips
Data Quality Matters
Good preference data:
{
"prompt": "Write a poem about summer",
"chosen": "Golden rays dance on waves so bright...",
"rejected": "summer is hot lol"
}
Bad preference data (too similar):
{
"prompt": "What is 2+2?",
"chosen": "2+2 equals 4.",
"rejected": "2+2 is 4."
}
Rule: Chosen and rejected should be clearly different in quality.
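One cheap way to enforce this rule is a similarity filter over each pair before training. The SequenceMatcher heuristic and the 0.6 threshold below are illustrative guesses, not a standard recipe:

```python
from difflib import SequenceMatcher

def is_informative_pair(chosen, rejected, max_similarity=0.6):
    """Heuristic filter: drop preference pairs whose two responses are
    near-identical, since they carry almost no preference signal.
    The threshold is a guess to tune per dataset."""
    ratio = SequenceMatcher(None, chosen.lower(), rejected.lower()).ratio()
    return ratio < max_similarity

# Clearly different quality -> keep
keep = is_informative_pair(
    "Golden rays dance on waves so bright...", "summer is hot lol")
# Near-identical answers -> drop
drop = is_informative_pair("2+2 equals 4.", "2+2 is 4.")
```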
Hyperparameter Tuning
β (KL penalty coefficient):
- Too small (β=0.001): the model drifts too far from the reference and can degenerate
- Too large (β=1.0): the model barely changes and stays close to SFT
- Sweet spot: 0.01-0.1 for most tasks

Learning rate:
- RLHF: 1e-6 to 1e-5 (RL is sensitive)
- DPO: 1e-6 to 5e-6 (stable)
Evaluation
Don't just trust loss! Evaluate with:
1. Win Rate:
def compute_win_rate(model_a, model_b, test_prompts):
    """Human eval: how often does model_a beat model_b?"""
    wins = 0
    for prompt in test_prompts:
        response_a = model_a.generate(prompt)
        response_b = model_b.generate(prompt)
        # human_prefers is a stand-in for a human (or LLM-judge) comparison
        if human_prefers(response_a, response_b):
            wins += 1
    return wins / len(test_prompts)
2. Safety Benchmarks:
- TruthfulQA (factuality)
- BBQ (bias)
- RealToxicityPrompts (toxicity)

3. Capability Benchmarks:
- MMLU (knowledge)
- HumanEval (coding)
- GSM8K (math)
Advanced: Constitutional AI
Anthropic's approach: Use AI feedback instead of human feedback.
Process:
1. Generate responses
2. An AI critiques them against a "constitution" (a set of principles)
3. The AI revises the responses
4. Train a reward model on the AI's preferences

Advantages:
- Scalable (no human labeling bottleneck)
- Consistent (the AI applies principles uniformly)
- Iterative (easy to update the constitution)
Implementation:
def constitutional_ai_revision(self, response, principle):
    """
    Critique and revise a response based on a constitutional principle,
    e.g. "Responses should be harmless and avoid stereotypes".
    """
    # Generate a critique
    critique_prompt = f"""
    Response: {response}
    Principle: {principle}
    Does this response violate the principle? If so, how?
    """
    critique = self.model.generate(critique_prompt)
    # Generate a revision
    revision_prompt = f"""
    Original: {response}
    Critique: {critique}
    Principle: {principle}
    Revise the response to align with the principle:
    """
    revision = self.model.generate(revision_prompt)
    return revision
See my Constitutional AI project for full implementation.
Production Considerations
Compute Costs:
- RLHF: 2-3× more expensive than SFT (three models plus online sampling)
- DPO: similar to SFT (offline training)

Data Requirements:
- Minimum: 10k preference pairs
- Good: 100k preference pairs
- SOTA: 1M+ preference pairs (GPT-4 scale)

Iteration Speed:
- SFT: 1-2 days
- RLHF: 1-2 weeks (reward model + PPO tuning)
- DPO: 2-3 days
Key Takeaways
- Three stages: Pre-training → SFT → Preference Alignment
- SFT: Teach instruction following with (prompt, response) pairs
- RLHF: Train reward model, optimize with PPO, complex but effective
- DPO: Direct optimization, simpler and more stable
- KL penalty: Prevents drift from SFT model
- β parameter: Controls how strongly the policy is tethered to the reference model
- Modern trend: DPO winning for open-source due to simplicity
- Constitutional AI: Scale alignment with AI feedback
Further Reading
- Training language models to follow instructions with human feedback (InstructGPT)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- Constitutional AI: Harmlessness from AI Feedback
- GitHub: Post-Training Implementation
- GitHub: Constitutional AI Implementation
Part of my AI Research Portfolio - implementing alignment techniques from scratch to understand how LLMs become helpful assistants.