4.8 Post-Training and Preference Optimization


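Pretraining teaches a model language, knowledge, and pattern completion. Post-training makes it usable: it follows instructions, respects formats, avoids obviously bad behavior, and prefers the responses humans like among multiple candidates.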

This section does not treat RLHF and DPO as buzzwords. It breaks them down by data format, objective function, and implementation detail.

The Alignment Pipeline

A classic LLM post-training pipeline:

  1. SFT: supervised fine-tuning on demonstration data;
  2. Reward modeling: train a reward model on preference pairs;
  3. RLHF/PPO: optimize the policy against the reward model, with a KL constraint that keeps it close to the reference model;
  4. DPO/IPO/ORPO variants: optimize the policy directly from preference pairs, avoiding an explicit RL loop.

NoteDefinition: Policy

In language-model post-training, the policy \(\pi_\theta(y\mid x)\) is the conditional distribution over responses \(y\) given prompt \(x\).

Here the policy is not a separate type of model; it is usually just the next-token distribution of a decoder-only LM.

Supervised Fine-Tuning

SFT data:

prompt x:  user instruction
answer y:  desired assistant response

After serialization:

<user> x
<assistant> y

The training objective usually computes the loss only on assistant tokens:

\[ \mathcal{L}_{\text{SFT}} = - \sum_{t\in \mathcal{A}} \log \pi_\theta(y_t\mid x,y_{<t}), \]

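where \(\mathcal{A}\) is the set of assistant answer token positions.

Implementation detail:

labels = input_ids.clone()
labels[prompt_mask == 1] = -100      # no loss on prompt/template tokens
labels[attention_mask == 0] = -100   # no loss on padding
# -100 is the default ignore index of PyTorch's cross-entropy loss
loss = model(input_ids, attention_mask=attention_mask, labels=labels).loss

WarningPitfall: Training on Prompt Tokens Changes the Task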

If prompt tokens contribute loss in SFT, the model is trained to model user messages and templates, not only to produce assistant responses.

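SFT is stable, cheap, and easy to debug. Its weakness is that it can only imitate demonstrations; it does not directly compare the relative quality of multiple answers.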

SFT Collator and Packing

The key implementation work in SFT is not in the model class but in the collator. A chat sample usually contains:

system tokens
user tokens
assistant tokens
eos / end-of-message tokens
padding tokens

Only the assistant completion contributes to the loss. A more complete collator looks like this:

def build_sft_labels(input_ids, role_ids, attention_mask):
    labels = input_ids.clone()

    is_assistant = role_ids.eq(ASSISTANT_ROLE_ID)
    is_visible = attention_mask.eq(1)
    trainable = is_assistant & is_visible

    labels[~trainable] = -100
    return labels

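role_ids here does not necessarily exist in the tokenizer output. In practice, the chat template renderer usually records span offsets, which are then mapped back to token positions. The most error-prone part is the template boundary: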

<assistant>
answer tokens
<eos>

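Whether <eos> counts toward the assistant loss is a training decision. If EOS is never trained, the model may not learn to stop; if the end token of a user turn is also included in the loss, the model may learn to end at the wrong position.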
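
The boundary decision can be made explicit in a small rendering helper. The sketch below is illustrative only: `render_with_labels`, the token ids, and the single shared end token are assumptions, and a real chat template would also emit role header tokens. It labels assistant content, optionally labels the assistant's end token, and never labels a user turn's end token.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def render_with_labels(messages, eos_id, train_on_eos=True):
    """messages: list of (role, token_ids). Returns (input_ids, labels)."""
    input_ids, labels = [], []
    for role, token_ids in messages:
        is_assistant = role == "assistant"
        input_ids.extend(token_ids)
        # Only assistant content tokens carry a label.
        labels.extend(token_ids if is_assistant else [IGNORE_INDEX] * len(token_ids))
        # Every turn is closed by an end token in the input ...
        input_ids.append(eos_id)
        # ... but only the assistant's end token is (optionally) trained.
        labels.append(eos_id if (is_assistant and train_on_eos) else IGNORE_INDEX)
    return input_ids, labels

# Hypothetical token ids for one user turn and one assistant turn.
ids, labels = render_with_labels(
    [("user", [11, 12]), ("assistant", [21, 22, 23])], eos_id=2
)
```

Keeping `train_on_eos` as an explicit flag makes the stopping behavior a recorded choice rather than a side effect of the template.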

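When packing multiple conversations into one sequence, there are two options:

| Packing style | Attention across samples? | Use case |
|---|---|---|
| simple concat with EOS | yes | pretraining-like text streams |
| block-diagonal attention | no | independent SFT conversations |

SFT conversations should usually use block-diagonal attention, or at least insert a strong boundary token. Otherwise the answer of the second sample can read the entire first conversation, and the training conditional becomes: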

\[ \pi_\theta(y_B\mid x_B,\text{sample A}), \]

rather than the intended:

\[ \pi_\theta(y_B\mid x_B). \]

WarningPitfall: SFT Packing Can Leak Conversations

Packing independent conversations without block boundaries lets later examples attend to earlier examples. This changes the conditional task even when the loss mask looks correct.
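
A block-diagonal causal mask can be derived from per-token segment ids. This is a minimal pure-Python sketch; `block_diagonal_causal_mask` and `segment_ids` are hypothetical names, and a real implementation would build the same pattern as a tensor or pass sequence boundaries to the attention kernel.

```python
def block_diagonal_causal_mask(segment_ids):
    """segment_ids[i] is the index of the conversation that token i belongs to.

    Returns mask[i][j] == True when query position i may attend to key position j:
    j must not be in the future and must belong to the same conversation.
    """
    n = len(segment_ids)
    return [
        [j <= i and segment_ids[i] == segment_ids[j] for j in range(n)]
        for i in range(n)
    ]

# Two packed conversations: A has 3 tokens, B has 2 tokens.
mask = block_diagonal_causal_mask([0, 0, 0, 1, 1])
# The first token of B (row 3) sees only itself, not conversation A.
```

Implementations that pack this way often also reset position ids at each boundary, so that every conversation starts from position 0.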

Preference Data

Preference data usually looks like this:

prompt: x
chosen response: y_w
rejected response: y_l

Here \(y_w\) is the winner and \(y_l\) is the loser. Annotators are not asked for absolute scores, only to compare two candidates.

NoteDefinition: Preference Pair

A preference pair \((x,y_w,y_l)\) states that response \(y_w\) is preferred to response \(y_l\) under prompt \(x\).

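Preference data is closer to real product goals than SFT demonstrations. Users usually do not want one unique standard answer; among several responses they choose the one that is more helpful, more truthful, safer, and better formatted.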

Preference Batch Tensor Contract

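The most error-prone part of preference training is not the formula but how the chosen and rejected sequences are serialized. A batch should store at least:

| field | chosen | rejected | invariant |
|---|---|---|---|
| prompt_ids | same | same | same prompt and same chat template |
| input_ids | prompt + chosen | prompt + rejected | only the response span differs |
| attention_mask | full-sequence padding mask | full-sequence padding mask | pad tokens are invisible and excluded from loss/logprob |
| labels | response tokens, prompt as -100 | response tokens, prompt as -100 | logprob is computed over the response only |
| response_mask | assistant response span | assistant response span | must not include user/system tokens |
| pair_id | same id | same id | supports auditing and regrouping after shuffle |

In DPO and RM training, the chosen and rejected prompts must be identical token by token. If the two branches use different templates, for example one has an extra space, newline, or <assistant>, the sequence logprob difference will absorb formatting effects.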

NoteDefinition: Response Log Probability

The response log probability is the sum of token log probabilities only over the response span, conditioned on the prompt and previous response tokens: \[ \log \pi_\theta(y\mid x) = \sum_{t\in\mathcal{A}} \log\pi_\theta(y_t\mid x,y_{<t}). \]

A pair collator can render both full sequences first and then check the prompt prefix:

import torch

def check_pair_prefix(chosen_ids, rejected_ids, prompt_len):
    # Both branches must share the exact same tokenized prompt prefix.
    if not torch.equal(chosen_ids[:prompt_len], rejected_ids[:prompt_len]):
        raise ValueError("chosen/rejected prompts differ after tokenization")

def make_response_labels(input_ids, response_mask, attention_mask):
    labels = input_ids.clone()
    # Keep labels only on visible response tokens; everything else is ignored.
    trainable = response_mask.bool() & attention_mask.bool()
    labels[~trainable] = -100
    return labels

WarningPitfall: Pairwise Losses Need Pairwise Collation

If chosen and rejected examples are independently shuffled without a stable pair id, the loss can compare responses from different prompts and silently become meaningless.
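
One way to guard against this is to regroup rows by pair id before building a batch and fail loudly on any mismatch. The sketch below assumes a flat list of row dicts with hypothetical keys `pair_id`, `side`, `prompt`, and `response`; it is not tied to any particular dataset library.

```python
from collections import defaultdict

def regroup_pairs(rows):
    """Rebuild (chosen, rejected) tuples from independently shuffled rows."""
    by_id = defaultdict(dict)
    for row in rows:
        sides = by_id[row["pair_id"]]
        if row["side"] in sides:
            raise ValueError(f"duplicate {row['side']} row for pair {row['pair_id']}")
        sides[row["side"]] = row

    pairs = []
    for pair_id in sorted(by_id):
        sides = by_id[pair_id]
        if set(sides) != {"chosen", "rejected"}:
            raise ValueError(f"pair {pair_id} is incomplete")
        if sides["chosen"]["prompt"] != sides["rejected"]["prompt"]:
            raise ValueError(f"pair {pair_id} mixes different prompts")
        pairs.append((sides["chosen"], sides["rejected"]))
    return pairs

# Rows arrive in arbitrary order after a shuffle.
rows = [
    {"pair_id": 1, "side": "rejected", "prompt": "p1", "response": "d"},
    {"pair_id": 0, "side": "chosen", "prompt": "p0", "response": "a"},
    {"pair_id": 1, "side": "chosen", "prompt": "p1", "response": "c"},
    {"pair_id": 0, "side": "rejected", "prompt": "p0", "response": "b"},
]
pairs = regroup_pairs(rows)
```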

Reward Model

A reward model assigns a scalar to each prompt-response pair:

\[ r_\phi(x,y)\in\mathbb{R}. \]

The Bradley-Terry model turns a reward difference into a preference probability:

\[ P(y_w\succ y_l\mid x) = \sigma(r_\phi(x,y_w)-r_\phi(x,y_l)). \]

Reward model loss:

\[ \mathcal{L}_{\text{RM}} = - \log \sigma(r_\phi(x,y_w)-r_\phi(x,y_l)). \]

To derive this loss, assume the preference probability is

\[ P(y_w\succ y_l\mid x)=\sigma(\Delta r), \qquad \Delta r=r_\phi(x,y_w)-r_\phi(x,y_l). \]

The winner was observed to win, so the negative log-likelihood is:

\[ -\log P(y_w\succ y_l\mid x) = -\log\sigma(\Delta r). \]
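
A small numeric sketch of this loss, using only the standard library. The function name `bt_loss` is hypothetical; the softplus rewrite is a standard identity for \(-\log\sigma(\Delta r)\) that avoids overflow.

```python
import math

def bt_loss(r_w, r_l):
    """Bradley-Terry loss -log sigmoid(r_w - r_l) for one pair."""
    d = r_w - r_l
    # -log sigmoid(d) = log(1 + exp(-d)), computed without overflow for large |d|.
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

tied = bt_loss(0.0, 0.0)        # equal rewards: loss is log 2
correct = bt_loss(2.0, 0.0)     # winner scored higher: small loss
wrong = bt_loss(0.0, 2.0)       # winner scored lower: large loss
shifted = bt_loss(7.0, 5.0)     # same difference as `correct`, so same loss
```

Only the difference between the two rewards enters the loss, which is why `shifted` and `correct` agree.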

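A reward model is usually an LM backbone with a scalar head. Implementations often take the hidden state of the last answer token, or pool over the answer tokens and apply a linear layer.

Reward Model Implementation

A reward model is often written as:

import torch
from torch import nn

class RewardModel(nn.Module):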
    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone
        self.score = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, input_ids, attention_mask, response_mask):
        out = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        h = out.hidden_states[-1]
        last_idx = last_response_index(response_mask)
        batch_idx = torch.arange(h.shape[0], device=h.device)
        pooled = h[batch_idx, last_idx]
        return self.score(pooled).squeeze(-1)

last_response_index should find the position of the last response token in the full sequence, not a response-local position:

def last_response_index(response_mask):
    pos = torch.arange(response_mask.size(1), device=response_mask.device)
    masked_pos = pos[None, :].masked_fill(~response_mask.bool(), -1)
    last = masked_pos.max(dim=1).values
    if (last < 0).any():
        raise ValueError("response_mask contains an empty response")
    return last

Mean pooling over the response tokens is also possible:

\[ h_{\text{resp}} = \frac{\sum_t m_t h_t}{\sum_t m_t}. \]

The reward model output is meaningful only up to differences. The Bradley-Terry loss is insensitive to adding the same constant to both rewards:

\[ (r_w+c)-(r_l+c)=r_w-r_l. \]

The reward scale and offset therefore need extra calibration, especially when the reward is later fed into PPO or GRPO. Common approaches include reward normalization, per-batch whitening, and a fixed or dynamically adjusted KL coefficient.

WarningPitfall: Reward Scores Are Not Absolute Truth

Pairwise reward training identifies relative preferences more directly than calibrated absolute utilities. Treat raw reward magnitudes as training signals that need monitoring and normalization.

RLHF with KL Regularization

RLHF does not optimize the raw reward; it optimizes the reward together with a KL penalty:

\[ \max_\pi \mathbb{E}_{y\sim\pi(\cdot\mid x)} \left[ r_\phi(x,y) - \beta \log\frac{\pi(y\mid x)}{\pi_{\text{ref}}(y\mid x)} \right]. \]

The second term keeps the policy from drifting too far from the reference model. Without it, the policy can exploit loopholes in the reward model.

NoteDefinition: KL-Controlled Policy Objective

KL-controlled policy optimization maximizes reward while penalizing divergence from a reference policy: \[ J(\pi)=\mathbb{E}_{y\sim\pi}[r(y)]-\beta\operatorname{KL}(\pi\|\pi_{\text{ref}}). \]

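For language models, the sequence-level log-ratio, whose expectation under \(\pi_\theta\) is the sequence-level KL, decomposes into a sum of token logprob differences: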

\[ \log\frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)} = \sum_t \left[ \log\pi_\theta(y_t\mid x,y_{<t}) - \log\pi_{\text{ref}}(y_t\mid x,y_{<t}) \right]. \]

This is why RLHF implementations keep both policy logprobs and reference logprobs.

PPO Objective

PPO samples with the old policy and then optimizes the new policy with a clipped surrogate objective. Let

\[ \rho_t(\theta) = \frac{\pi_\theta(a_t\mid s_t)} {\pi_{\text{old}}(a_t\mid s_t)}. \]

PPO clipped objective:

\[ L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min \left( \rho_t(\theta)A_t, \operatorname{clip}(\rho_t(\theta),1-\epsilon,1+\epsilon)A_t \right) \right]. \]

NoteDefinition: Advantage

The advantage \(A_t\) estimates how much better an action is than the baseline value at state \(s_t\): \[ A_t\approx R_t-V(s_t). \]
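
The clipped surrogate for a single token can be evaluated by hand. The sketch below uses a hypothetical helper `clipped_surrogate` and a default `clip_eps=0.2` chosen only for illustration; it shows that clipping limits the gain from moving the ratio in the direction the advantage favors, while the pessimistic `min` keeps the unclipped term when it is worse.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-token PPO objective: min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)

up = math.log(1.5)    # new policy makes the token 1.5x more likely
down = math.log(0.5)  # new policy makes the token half as likely

gain_capped = clipped_surrogate(up, 0.0, advantage=1.0)      # capped at (1 + eps) * A
loss_kept = clipped_surrogate(up, 0.0, advantage=-1.0)       # unclipped, more negative term wins
no_bonus = clipped_surrogate(down, 0.0, advantage=1.0)       # unclipped ratio * A
loss_capped = clipped_surrogate(down, 0.0, advantage=-1.0)   # capped at (1 - eps) * A
```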

In LLM RLHF, the state is the prompt plus the partial response, and the action is the next token. PPO can sample online and optimize the reward, but the engineering is complex:

  1. rollout generation;
  2. reward scoring;
  3. reference KL;
  4. value model;
  5. advantage estimation;
  6. PPO epochs/minibatches;
  7. KL/reward/entropy logging.

If the data quality is sufficient, many teams first try an offline preference optimization method such as DPO.

PPO Rollout and Advantage Construction

LLM PPO does not treat a whole response as an ordinary classification sample; it treats it as a token-level trajectory:

state s_t: prompt + generated tokens before t
action a_t: generated token y_t
reward: usually sequence reward plus token-level KL penalties

In practice the reward is often split into a per-token KL term:

\[ r_t = -\beta \left( \log\pi_\theta(y_t\mid s_t) - \log\pi_{\text{ref}}(y_t\mid s_t) \right), \]

with the reward model score added at the last token:

\[ r_T \leftarrow r_T + r_\phi(x,y). \]

This way every token bears a KL cost and the final answer bears the preference reward. With a value model \(V_\psi(s_t)\), GAE can be used:

\[ \delta_t = r_t+\gamma V_\psi(s_{t+1})-V_\psi(s_t), \]

\[ A_t = \sum_{l=0}^{T-t} (\gamma\lambda)^l\delta_{t+l}. \]
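
The sum above is usually computed with a backward recursion, \(A_t=\delta_t+\gamma\lambda A_{t+1}\). The following pure-Python sketch assumes the value after the final token is 0 and uses a hypothetical function name `gae`; the default `gamma` and `lam` are illustrative, not a recommendation.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """rewards[t] and values[t] for t = 0..T-1; the value after the last token is 0."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        running = delta + gamma * lam * running               # discounted sum of residuals
        advantages[t] = running
    # Value targets: advantage plus the baseline it was measured against.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns

# Sequence reward of 1 on the last token, zero value estimates.
adv_full, _ = gae([0.0, 0.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
adv_half, _ = gae([0.0, 0.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=0.5)
```

With `lam=1.0` every token receives the full terminal reward; smaller `lam` concentrates credit near the end of the response.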

A PPO batch usually stores:

| Tensor | Meaning |
|---|---|
| input_ids | prompt + sampled response |
| response_mask | generated token positions |
| old_logprobs | rollout policy logprobs |
| ref_logprobs | frozen reference logprobs |
| rewards | reward model + KL-shaped token rewards |
| values | value model predictions |
| advantages | normalized advantage estimates |

WarningPitfall: PPO Needs the Rollout Logprobs

The PPO ratio compares the updated policy to the policy that generated the tokens. Recomputing only current logprobs is not enough; old_logprobs must be stored with the rollout batch.

PPO Token Masks and KL Rewards

Every token in LLM PPO is an action, but not every token should enter the PPO loss. Prompt tokens are the condition; only response tokens are the policy rollout. With response mask \(m_t\in\{0,1\}\), the token-level ratio is

\[ \rho_t(\theta) = \exp\left[ \log\pi_\theta(y_t\mid s_t) - \log\pi_{\text{old}}(y_t\mid s_t) \right], \qquad t\in\mathcal{A}. \]

The implementation first gathers the logprob of each sampled token:

def gather_token_logprobs(logits, labels, response_mask):
    # logits: [B, T, V], labels: [B, T]
    logp = torch.log_softmax(logits[:, :-1], dim=-1)
    target = labels[:, 1:].clamp_min(0)
    mask = response_mask[:, 1:].bool()
    tok_logp = logp.gather(-1, target[..., None]).squeeze(-1)
    return tok_logp, mask

Note the same label shift as in SFT: the logits at position \(t-1\) predict the token at position \(t\). response_mask[:, 1:] must be aligned with the shifted labels.

KL shaping commonly uses a sampled-action estimator:

\[ k_t = \log\pi_\theta(y_t\mid s_t) - \log\pi_{\text{ref}}(y_t\mid s_t). \]

This is not the exact KL over the full vocabulary but the log-ratio on the rollout token. It is cheap and directly yields a per-token penalty:

\[ r_t^{\text{KL}}=-\beta k_t. \]

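A concrete construction:

def build_ppo_rewards(scores, logp, ref_logp, response_mask, beta):
    # scores: [B], scalar reward model score for whole response
    # logp, ref_logp: [B, T-1], shifted token logprobs
    mask = response_mask[:, 1:].bool()
    rewards = -beta * (logp - ref_logp)
    rewards = rewards * mask

    # Position of the last response token in the shifted sequence. A
    # response-local index (length - 1) would land inside the prompt.
    pos = torch.arange(mask.size(1), device=mask.device)
    last = pos[None, :].masked_fill(~mask, -1).max(dim=1).values
    if last.lt(0).any():
        raise ValueError("PPO reward construction received an empty response")

    batch = torch.arange(mask.size(0), device=mask.device)
    rewards[batch, last] += scores
    return rewards

The guard matters: if a sample has an empty response, last = -1 would add the reward to the final position of the row, which is typically a padding token. The rollout stage should therefore reject empty responses, or reward construction should raise explicitly.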

WarningPitfall: KL Estimator Scope Must Be Logged

Token log-ratio on sampled actions, exact categorical KL over the vocabulary, and sequence-level KL are different quantities. Log which one your PPO loop uses.
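
The difference between the first two quantities is easy to see on a toy three-token vocabulary. In this sketch the distributions and function names are made up for illustration: the sampled log-ratio can be negative for an individual token, while its expectation under the policy equals the exact categorical KL.

```python
import math

def exact_kl(p, q):
    """KL(p || q) over the full vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sampled_log_ratio(p, q, token):
    """Log-ratio evaluated only at the sampled token."""
    return math.log(p[token]) - math.log(q[token])

policy = [0.7, 0.2, 0.1]      # hypothetical next-token distribution
reference = [0.5, 0.3, 0.2]   # hypothetical reference distribution

kl = exact_kl(policy, reference)
per_token = [sampled_log_ratio(policy, reference, i) for i in range(3)]
# Averaging the sampled estimator over tokens drawn from the policy recovers the KL.
expected = sum(p * k for p, k in zip(policy, per_token))
```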

A minimal training loop:

sample responses with policy_old
compute old_logprobs, ref_logprobs, rewards, values
compute advantages and returns
for ppo_epoch:
    for minibatch:
        recompute policy logprobs and values
        optimize clipped policy loss + value loss - entropy bonus
monitor KL, reward, length, clip fraction

DPO Derivation

DPO starts from the KL-regularized reward objective. For a fixed prompt \(x\), the optimal policy satisfies:

\[ \pi^\star(y\mid x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y\mid x) \exp\left(\frac{1}{\beta}r(x,y)\right). \]

Rearranging gives the relation between the reward and the policy ratio:

\[ r(x,y) = \beta \log \frac{\pi^\star(y\mid x)} {\pi_{\text{ref}}(y\mid x)} + \beta\log Z(x). \]

Taking the difference between winner and loser for the same prompt cancels \(Z(x)\):

\[ r(x,y_w)-r(x,y_l) = \beta \left[ \log\frac{\pi^\star(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \log\frac{\pi^\star(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)} \right]. \]

Parameterize \(\pi^\star\) with the current policy \(\pi_\theta\) and substitute into the Bradley-Terry preference model:

\[ P_\theta(y_w\succ y_l\mid x) = \sigma \left( \beta \left[ \log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)} \right] \right). \]

DPO loss:

\[ \mathcal{L}_{\text{DPO}} = - \log \sigma \left( \beta \left[ \Delta\log\pi_\theta - \Delta\log\pi_{\text{ref}} \right] \right), \]

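where

\[ \Delta\log\pi_\theta = \log\pi_\theta(y_w\mid x)-\log\pi_\theta(y_l\mid x), \]

and \(\Delta\log\pi_{\text{ref}}\) is defined the same way with \(\pi_{\text{ref}}\) in place of \(\pi_\theta\).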

DPO Gradient Intuition

Define the scaled margin

\[ z = \beta \left[ (\log\pi_\theta(y_w\mid x)-\log\pi_\theta(y_l\mid x)) - (\log\pi_{\text{ref}}(y_w\mid x)-\log\pi_{\text{ref}}(y_l\mid x)) \right]. \]

The DPO loss is

\[ \ell(z)=-\log\sigma(z). \]

Its derivative is

\[ \frac{\partial \ell}{\partial z} = \sigma(z)-1. \]

Therefore

\[ \frac{\partial \ell}{\partial \log\pi_\theta(y_w\mid x)} = \beta(\sigma(z)-1), \]

\[ \frac{\partial \ell}{\partial \log\pi_\theta(y_l\mid x)} = \beta(1-\sigma(z)). \]

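Gradient descent increases the chosen logprob and decreases the rejected logprob. When \(z\) is already large, \(\sigma(z)\approx1\) and the gradient approaches 0, meaning the model already separates the pair. \(\beta\) controls the slope of this margin: if \(\beta\) is too small the signal is weak; if \(\beta\) is too large the loss saturates quickly and becomes more sensitive to noisy pairs.

In more detail, because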

\[ \ell(z)=-\log\sigma(z), \]

\[ \frac{d}{dz}\log\sigma(z)=1-\sigma(z), \]

we have

\[ \frac{\partial\ell}{\partial z}=\sigma(z)-1. \]

And because

\[ \frac{\partial z}{\partial \log\pi_\theta(y_w\mid x)}=\beta, \qquad \frac{\partial z}{\partial \log\pi_\theta(y_l\mid x)}=-\beta, \]

the chain rule gives the two gradients above.
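
These gradients can be sanity-checked with a central finite difference. The sketch below treats the four sequence logprobs as plain floats; the numeric values and the names `dpo_loss` and `sigmoid` are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta):
    z = beta * ((pi_w - pi_l) - (ref_w - ref_l))
    # -log sigmoid(z), written as a stable softplus of -z
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

pi_w, pi_l, ref_w, ref_l, beta = -12.0, -15.0, -13.0, -14.0, 0.1
z = beta * ((pi_w - pi_l) - (ref_w - ref_l))

# Analytic gradients from the chain rule.
grad_w = beta * (sigmoid(z) - 1.0)   # negative: descent raises the chosen logprob
grad_l = beta * (1.0 - sigmoid(z))   # positive: descent lowers the rejected logprob

# Central finite differences on the loss itself.
h = 1e-6
fd_w = (dpo_loss(pi_w + h, pi_l, ref_w, ref_l, beta)
        - dpo_loss(pi_w - h, pi_l, ref_w, ref_l, beta)) / (2 * h)
fd_l = (dpo_loss(pi_w, pi_l + h, ref_w, ref_l, beta)
        - dpo_loss(pi_w, pi_l - h, ref_w, ref_l, beta)) / (2 * h)
```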

For each prompt, the KL-regularized objective is:

\[ \max_\pi \sum_y\pi(y) \left[ r(y)-\beta\log\frac{\pi(y)}{\pi_{\text{ref}}(y)} \right]. \]

Add the constraint \(\sum_y\pi(y)=1\) with Lagrange multiplier \(\lambda\) and take the first-order condition with respect to \(\pi(y)\):

\[ r(y)-\beta\left(\log\frac{\pi(y)}{\pi_{\text{ref}}(y)}+1\right)+\lambda=0. \]

Rearranging:

\[ \pi(y)\propto \pi_{\text{ref}}(y)\exp(r(y)/\beta). \]

Substituting the reward difference into the Bradley-Terry preference probability then yields the DPO loss.
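
The closed form can be checked numerically on a toy problem with three candidate responses. Everything in this sketch (the reference distribution, the rewards, `beta`, and the function names) is a made-up example: it verifies that the tilted distribution scores higher than the reference itself, that its objective value equals \(\beta\log Z\), and that the reward is recovered from the policy ratio.

```python
import math

def objective(pi, ref, rewards, beta):
    """Sum_y pi(y) [ r(y) - beta * log(pi(y) / ref(y)) ]."""
    return sum(p * (r - beta * math.log(p / q))
               for p, q, r in zip(pi, ref, rewards) if p > 0)

def optimal_policy(ref, rewards, beta):
    """pi(y) proportional to ref(y) * exp(r(y) / beta); also returns Z."""
    weights = [q * math.exp(r / beta) for q, r in zip(ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z

ref = [0.5, 0.3, 0.2]
rewards = [1.0, 0.0, 2.0]
beta = 0.5

pi_star, Z = optimal_policy(ref, rewards, beta)
best = objective(pi_star, ref, rewards, beta)
at_ref = objective(ref, ref, rewards, beta)   # KL term is zero here

# Reward recovered from the policy ratio: beta * log(pi*/ref) + beta * log Z.
implied = [beta * math.log(p / q) + beta * math.log(Z) for p, q in zip(pi_star, ref)]
```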

DPO Batch Implementation

The core step is computing the sequence logprob of the chosen and rejected responses:

def sequence_logprob(model, batch):
    out = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
    )
    logits = out.logits[:, :-1, :]
    labels = batch["labels"][:, 1:]
    mask = labels.ne(-100)
    labels = labels.clamp_min(0)

    logp = torch.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return (tok_logp * mask).sum(dim=-1)

DPO loss:

pi_w = sequence_logprob(policy, chosen)
pi_l = sequence_logprob(policy, rejected)

with torch.no_grad():
    ref_w = sequence_logprob(reference, chosen)
    ref_l = sequence_logprob(reference, rejected)

logit = beta * ((pi_w - pi_l) - (ref_w - ref_l))
loss = -torch.nn.functional.logsigmoid(logit).mean()

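Implementation pitfalls:

  1. the reference model must be frozen;
  2. prompt tokens and pad tokens are usually set to -100;
  3. chosen and rejected logprobs must be computed under the same template;
  4. whether the sequence logprob is a sum or length-normalized must match the recipe;
  5. \(\beta\) controls how strongly the policy is tied to the reference: it is the coefficient on the KL penalty in the underlying objective;
  6. chosen and rejected must stay pair-aligned; never compare across prompts;
  7. the shift of labels, the shift of response_mask, and attention_mask must be consistent.

WarningPitfall: DPO Is Sensitive to Formatting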

If chosen and rejected responses are tokenized under different chat templates, the logprob comparison includes formatting artifacts, not only response quality.

Length Normalization and Label Masks

The DPO sequence logprob can be a sum or a length-normalized average:

\[ \log\pi(y\mid x) = \sum_{t\in\mathcal{A}} \log\pi(y_t\mid x,y_{<t}) \]

\[ \bar{\ell}(y\mid x) = \frac{1}{|\mathcal{A}|} \sum_{t\in\mathcal{A}} \log\pi(y_t\mid x,y_{<t}). \]

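The two objectives differ. The sum naturally penalizes long responses, because adding more token logprobs usually makes the total more negative; the average focuses on per-token quality but may encourage longer responses. Whichever is used, chosen and rejected must follow the same response mask rule: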

mask = labels.ne(-100)
seq_logp = (tok_logp * mask).sum(-1)
seq_len = mask.sum(-1).clamp_min(1)
avg_logp = seq_logp / seq_len

WarningPitfall: DPO Can Learn Length Bias

If chosen responses are systematically longer or shorter than rejected responses, sequence logprob conventions can turn length into a shortcut feature.
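
A two-response toy case shows how the convention alone can flip a ranking. The token logprobs below are invented for illustration, and `seq_scores` is a hypothetical helper.

```python
def seq_scores(token_logps):
    """Return (sum, length-normalized average) of response token logprobs."""
    total = sum(token_logps)
    return total, total / max(len(token_logps), 1)

short = [-0.9, -0.9]   # 2 tokens, each fairly unlikely
long = [-0.5] * 6      # 6 tokens, each more likely

short_sum, short_avg = seq_scores(short)
long_sum, long_avg = seq_scores(long)

sum_prefers_short = short_sum > long_sum   # fewer terms, less negative total
avg_prefers_long = long_avg > short_avg    # better per-token logprob
```

The same pair is ranked in opposite directions, so mixing conventions between training and evaluation, or between the chosen and rejected branches, changes what the margin measures.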

GRPO and Verifiable Rewards

GRPO-style training removes the value model by comparing multiple rollouts from the same prompt. For each prompt \(x\), sample a group:

\[ y_1,\ldots,y_G\sim\pi_{\theta_{\text{old}}}(\cdot\mid x). \]

Score each response with a verifier or reward:

\[ R_i=R(x,y_i). \]

Then define group-relative advantages:

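\[ A_i = \frac{R_i-\operatorname{mean}(R_1,\ldots,R_G)} {\operatorname{std}(R_1,\ldots,R_G)+\epsilon_{\text{std}}}, \]

where \(\epsilon_{\text{std}}\) is a small constant for numerical stability, distinct from the clip range \(\epsilon\) in the clipped objective.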

The policy ratio can be sequence-level or token-averaged depending on implementation:

\[ \rho_i(\theta) = \exp \left( \log\pi_\theta(y_i\mid x) - \log\pi_{\theta_{\text{old}}}(y_i\mid x) \right). \]

A clipped objective mirrors PPO:

\[ \mathcal{L}_{\text{GRPO}} = - \mathbb{E}_i \left[ \min \left( \rho_i A_i, \operatorname{clip}(\rho_i,1-\epsilon,1+\epsilon)A_i \right) \right] + \beta\,\operatorname{KL}(\pi_\theta\|\pi_{\text{ref}}). \]

GRPO is attractive for math/code/verifiable tasks because reward can come from exact answer checkers, unit tests, or symbolic verifiers. It avoids training a separate value head, but it still needs careful sampling, reward normalization, KL control, and pass-rate monitoring.

NoteDefinition: Verifiable Reward

A verifiable reward is produced by an external checker, such as exact answer matching, unit tests, formal validation, or a task-specific verifier, rather than only by human preference labels.

WarningPitfall: Group Advantage Depends on Sampling Temperature

If all samples in a group are nearly identical, group-relative advantages collapse. If sampling is too noisy, rewards become high variance. GRPO quality depends strongly on rollout diversity.
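
The collapse case is visible directly in the advantage computation. This sketch uses the population standard deviation and a hypothetical `group_advantages` helper; whether an implementation uses the population or sample estimate, and what stabilizer it adds, are assumptions here.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (R_i - mean) / (std + eps) within one prompt's group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std; a design choice
    return [(r - mean) / (std + eps) for r in rewards]

# One of four rollouts passes the verifier: it gets a positive advantage.
mixed = group_advantages([1.0, 0.0, 0.0, 0.0])

# All rollouts agree (all pass or all fail): every advantage is zero,
# so this prompt contributes no policy gradient signal.
all_pass = group_advantages([1.0, 1.0, 1.0, 1.0])
all_fail = group_advantages([0.0, 0.0, 0.0, 0.0])
```

Tracking the fraction of groups with zero reward variance is one way to notice when rollout diversity has dropped too far.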

Offline Preference Variants

DPO is not the only offline preference objective. The family differs mainly in how it shapes the margin between chosen and rejected responses.

| Method | Core signal | Intuition |
|---|---|---|
| DPO | logistic of policy-ratio margin | implicit reward model from KL-control |
| IPO | squared preference margin target | avoids some overconfident logistic saturation |
| ORPO | SFT-like likelihood plus odds-ratio penalty | no separate reference model in the same way |
| SimPO | length-normalized policy margin | simpler objective, explicit target margin |

The practical question is less “which acronym is newest” and more:

  1. Do we have a frozen reference model?
  2. Are chosen/rejected lengths biased?
  3. Is preference data on-policy or stale?
  4. Does the task need exploration or only ranking correction?
  5. Are evals sensitive to style, truthfulness, safety, or exact correctness?

This is why the same base model may use SFT for format, DPO-like training for preference style, and GRPO/RL for verifiable reasoning.

SFT vs. DPO vs. PPO

| Method | Data | Objective | Strength | Risk |
|---|---|---|---|---|
| SFT | demonstrations | next-token CE on answer | stable, simple | imitates data quality |
| Reward model | preference pairs | pairwise logistic | learns evaluator | reward hacking if optimized too hard |
| PPO-RLHF | sampled responses + reward | KL-controlled RL | can optimize online reward | complex, high variance |
| DPO | preference pairs | logistic over policy ratios | simple offline training | depends on preference data coverage |
| GRPO | grouped sampled responses + verifier/reward | group-relative clipped policy objective | strong for verifiable tasks | rollout variance and reward hacking |

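In practice, escalate in order:

  1. if SFT solves the problem, start with SFT;
  2. if preference alignment is needed without online RL, try DPO first;
  3. if online exploration, tool feedback, or verifiable rewards are needed, then consider PPO/GRPO/RL.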

Preference Data Quality

Preference data is more than a triple format; what matters is candidate generation and the annotation policy. The distribution of a preference pair can be written as:

\[ x\sim p_{\text{prompt}}, \qquad y_1,y_2\sim q_{\text{candidate}}(\cdot\mid x), \qquad y_w\succ y_l\sim h(\cdot\mid x,y_1,y_2). \]

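Here \(q_{\text{candidate}}\) determines how hard the comparison is. If one candidate is obviously bad and the other obviously good, the model quickly learns coarse preferences; if the two candidates are close in quality, the signal is finer but annotation noise is higher.

Common data problems: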

| Issue | Symptom | Consequence |
|---|---|---|
| trivial pairs | chosen always much better | weak fine-grained learning |
| length bias | chosen longer/more verbose | model learns verbosity |
| template artifacts | chosen/rejected formatting differs | objective learns formatting shortcut |
| stale negatives | rejected from weak old model | little pressure on current failures |
| annotator disagreement | inconsistent winners | noisy reward/preference signal |
| benchmark contamination | eval answers in training | inflated win rate |

Audit the data before preference training instead of going straight to running a loss.

What to Log

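Post-training cannot be judged by loss alone:

| Metric | Meaning |
|---|---|
| chosen reward / rejected reward | whether the reward model separates preferences |
| DPO accuracy | whether the logit ranks chosen above rejected |
| KL to reference | whether the policy has drifted too far |
| response length | whether the model games the objective by getting longer |
| win rate | human/model evaluator preference |
| refusal rate | whether the safety policy over-refuses |
| format violation rate | whether chat template/tool-call output is stable |
| reward-model margin | whether the RM is overconfident or collapsing |
| PPO/GRPO clip fraction | whether policy updates are too aggressive |
| entropy | whether sampling narrows too early |
| pass@k / verifier score | whether verifiable tasks actually improve |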

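The danger of preference optimization is that the reward or preference signal is only a proxy objective. The model may learn to please the evaluator, pad its answers, mimic formats, or dodge hard questions, rather than actually get better.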

Implementation Checklist

Before post-training, confirm at least the following:

  1. the chat template is consistent across the tokenizer, training data, and inference service;
  2. SFT labels cover only the intended assistant tokens;
  3. packed conversations have block boundaries;
  4. the reward model pooling position falls on response tokens;
  5. the reward scale is normalized or calibrated;
  6. PPO rollouts store old logprobs, ref logprobs, values, and response masks;
  7. DPO chosen and rejected use the same template and the same logprob convention;
  8. the choice between sum and length-normalized sequence logprob is explicit;
  9. the reference model is frozen;
  10. KL, length, entropy, format violations, and task evals are monitored together;
  11. the preference data has been audited for length, template, and stale-negative bias;
  12. the DPO/RM pair collator checks that the chosen and rejected prompt prefixes are identical;
  13. PPO old, ref, and current logprobs use the same shifted response mask;
  14. it is clear whether the PPO KL metric is the sampled log-ratio, the exact token KL, or the sequence KL;
  15. online RL has a clear verifier/reward and a stopping condition.

References