4.8 Post-Training and Preference Optimization
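Pretraining teaches a model language, knowledge, and pattern completion; post-training makes it usable: it follows instructions, respects formats, avoids obviously bad behavior, and prefers the responses humans like among several candidates.
This section does not treat RLHF and DPO as buzzwords. It breaks them down by data format, objective function, and implementation detail.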
The Alignment Pipeline
A classic LLM post-training pipeline:
- SFT: supervised fine-tuning on demonstration data;
- Reward modeling: train a reward model on preference pairs;
- RLHF/PPO: optimize the policy against the reward model, with a KL constraint that keeps it close to a reference model;
- DPO/IPO/ORPO variants: optimize the policy directly from preference pairs, avoiding an explicit RL loop.
In language-model post-training, the policy \(\pi_\theta(y\mid x)\) is the conditional distribution over responses \(y\) given prompt \(x\).
Here the policy is not a different kind of model; it is usually just the next-token distribution of a decoder-only LM.
Supervised Fine-Tuning
SFT data:
prompt x: user instruction
answer y: desired assistant response
After serialization:
<user> x
<assistant> y
The training objective usually computes loss only on assistant tokens:
\[ \mathcal{L}_{\text{SFT}} = - \sum_{t\in \mathcal{A}} \log \pi_\theta(y_t\mid x,y_{<t}), \]
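where \(\mathcal{A}\) is the set of assistant answer token positions.
Implementation details:
# -100 is the ignore index of the cross-entropy loss: masked positions add no loss.
labels = input_ids.clone()
labels[prompt_mask == 1] = -100      # do not train on prompt tokens
labels[attention_mask == 0] = -100   # do not train on padding
loss = model(input_ids, attention_mask=attention_mask, labels=labels).loss
If prompt tokens contribute loss in SFT, the model is trained to model user messages and templates, not only to produce assistant responses.
SFT is stable, cheap, and easy to debug. Its weakness is that it can only imitate demonstrations; it does not directly compare the relative quality of multiple answers.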
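To make the effect of the label mask concrete, here is a minimal sketch in plain Python (no PyTorch, so it runs anywhere). The helper name `masked_nll` and the toy numbers are illustrative assumptions. It averages over trainable tokens, as PyTorch's default mean reduction does, whereas the formula above is written as a sum.

```python
IGNORE_INDEX = -100  # same sentinel as in the snippet above


def masked_nll(token_logprobs, labels):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    kept = [lp for lp, lab in zip(token_logprobs, labels) if lab != IGNORE_INDEX]
    if not kept:
        raise ValueError("no trainable tokens in this example")
    return -sum(kept) / len(kept)


# Two prompt tokens (masked) followed by two assistant tokens (trained).
token_logprobs = [-3.0, -2.0, -0.5, -1.5]
labels = [IGNORE_INDEX, IGNORE_INDEX, 17, 42]
loss = masked_nll(token_logprobs, labels)  # only the last two positions count
```

The poorly predicted prompt tokens do not affect the result, which is the behavior the mask is meant to guarantee.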
SFT Collator and Packing
The key SFT implementation lives in the collator, not the model class. A chat sample usually contains all of the following:
system tokens
user tokens
assistant tokens
eos / end-of-message tokens
padding tokens
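Only the assistant completion contributes to the loss. A more complete collator looks like this:
def build_sft_labels(input_ids, role_ids, attention_mask):
    # ASSISTANT_ROLE_ID: the integer id your pipeline assigns to assistant tokens.
    labels = input_ids.clone()
    is_assistant = role_ids.eq(ASSISTANT_ROLE_ID)
    is_visible = attention_mask.eq(1)
    trainable = is_assistant & is_visible
    labels[~trainable] = -100  # ignored by the loss
    return labels
Here role_ids does not necessarily exist in the tokenizer output; in practice, span offsets are usually recorded while rendering the chat template and then mapped back to token positions. The most error-prone part is the template boundary:
<assistant>
answer tokens
<eos>
Whether <eos> counts toward the assistant loss is a training decision. If EOS is never trained, the model may not learn to stop; if the end token of the user turn is also included in the loss, the model may learn to end at the wrong position.
When packing multiple conversations into one sequence, there are two further choices: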
| Packing style | Attention across samples? | Use case |
|---|---|---|
| simple concat with EOS | yes | pretraining-like text streams |
| block-diagonal attention | no | independent SFT conversations |
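SFT conversations should usually use block-diagonal attention, or at least insert a strong boundary token. Otherwise the answer of the second sample can read the entire conversation of the first sample, and the training conditional becomes: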
\[ \pi_\theta(y_B\mid x_B,\text{sample A}), \]
rather than the intended:
\[ \pi_\theta(y_B\mid x_B). \]
Packing independent conversations without block boundaries lets later examples attend to earlier examples. This changes the conditional task even when the loss mask looks correct.
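The block-diagonal option can be sketched with per-token segment ids. This is a minimal plain-Python illustration, not any particular library's API; the name `segment_ids` and the function below are assumptions introduced for the example.

```python
def block_diagonal_causal_mask(segment_ids):
    """allowed[i][j] is True when query position i may attend to key position j.

    Attention is permitted only to earlier-or-equal positions (causal) that
    belong to the same packed conversation (block-diagonal).
    """
    n = len(segment_ids)
    return [
        [j <= i and segment_ids[i] == segment_ids[j] for j in range(n)]
        for i in range(n)
    ]


# Two packed conversations: A occupies positions 0-2, B occupies positions 3-4.
segment_ids = [0, 0, 0, 1, 1]
mask = block_diagonal_causal_mask(segment_ids)
```

With this mask, tokens of conversation B cannot read conversation A, so the conditional stays \(\pi_\theta(y_B\mid x_B)\).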
Preference Data
Preference data usually looks like this:
prompt: x
chosen response: y_w
rejected response: y_l
Here \(y_w\) is the winner and \(y_l\) is the loser. Annotators are not asked for absolute scores, only to compare two candidates.
A preference pair \((x,y_w,y_l)\) states that response \(y_w\) is preferred to response \(y_l\) under prompt \(x\).
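Preference data is closer to real product goals than SFT demonstrations, because users usually do not want a single canonical answer; among several responses they choose the one that is more helpful, more truthful, safer, and better formatted.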
Preference Batch Tensor Contract
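The most error-prone part of preference training is not the formula but how the chosen and rejected sequences are serialized. A batch must store at least:
| field | chosen | rejected | invariant |
|---|---|---|---|
| prompt_ids | same | same | same prompt and same chat template |
| input_ids | prompt + chosen | prompt + rejected | only the response span differs |
| attention_mask | full sequence padding mask | full sequence padding mask | pad positions are not attended and do not count toward loss/logprob |
| labels | response tokens, prompt as -100 | response tokens, prompt as -100 | logprob is computed over the response only |
| response_mask | assistant response span | assistant response span | must not include user/system tokens |
| pair_id | same id | same id | makes auditing and regrouping after shuffle easier |
In DPO and reward-model training, the chosen and rejected prompts must be token-for-token identical. If the two branches use different templates, for example one has an extra space, newline, or <assistant> tag, the sequence logprob difference picks up formatting effects.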
The response log probability is the sum of token log probabilities only over the response span, conditioned on the prompt and previous response tokens: \[ \log \pi_\theta(y\mid x) = \sum_{t\in\mathcal{A}} \log\pi_\theta(y_t\mid x,y_{<t}). \]
A pair collator can render both full sequences first, then check the prompt prefix:
import torch

def check_pair_prefix(chosen_ids, rejected_ids, prompt_len):
    # Both branches must share the exact same tokenized prompt.
    if not torch.equal(chosen_ids[:prompt_len], rejected_ids[:prompt_len]):
        raise ValueError("chosen/rejected prompts differ after tokenization")

def make_response_labels(input_ids, response_mask, attention_mask):
    labels = input_ids.clone()
    trainable = response_mask.bool() & attention_mask.bool()
    labels[~trainable] = -100  # prompt and padding are ignored
    return labels
If chosen and rejected examples are independently shuffled without a stable pair id, the loss can compare responses from different prompts and silently become meaningless.
Reward Model
A reward model assigns a scalar to each prompt-response pair:
\[ r_\phi(x,y)\in\mathbb{R}. \]
The Bradley-Terry model turns a reward difference into a preference probability:
\[ P(y_w\succ y_l\mid x) = \sigma(r_\phi(x,y_w)-r_\phi(x,y_l)). \]
Reward model loss:
\[ \mathcal{L}_{\text{RM}} = - \log \sigma(r_\phi(x,y_w)-r_\phi(x,y_l)). \]
To see where this loss comes from, assume the preference probability is
\[ P(y_w\succ y_l\mid x)=\sigma(\Delta r), \qquad \Delta r=r_\phi(x,y_w)-r_\phi(x,y_l). \]
We observe that the winner did win, so the negative log-likelihood is:
\[ -\log P(y_w\succ y_l\mid x) = -\log\sigma(\Delta r). \]
A reward model is usually an LM backbone with a scalar head. Implementations often take the hidden state of the last answer token, or pool over the answer tokens and then apply a linear layer.
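As a small numeric illustration of the pairwise loss above, the following plain-Python sketch evaluates \(-\log\sigma(\Delta r)\) for scalar rewards. The function names are illustrative assumptions; a real reward model would compute the same quantity on batched tensors.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood for one preference pair."""
    return -log_sigmoid(r_chosen - r_rejected)


tied = reward_model_loss(1.0, 1.0)       # no separation: loss equals log 2
separated = reward_model_loss(2.0, 0.0)  # winner scored higher: smaller loss
inverted = reward_model_loss(0.0, 2.0)   # winner scored lower: larger loss
```

Only the difference of the two rewards enters the loss, so shifting both rewards by the same constant leaves it unchanged.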
Reward Model Implementation
A reward model is often written as:
import torch
from torch import nn

class RewardModel(nn.Module):
    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone
        self.score = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, input_ids, attention_mask, response_mask):
        out = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        h = out.hidden_states[-1]  # [B, T, H], last layer
        last_idx = last_response_index(response_mask)
        batch_idx = torch.arange(h.shape[0], device=h.device)
        pooled = h[batch_idx, last_idx]  # [B, H], last response token
        return self.score(pooled).squeeze(-1)
last_response_index should return the position of the last response token in the full sequence, not a response-local position:
def last_response_index(response_mask):
    pos = torch.arange(response_mask.size(1), device=response_mask.device)
    # Non-response positions become -1, so the max is the last response position.
    masked_pos = pos[None, :].masked_fill(~response_mask.bool(), -1)
    last = masked_pos.max(dim=1).values
    if (last < 0).any():
        raise ValueError("response_mask contains an empty response")
    return last
Mean pooling over the response tokens is another option:
\[ h_{\text{resp}} = \frac{\sum_t m_t h_t}{\sum_t m_t}. \]
Reward model outputs are meaningful only as differences. The Bradley-Terry loss is insensitive to adding the same constant to both rewards:
\[ (r_w+c)-(r_l+c)=r_w-r_l. \]
The reward scale and offset therefore need separate calibration, especially when the reward later feeds PPO/GRPO. Common practices include reward normalization, per-batch whitening, and a fixed or dynamically adjusted KL coefficient.
Pairwise reward training identifies relative preferences more directly than calibrated absolute utilities. Treat raw reward magnitudes as training signals that need monitoring and normalization.
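Per-batch whitening, one of the normalization options mentioned above, can be sketched as follows. This is a plain-Python illustration under the assumption that rewards are whitened with the batch mean and population standard deviation; the helper name `whiten` and the `eps` value are assumptions, and real recipes differ in details.

```python
import math


def whiten(rewards, eps=1e-8):
    """Shift and scale a batch of rewards to zero mean and roughly unit variance."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    return [(r - mean) / math.sqrt(var + eps) for r in rewards]


raw = [1.0, 2.0, 3.0]
shifted = [r + 100.0 for r in raw]  # same preferences, different offset
white_raw = whiten(raw)
white_shifted = whiten(shifted)
```

Both batches whiten to the same values, which removes the arbitrary offset that pairwise training leaves undetermined.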
RLHF with KL Regularization
RLHF does not optimize the raw reward; it optimizes the reward together with a KL penalty:
\[ \max_\pi \mathbb{E}_{y\sim\pi(\cdot\mid x)} \left[ r_\phi(x,y) - \beta \log\frac{\pi(y\mid x)}{\pi_{\text{ref}}(y\mid x)} \right]. \]
The second term keeps the policy from drifting too far from the reference model. Otherwise the policy will exploit loopholes in the reward model.
KL-controlled policy optimization maximizes reward while penalizing divergence from a reference policy: \[ J(\pi)=\mathbb{E}_{y\sim\pi}[r(y)]-\beta\operatorname{KL}(\pi\|\pi_{\text{ref}}). \]
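For a language model, the sequence-level log-ratio, whose expectation under the policy is the KL term, decomposes into a sum of per-token logprob differences: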
\[ \log\frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)} = \sum_t \left[ \log\pi_\theta(y_t\mid x,y_{<t}) - \log\pi_{\text{ref}}(y_t\mid x,y_{<t}) \right]. \]
This is why RLHF implementations keep both policy logprobs and reference logprobs.
PPO Objective
PPO samples with the old policy, then optimizes a clipped surrogate objective for the new policy. Let
\[ \rho_t(\theta) = \frac{\pi_\theta(a_t\mid s_t)} {\pi_{\text{old}}(a_t\mid s_t)}. \]
PPO clipped objective:
\[ L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min \left( \rho_t(\theta)A_t, \operatorname{clip}(\rho_t(\theta),1-\epsilon,1+\epsilon)A_t \right) \right]. \]
The advantage \(A_t\) estimates how much better an action is than the baseline value at state \(s_t\): \[ A_t\approx R_t-V(s_t). \]
In LLM RLHF, the state is the prompt plus the partial response, and the action is the next token. PPO can sample online and optimize reward, but the engineering is complex:
- rollout generation;
- reward scoring;
- reference KL;
- value model;
- advantage estimation;
- PPO epochs/minibatches;
- KL/reward/entropy logging.
If data quality is sufficient, many scenarios try offline preference optimization such as DPO first.
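The clipped surrogate above can be checked on single tokens. The sketch below is plain Python with illustrative names (`ppo_clipped_term`, `clip_eps`); it returns the quantity to maximize for one action, so a training loss would be its negative averaged over response tokens.

```python
import math


def ppo_clipped_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for one sampled action."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)


logp_old = math.log(0.2)
# Ratio 1.5 with positive advantage: the gain is capped at (1 + eps) * A.
capped_gain = ppo_clipped_term(math.log(0.3), logp_old, advantage=1.0)
# Ratio 1.5 with negative advantage: no clipping, the full penalty applies.
full_penalty = ppo_clipped_term(math.log(0.3), logp_old, advantage=-1.0)
# Ratio 0.5 with negative advantage: the term is held at (1 - eps) * A.
floored = ppo_clipped_term(math.log(0.1), logp_old, advantage=-1.0)
```

The asymmetry is the point of the min: the objective stops rewarding moves that already went far in the helpful direction, but never hides a move that made things worse.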
PPO Rollout and Advantage Construction
LLM PPO does not treat the whole response as one ordinary classification example; it treats it as a token-level trajectory:
state s_t: prompt + generated tokens before t
action a_t: generated token y_t
reward: usually sequence reward plus token-level KL penalties
In practice the reward is often split into a per-token KL penalty:
\[ r_t = -\beta \left( \log\pi_\theta(y_t\mid s_t) - \log\pi_{\text{ref}}(y_t\mid s_t) \right), \]
with the reward model score added at the last token:
\[ r_T \leftarrow r_T + r_\phi(x,y). \]
This way every token carries a KL cost and the final answer carries the preference reward. With a value model \(V_\psi(s_t)\), GAE can be used:
\[ \delta_t = r_t+\gamma V_\psi(s_{t+1})-V_\psi(s_t), \]
\[ A_t = \sum_{l=0}^{T-t} (\gamma\lambda)^l\delta_{t+l}. \]
A PPO batch usually stores:
| Tensor | Meaning |
|---|---|
| input_ids | prompt + sampled response |
| response_mask | generated token positions |
| old_logprobs | rollout policy logprobs |
| ref_logprobs | frozen reference logprobs |
| rewards | reward model + KL-shaped token rewards |
| values | value model predictions |
| advantages | normalized advantage estimates |
The PPO ratio compares the updated policy to the policy that generated the tokens. Recomputing only current logprobs is not enough; old_logprobs must be stored with the rollout batch.
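The GAE recursion above is usually computed backward over the response. The following plain-Python sketch assumes one response, per-token `rewards` and `values` lists of equal length, and a value of zero after the final token; the function name and defaults are illustrative assumptions.

```python
def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """Return (advantages, returns) for one response via the backward recursion.

    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t     = delta_t + gamma * lam * A_{t+1}
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0   # assumed value after the last token
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# Sequence reward only at the last token, value model predicting 0.5 everywhere.
adv, ret = compute_gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=1.0)
```

With \(\gamma=\lambda=1\) the returns reduce to the plain sum of future rewards, which is a convenient sanity check before tuning \(\lambda\).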
PPO Token Masks and KL Rewards
In LLM PPO every token is an action, but not every token should enter the PPO loss. Prompt tokens are the conditioning context; only response tokens belong to the policy rollout. With response mask \(m_t\in\{0,1\}\), the token-level ratio is
\[ \rho_t(\theta) = \exp\left[ \log\pi_\theta(y_t\mid s_t) - \log\pi_{\text{old}}(y_t\mid s_t) \right], \qquad t\in\mathcal{A}. \]
In the implementation, first gather the logprob of each sampled token:
def gather_token_logprobs(logits, labels, response_mask):
    # logits: [B, T, V], labels: [B, T]
    logp = torch.log_softmax(logits[:, :-1], dim=-1)
    # clamp_min(0) turns -100 into a valid index; those positions are masked out.
    target = labels[:, 1:].clamp_min(0)
    mask = response_mask[:, 1:].bool()
    tok_logp = logp.gather(-1, target[..., None]).squeeze(-1)
    return tok_logp, mask
Note the label shift, the same as in SFT: the logits at position \(t-1\) predict the token at position \(t\). response_mask[:, 1:] must stay aligned with the shifted labels.
KL shaping commonly uses a sampled-action estimator:
\[ k_t = \log\pi_\theta(y_t\mid s_t) - \log\pi_{\text{ref}}(y_t\mid s_t). \]
This is not the exact KL over the full vocabulary; it is the log-ratio on the rollout token. It is cheap, and it directly yields a per-token penalty:
\[ r_t^{\text{KL}}=-\beta k_t. \]
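A concrete construction:
def build_ppo_rewards(scores, logp, ref_logp, response_mask, beta):
    # scores: [B], scalar reward model score for whole response
    # logp, ref_logp: [B, T-1], shifted per-token logprobs
    mask = response_mask[:, 1:].bool()
    rewards = -beta * (logp - ref_logp)
    rewards = rewards * mask
    # Position of the last response token in the shifted full sequence.
    # A response-local index (length - 1) would land inside the prompt.
    pos = torch.arange(mask.size(1), device=mask.device)
    last = pos.expand_as(mask).masked_fill(~mask, -1).max(dim=1).values
    if last.lt(0).any():
        raise ValueError("PPO reward construction received an empty response")
    batch = torch.arange(mask.size(0), device=mask.device)
    rewards[batch, last] += scores
    return rewards
This guard matters: if a sample has an empty response, last is -1 and the reward would be added to the final position of the sequence, typically a padding token. The rollout stage should therefore reject empty responses, or reward construction should raise explicitly.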
Token log-ratio on sampled actions, exact categorical KL over the vocabulary, and sequence-level KL are different quantities. Log which one your PPO loop uses.
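The difference between the sampled log-ratio and the exact categorical KL is easy to see on a toy vocabulary. This plain-Python sketch uses made-up three-token distributions; the function names are illustrative assumptions.

```python
import math


def exact_kl(p, q):
    """KL(p || q) for categorical distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sampled_log_ratio(p, q, token):
    """Single-sample estimator: log p(token) - log q(token)."""
    return math.log(p[token]) - math.log(q[token])


policy = [0.7, 0.2, 0.1]
reference = [0.5, 0.3, 0.2]

kl = exact_kl(policy, reference)
per_token = [sampled_log_ratio(policy, reference, t) for t in range(3)]
# Averaging the sampled estimator under the policy recovers the exact KL.
expected = sum(p * k for p, k in zip(policy, per_token))
```

The exact KL is never negative, but the per-token estimator is negative whenever the policy assigns the sampled token less probability than the reference, so the two should not be logged under the same name.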
A minimal training loop:
sample responses with policy_old
compute old_logprobs, ref_logprobs, rewards, values
compute advantages and returns
for ppo_epoch:
    for minibatch:
        recompute policy logprobs and values
        optimize clipped policy loss + value loss - entropy bonus
        monitor KL, reward, length, clip fraction
DPO Derivation
DPO starts from the KL-regularized reward objective. For a fixed prompt \(x\), the optimal policy satisfies:
\[ \pi^\star(y\mid x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y\mid x) \exp\left(\frac{1}{\beta}r(x,y)\right). \]
Rearranging gives the relationship between reward and policy ratio:
\[ r(x,y) = \beta \log \frac{\pi^\star(y\mid x)} {\pi_{\text{ref}}(y\mid x)} + \beta\log Z(x). \]
Taking the difference between winner and loser for the same prompt cancels \(Z(x)\):
\[ r(x,y_w)-r(x,y_l) = \beta \left[ \log\frac{\pi^\star(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \log\frac{\pi^\star(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)} \right]. \]
Parameterize \(\pi^\star\) with the current policy \(\pi_\theta\) and substitute into the Bradley-Terry preference model:
\[ P_\theta(y_w\succ y_l\mid x) = \sigma \left( \beta \left[ \log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)} \right] \right). \]
DPO loss:
\[ \mathcal{L}_{\text{DPO}} = - \log \sigma \left( \beta \left[ \Delta\log\pi_\theta - \Delta\log\pi_{\text{ref}} \right] \right), \]
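where
\[ \Delta\log\pi_\theta = \log\pi_\theta(y_w\mid x)-\log\pi_\theta(y_l\mid x), \]
and \(\Delta\log\pi_{\text{ref}}\) is defined the same way with \(\pi_{\text{ref}}\) in place of \(\pi_\theta\).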
DPO Gradient Intuition
Let
\[ z = \beta \left[ (\log\pi_\theta(y_w\mid x)-\log\pi_\theta(y_l\mid x)) - (\log\pi_{\text{ref}}(y_w\mid x)-\log\pi_{\text{ref}}(y_l\mid x)) \right]. \]
The DPO loss is
\[ \ell(z)=-\log\sigma(z). \]
Its derivative is
\[ \frac{\partial \ell}{\partial z} = \sigma(z)-1. \]
Therefore
\[ \frac{\partial \ell}{\partial \log\pi_\theta(y_w\mid x)} = \beta(\sigma(z)-1), \]
\[ \frac{\partial \ell}{\partial \log\pi_\theta(y_l\mid x)} = \beta(1-\sigma(z)). \]
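Gradient descent therefore raises the chosen logprob and lowers the rejected logprob. When \(z\) is already large, \(\sigma(z)\approx1\) and the gradient approaches 0: the model has already separated the pair. \(\beta\) controls the slope of this margin: if \(\beta\) is too small the signal is weak; if \(\beta\) is too large the loss saturates quickly and becomes more sensitive to noisy pairs.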
In detail, because
\[ \ell(z)=-\log\sigma(z), \]
and
\[ \frac{d}{dz}\log\sigma(z)=1-\sigma(z), \]
it follows that
\[ \frac{\partial\ell}{\partial z}=\sigma(z)-1. \]
Moreover, since
\[ \frac{\partial z}{\partial \log\pi_\theta(y_w\mid x)}=\beta, \qquad \frac{\partial z}{\partial \log\pi_\theta(y_l\mid x)}=-\beta, \]
the chain rule gives the two gradients above.
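The gradient formulas can be sanity-checked numerically. The sketch below is plain Python on scalar sequence logprobs with made-up values; the function names are illustrative assumptions, and the check compares the analytic gradient with a central finite difference.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta):
    z = beta * ((pi_w - pi_l) - (ref_w - ref_l))
    return -math.log(sigmoid(z))


def analytic_grads(pi_w, pi_l, ref_w, ref_l, beta):
    """Return (d loss / d log pi(y_w), d loss / d log pi(y_l))."""
    z = beta * ((pi_w - pi_l) - (ref_w - ref_l))
    g = beta * (sigmoid(z) - 1.0)
    return g, -g


def numeric_grad_w(pi_w, pi_l, ref_w, ref_l, beta, h=1e-6):
    up = dpo_loss(pi_w + h, pi_l, ref_w, ref_l, beta)
    down = dpo_loss(pi_w - h, pi_l, ref_w, ref_l, beta)
    return (up - down) / (2.0 * h)


args = (-12.0, -15.0, -13.0, -14.0, 0.5)  # toy logprobs, here z = 1.0
grad_w, grad_l = analytic_grads(*args)
grad_w_numeric = numeric_grad_w(*args)
```

The chosen gradient is negative and the rejected gradient is positive, so a descent step raises the chosen logprob and lowers the rejected one.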
The closed-form optimal policy used above is derived as follows. For each prompt, the KL-regularized objective is:
\[ \max_\pi \sum_y\pi(y) \left[ r(y)-\beta\log\frac{\pi(y)}{\pi_{\text{ref}}(y)} \right]. \]
Add the constraint \(\sum_y\pi(y)=1\) with Lagrange multiplier \(\lambda\) and take the first-order condition with respect to \(\pi(y)\):
\[ r(y)-\beta\left(\log\frac{\pi(y)}{\pi_{\text{ref}}(y)}+1\right)+\lambda=0. \]
Rearranging:
\[ \pi(y)\propto \pi_{\text{ref}}(y)\exp(r(y)/\beta). \]
Substituting the reward difference into the Bradley-Terry preference probability then gives the DPO loss.
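The closed form can be verified on a toy response set. This plain-Python sketch assumes three candidate responses with made-up reference probabilities and rewards; the function names are illustrative assumptions.

```python
import math


def optimal_policy(ref_probs, rewards, beta):
    """pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z, returned together with Z."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)  # partition function Z(x)
    return [w / z for w in weights], z


def implicit_reward_gap(pi, ref, i, j, beta):
    """beta * [log(pi_i / ref_i) - log(pi_j / ref_j)], in which log Z cancels."""
    return beta * (math.log(pi[i] / ref[i]) - math.log(pi[j] / ref[j]))


ref = [0.5, 0.3, 0.2]
rewards = [1.0, 0.0, -1.0]
beta = 0.5
pi_star, z = optimal_policy(ref, rewards, beta)
gap = implicit_reward_gap(pi_star, ref, 0, 2, beta)
```

The recovered gap equals `rewards[0] - rewards[2]` without ever using `z`, which is the cancellation DPO relies on.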
DPO Batch Implementation
The core step is computing the sequence logprob of the chosen and rejected responses:
import torch

def sequence_logprob(model, batch):
    out = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
    )
    # Shift: logits at position t-1 predict the token at position t.
    logits = out.logits[:, :-1, :]
    labels = batch["labels"][:, 1:]
    mask = labels.ne(-100)
    labels = labels.clamp_min(0)  # valid indices for gather; masked below
    logp = torch.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return (tok_logp * mask).sum(dim=-1)
DPO loss:
pi_w = sequence_logprob(policy, chosen)
pi_l = sequence_logprob(policy, rejected)
with torch.no_grad():
    ref_w = sequence_logprob(reference, chosen)
    ref_l = sequence_logprob(reference, rejected)
logit = beta * ((pi_w - pi_l) - (ref_w - ref_l))
loss = -torch.nn.functional.logsigmoid(logit).mean()
Implementation pitfalls:
- the reference model must be frozen;
- prompt tokens and pad tokens are usually set to -100;
- chosen/rejected logprobs must be computed under the same template;
- whether the sequence logprob is a sum or length-normalized must match the recipe;
- \(\beta\) sets the strength of the implicit KL constraint toward the reference;
- chosen/rejected must stay pair-aligned; never compare across prompts;
- the shift of labels, the shift of response_mask, and attention_mask must be consistent.
If chosen and rejected responses are tokenized under different chat templates, the logprob comparison includes formatting artifacts, not only response quality.
Length Normalization and Label Masks
The DPO sequence logprob can be a sum or a length-normalized average:
\[ \log\pi(y\mid x) = \sum_{t\in\mathcal{A}} \log\pi(y_t\mid x,y_{<t}) \]
or
\[ \bar{\ell}(y\mid x) = \frac{1}{|\mathcal{A}|} \sum_{t\in\mathcal{A}} \log\pi(y_t\mid x,y_{<t}). \]
The two objectives differ. The sum naturally penalizes long responses, because adding more token logprobs usually makes the total more negative; the average focuses on per-token quality but may encourage longer responses. Whichever you choose, chosen and rejected must use the same response mask rule:
mask = labels.ne(-100)
seq_logp = (tok_logp * mask).sum(-1)
seq_len = mask.sum(-1).clamp_min(1)  # avoid division by zero
avg_logp = seq_logp / seq_len
If chosen responses are systematically longer or shorter than rejected responses, sequence logprob conventions can turn length into a shortcut feature.
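A toy pair shows how the two conventions can disagree. The numbers below are made up for illustration, and the helper name is an assumption.

```python
def sum_and_mean_logprob(token_logprobs):
    """Return (summed logprob, length-normalized logprob) for one response."""
    total = sum(token_logprobs)
    return total, total / max(len(token_logprobs), 1)


chosen = [-0.5] * 10   # long response, each token fairly likely
rejected = [-1.0] * 3  # short response, each token less likely

chosen_sum, chosen_avg = sum_and_mean_logprob(chosen)
rejected_sum, rejected_avg = sum_and_mean_logprob(rejected)
```

Under the sum the short rejected response looks more likely; under the average the chosen response does. The same pair therefore starts from a different margin depending on the convention.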
GRPO and Verifiable Rewards
GRPO-style training removes the value model by comparing multiple rollouts from the same prompt. For each prompt \(x\), sample a group:
\[ y_1,\ldots,y_G\sim\pi_{\theta_{\text{old}}}(\cdot\mid x). \]
Score each response with a verifier or reward:
\[ R_i=R(x,y_i). \]
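Then define group-relative advantages, where \(\epsilon_{\text{std}}\) is a small constant for numerical stability (distinct from the clip range \(\epsilon\) in the clipped objective):
\[ A_i = \frac{R_i-\operatorname{mean}(R_1,\ldots,R_G)} {\operatorname{std}(R_1,\ldots,R_G)+\epsilon_{\text{std}}}. \]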
The policy ratio can be sequence-level or token-averaged depending on implementation:
\[ \rho_i(\theta) = \exp \left( \log\pi_\theta(y_i\mid x) - \log\pi_{\theta_{\text{old}}}(y_i\mid x) \right). \]
A clipped objective mirrors PPO:
\[ \mathcal{L}_{\text{GRPO}} = - \mathbb{E}_i \left[ \min \left( \rho_i A_i, \operatorname{clip}(\rho_i,1-\epsilon,1+\epsilon)A_i \right) \right] + \beta\,\operatorname{KL}(\pi_\theta\|\pi_{\text{ref}}). \]
GRPO is attractive for math/code/verifiable tasks because reward can come from exact answer checkers, unit tests, or symbolic verifiers. It avoids training a separate value head, but it still needs careful sampling, reward normalization, KL control, and pass-rate monitoring.
A verifiable reward is produced by an external checker, such as exact answer matching, unit tests, formal validation, or a task-specific verifier, rather than only by human preference labels.
If all samples in a group are nearly identical, group-relative advantages collapse. If sampling is too noisy, rewards become high variance. GRPO quality depends strongly on rollout diversity.
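The group-relative advantage and its collapse case can be sketched in plain Python. The function name, the use of the population standard deviation, and the `eps` value are assumptions for illustration; implementations differ on these details.

```python
import math


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's rollout group to mean 0, std ~1."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    return [(r - mean) / (std + eps) for r in rewards]


# Binary verifier rewards for four rollouts of the same prompt.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Every rollout passes: there is nothing to compare within the group.
collapsed = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

When all rewards in a group are equal, every advantage is zero and the prompt contributes no policy gradient, which is why pass-rate and group diversity are worth monitoring.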
Offline Preference Variants
DPO is not the only offline preference objective. The family differs mainly in how it shapes the margin between chosen and rejected responses.
| Method | Core signal | Intuition |
|---|---|---|
| DPO | logistic of policy-ratio margin | implicit reward model from KL-control |
| IPO | squared preference margin target | avoids some overconfident logistic saturation |
| ORPO | SFT-like likelihood plus odds-ratio penalty | does not need a separate frozen reference model |
| SimPO | length-normalized policy margin | simpler objective, explicit target margin |
The practical question is less “which acronym is newest” and more:
- Do we have a frozen reference model?
- Are chosen/rejected lengths biased?
- Is preference data on-policy or stale?
- Does the task need exploration or only ranking correction?
- Are evals sensitive to style, truthfulness, safety, or exact correctness?
This is why the same base model may use SFT for format, DPO-like training for preference style, and GRPO/RL for verifiable reasoning.
SFT vs. DPO vs. PPO
| Method | Data | Objective | Strength | Risk |
|---|---|---|---|---|
| SFT | demonstrations | next-token CE on answer | stable, simple | imitates data quality |
| Reward model | preference pairs | pairwise logistic | learns evaluator | reward hacking if optimized too hard |
| PPO-RLHF | sampled responses + reward | KL-controlled RL | can optimize online reward | complex, high variance |
| DPO | preference pairs | logistic over policy ratios | simple offline training | depends on preference data coverage |
| GRPO | grouped sampled responses + verifier/reward | group-relative clipped policy objective | strong for verifiable tasks | rollout variance and reward hacking |
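In practice, escalate in order:
- if SFT solves the problem, start with SFT;
- if you need preference alignment but do not want online RL, try DPO first;
- if you need online exploration, tool feedback, or verifiable rewards, then consider PPO/GRPO/RL.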
Preference Data Quality
Preference data is more than a triple format; what matters is candidate generation and annotation policy. The distribution of a preference pair can be written as:
\[ x\sim p_{\text{prompt}}, \qquad y_1,y_2\sim q_{\text{candidate}}(\cdot\mid x), \qquad y_w\succ y_l\sim h(\cdot\mid x,y_1,y_2). \]
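Here \(q_{\text{candidate}}\) determines comparison difficulty. If one candidate is obviously bad and the other obviously good, the model quickly learns coarse preferences; if the two candidates are close in quality, the signal is finer but annotation noise is higher.
Common data issues: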
| Issue | Symptom | Consequence |
|---|---|---|
| trivial pairs | chosen always much better | weak fine-grained learning |
| length bias | chosen longer/more verbose | model learns verbosity |
| template artifacts | chosen/rejected formatting differs | objective learns formatting shortcut |
| stale negatives | rejected from weak old model | little pressure on current failures |
| annotator disagreement | inconsistent winners | noisy reward/preference signal |
| benchmark contamination | eval answers in training | inflated win rate |
Audit the data before preference training instead of going straight to running the loss.
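One piece of such an audit, the length-bias check from the table above, can be sketched in a few lines of plain Python. The input format (pairs of token counts), the function name, and the report keys are assumptions for illustration.

```python
def length_bias_report(pairs):
    """pairs: iterable of (chosen_token_count, rejected_token_count)."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no preference pairs to audit")
    longer = sum(1 for chosen_len, rejected_len in pairs if chosen_len > rejected_len)
    mean_gap = sum(c - r for c, r in pairs) / len(pairs)
    return {
        "chosen_longer_frac": longer / len(pairs),
        "mean_length_gap": mean_gap,
    }


# Made-up token counts for four preference pairs.
report = length_bias_report([(120, 40), (80, 90), (200, 50), (60, 60)])
```

A fraction far from 0.5 or a large mean gap suggests the pairs may teach length rather than quality, and is a cue to rebalance or to choose the logprob convention carefully.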
What to Log
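Post-training cannot be judged by loss alone:
| Metric | Meaning |
|---|---|
| chosen reward / rejected reward | whether the reward model separates the preferences |
| DPO accuracy | whether the logit ranks chosen ahead of rejected |
| KL to reference | whether the policy has drifted too far |
| response length | whether the model is gaming the objective with longer outputs |
| win rate | human/model evaluator preference |
| refusal rate | whether the safety policy over-refuses |
| format violation rate | whether chat template/tool-call output is stable |
| reward-model margin | whether the RM is overconfident or collapsing |
| PPO/GRPO clip fraction | whether policy updates are too aggressive |
| entropy | whether sampling narrows too early |
| pass@k / verifier score | whether verifiable tasks actually improve |
The danger of preference optimization is that the reward or preference signal is only a proxy objective. The model may learn to please the evaluator, pad its responses, mimic formats, or dodge hard questions rather than actually improve.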
Implementation Checklist
Before post-training, confirm at least the following, item by item:
- the chat template is consistent across the tokenizer, the training data, and the inference service;
- SFT labels cover only the intended assistant tokens;
- packed conversations have block boundaries;
- the reward model pooling position falls on response tokens;
- the reward scale is normalized or calibrated;
- the PPO rollout stores old logprobs, ref logprobs, values, and response masks;
- DPO chosen/rejected use the same template and the same logprob convention;
- the sequence logprob convention (sum or length-normalized) is explicit;
- the reference model is frozen;
- KL, length, entropy, format violations, and task evals are monitored together;
- preference data has been audited for length, template, and stale-negative bias;
- the DPO/RM pair collator checks that chosen/rejected prompt prefixes are identical;
- PPO old/ref/current logprobs use the same shifted response mask;
- the PPO KL metric is identified as sampled log-ratio, exact token KL, or sequence KL;
- online RL has a clear verifier/reward and a stopping condition.
References
- Training language models to follow instructions with human feedback, Ouyang et al.
- Proximal Policy Optimization Algorithms, Schulman et al.
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model, Rafailov et al.