4.8 Post-Training and Preference Optimization

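Pretraining teaches a model language, knowledge, and pattern completion; post-training makes it usable: it follows instructions, respects formats, avoids obviously bad behavior, and prefers the response humans would pick among several candidates.

This section does not treat RLHF and DPO as buzzwords; it breaks them down into data formats, objective functions, and implementation details.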

The Alignment Pipeline

A classic LLM post-training pipeline:

SFT: supervised fine-tuning on demonstration data;
Reward modeling: train a reward model on preference pairs;
RLHF/PPO: optimize the policy against the reward model, with a KL constraint that keeps it close to the reference model;
DPO/IPO/ORPO variants: optimize the policy directly from preference pairs, avoiding an explicit RL loop.

Definition: Policy

In language-model post-training, the policy \(\pi_\theta(y\mid x)\) is the conditional distribution over responses \(y\) given prompt \(x\).

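The policy here is not a separate model type; it is usually just the next-token distribution of a decoder-only LM, applied autoregressively so that \(\pi_\theta(y\mid x)=\prod_t \pi_\theta(y_t\mid x,y_{<t})\).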

Supervised Fine-Tuning

SFT data:

prompt x:  user instruction
answer y:  desired assistant response

After serialization:

<user> x
<assistant> y

The training objective usually computes loss only on assistant tokens:

\[ \mathcal{L}_{\text{SFT}} = - \sum_{t\in \mathcal{A}} \log \pi_\theta(y_t\mid x,y_{<t}), \]

where \(\mathcal{A}\) is the set of assistant answer token positions.

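Implementation detail (assuming a Hugging Face-style causal LM that shifts labels internally and ignores label -100):

labels = input_ids.clone()
labels[prompt_mask == 1] = -100     # no loss on prompt/template tokens
labels[attention_mask == 0] = -100  # no loss on padding
loss = model(input_ids, attention_mask=attention_mask, labels=labels).loss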

Pitfall: Training on Prompt Tokens Changes the Task

If prompt tokens contribute loss in SFT, the model is trained to model user messages and templates, not only to produce assistant responses.

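SFT is stable, cheap, and easy to debug; its weakness is that it can only imitate demonstrations and never directly compares the relative quality of multiple answers.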

SFT Collator and Packing

The key SFT implementation work lives in the collator, not in the model class. A chat sample usually contains all of:

system tokens
user tokens
assistant tokens
eos / end-of-message tokens
padding tokens

Only the assistant completion contributes to the loss. A more complete version of the collator logic:

def build_sft_labels(input_ids, role_ids, attention_mask):
    labels = input_ids.clone()

    is_assistant = role_ids.eq(ASSISTANT_ROLE_ID)
    is_visible = attention_mask.eq(1)
    trainable = is_assistant & is_visible

    labels[~trainable] = -100
    return labels

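role_ids here is not necessarily something the tokenizer outputs; in practice it is often built by recording span offsets while rendering the chat template and mapping them back to token positions. The easiest place to go wrong is the template boundary: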

<assistant>
answer tokens
<eos>

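Whether <eos> counts toward the assistant loss is a training decision. If EOS is never trained, the model may not learn to stop; if the end token of a user turn is also included in the loss, the model may learn to end at the wrong position.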
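As a rough sketch of the span-offset mapping described above, the function below marks a token as trainable only when its character range falls inside a recorded assistant span. The name assistant_token_mask, the (start, end) offset format, and the toy offsets are assumptions for illustration; real tokenizers expose offsets in their own formats.

```python
from typing import List, Tuple


def assistant_token_mask(
    token_offsets: List[Tuple[int, int]],
    assistant_spans: List[Tuple[int, int]],
) -> List[int]:
    """Return 1 for tokens that lie entirely inside an assistant character span."""
    mask = []
    for tok_start, tok_end in token_offsets:
        inside = any(
            span_start <= tok_start and tok_end <= span_end
            for span_start, span_end in assistant_spans
        )
        # Zero-width tokens carry no text, so they stay untrained.
        mask.append(1 if inside and tok_end > tok_start else 0)
    return mask


# Rendered text (hypothetical): "<user> hi\n<assistant> ok<eos>"
# Tokens: "<user>", " hi", "\n", "<assistant>", " ok", "<eos>"
offsets = [(0, 6), (6, 9), (9, 10), (10, 21), (21, 24), (24, 29)]

# Train on " ok" and "<eos>", but not on the "<assistant>" header.
mask_with_eos = assistant_token_mask(offsets, [(21, 29)])

# Same sample, but with EOS excluded from the loss.
mask_without_eos = assistant_token_mask(offsets, [(21, 24)])
```

Moving the end of the span by one token is the whole difference between training EOS and not training it, which is why the span bookkeeping deserves a unit test of its own.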

When packing multiple conversations into one sequence, there are two further options:

Packing style	Attention across samples?	Use case
simple concat with EOS	yes	pretraining-like text streams
block-diagonal attention	no	independent SFT conversations

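SFT conversations should usually use block-diagonal attention. Inserting a strong boundary token is a weaker fallback: it signals the boundary but does not block attention. Without block-diagonal masking, the answer of the second sample can read the full first conversation, and the training conditional becomes: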

\[ \pi_\theta(y_B\mid x_B,\text{sample A}), \]

rather than the intended:

\[ \pi_\theta(y_B\mid x_B). \]

Pitfall: SFT Packing Can Leak Conversations

Packing independent conversations without block boundaries lets later examples attend to earlier examples. This changes the conditional task even when the loss mask looks correct.
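A minimal sketch of the block-diagonal idea, assuming each packed token carries a segment id that identifies its source conversation (segment_ids and block_causal_mask are hypothetical names). A query position may attend to a key position only if the key is not in the future and both belong to the same segment.

```python
from typing import List


def block_causal_mask(segment_ids: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    n = len(segment_ids)
    return [
        [j <= i and segment_ids[i] == segment_ids[j] for j in range(n)]
        for i in range(n)
    ]


# Two packed conversations: sample A at positions 0-2, sample B at positions 3-4.
segment_ids = [0, 0, 0, 1, 1]
mask = block_causal_mask(segment_ids)

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With this mask the first token of sample B sees only itself, so the conditional stays \(\pi_\theta(y_B\mid x_B)\). Practical implementations often also reset position ids per segment; that step is not shown here.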

Preference Data

Preference data usually looks like this:

prompt: x
chosen response: y_w
rejected response: y_l

Here \(y_w\) is the winner and \(y_l\) is the loser. Annotators are not asked for absolute scores, only to compare two candidates.

Definition: Preference Pair

A preference pair \((x,y_w,y_l)\) states that response \(y_w\) is preferred to response \(y_l\) under prompt \(x\).

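Preference data is closer to real product goals than SFT demonstrations, because users usually do not want "one unique reference answer"; among several responses they pick the one that is more helpful, more truthful, safer, and better formatted.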

Preference Batch Tensor Contract

The most error-prone part of preference training is not the formula but how the chosen and rejected sequences are serialized. A batch should store at least:

field	chosen	rejected	invariant
`prompt_ids`	same	same	same prompt and same chat template
`input_ids`	prompt + chosen	prompt + rejected	only the response span differs
`attention_mask`	full sequence padding mask	full sequence padding mask	pad is not visible and does not count toward loss/logprob
`labels`	response tokens, prompt as `-100`	response tokens, prompt as `-100`	logprob is computed on the response only
`response_mask`	assistant response span	assistant response span	must not include user/system tokens
`pair_id`	same id	same id	supports auditing and regrouping after shuffle

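In DPO/RM training, the chosen and rejected prompts must match token for token. If the two branches render the template differently, for example with an extra space, newline, or <assistant> tag, the sequence logprob difference absorbs formatting effects.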

Definition: Response Log Probability

The response log probability is the sum of token log probabilities only over the response span, conditioned on the prompt and previous response tokens: \[ \log \pi_\theta(y\mid x) = \sum_{t\in\mathcal{A}} \log\pi_\theta(y_t\mid x,y_{<t}). \]

A pair collator can render both full sequences first and then check the prompt prefix:

def check_pair_prefix(chosen_ids, rejected_ids, prompt_len):
    if not torch.equal(chosen_ids[:prompt_len], rejected_ids[:prompt_len]):
        raise ValueError("chosen/rejected prompts differ after tokenization")

def make_response_labels(input_ids, response_mask, attention_mask):
    labels = input_ids.clone()
    trainable = response_mask.bool() & attention_mask.bool()
    labels[~trainable] = -100
    return labels

Pitfall: Pairwise Losses Need Pairwise Collation

If chosen and rejected examples are independently shuffled without a stable pair id, the loss can compare responses from different prompts and silently become meaningless.
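One way to keep pairs intact, sketched under the assumption that each record carries a pair_id plus hypothetical side, prompt, and response fields: group by pair_id, validate each group, and shuffle whole pairs rather than individual sequences.

```python
import random
from typing import Dict, List, Tuple


def shuffle_pairs(records: List[Dict], seed: int = 0) -> List[Tuple[Dict, Dict]]:
    """Group records by pair_id, validate each pair, then shuffle whole pairs."""
    by_pair: Dict[str, Dict[str, Dict]] = {}
    for rec in records:
        by_pair.setdefault(rec["pair_id"], {})[rec["side"]] = rec

    pairs = []
    for pair_id, sides in by_pair.items():
        if set(sides) != {"chosen", "rejected"}:
            raise ValueError(f"pair {pair_id!r} is missing a side")
        chosen, rejected = sides["chosen"], sides["rejected"]
        if chosen["prompt"] != rejected["prompt"]:
            raise ValueError(f"pair {pair_id!r} mixes two prompts")
        pairs.append((chosen, rejected))

    # Shuffle at pair granularity so the two sides can never drift apart.
    random.Random(seed).shuffle(pairs)
    return pairs


# Records arrive interleaved, as they might after a careless flat shuffle.
records = [
    {"pair_id": "a", "side": "chosen", "prompt": "p1", "response": "good"},
    {"pair_id": "b", "side": "rejected", "prompt": "p2", "response": "bad"},
    {"pair_id": "a", "side": "rejected", "prompt": "p1", "response": "meh"},
    {"pair_id": "b", "side": "chosen", "prompt": "p2", "response": "fine"},
]
pairs = shuffle_pairs(records)
```

Failing loudly on an incomplete or mixed pair is a deliberate choice here: a silent fallback would reproduce exactly the meaningless comparison the pitfall describes.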

Reward Model

A reward model assigns a scalar to each prompt-response pair:

\[ r_\phi(x,y)\in\mathbb{R}. \]

The Bradley-Terry model turns a reward difference into a preference probability through the logistic sigmoid \(\sigma\):

\[ P(y_w\succ y_l\mid x) = \sigma(r_\phi(x,y_w)-r_\phi(x,y_l)). \]

Reward model loss:

\[ \mathcal{L}_{\text{RM}} = - \log \sigma(r_\phi(x,y_w)-r_\phi(x,y_l)). \]

Proof: Pairwise Logistic Loss

Assume the preference probability is

\[ P(y_w\succ y_l\mid x)=\sigma(\Delta r), \qquad \Delta r=r_\phi(x,y_w)-r_\phi(x,y_l). \]

The winner was observed to win, so the negative log-likelihood of that observation is:

\[ -\log P(y_w\succ y_l\mid x) = -\log\sigma(\Delta r). \]

A reward model is usually an LM backbone plus a scalar head. Implementations commonly take the hidden state of the last answer token, or pool over answer tokens and then apply a linear layer.

Reward Model Implementation

A reward model is often written as:

import torch
from torch import nn

class RewardModel(nn.Module):
    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone
        self.score = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, input_ids, attention_mask, response_mask):
        out = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        h = out.hidden_states[-1]
        last_idx = last_response_index(response_mask)
        batch_idx = torch.arange(h.shape[0], device=h.device)
        pooled = h[batch_idx, last_idx]
        return self.score(pooled).squeeze(-1)

last_response_index should return the position of the last response token in the full sequence, not a response-local position:

def last_response_index(response_mask):
    pos = torch.arange(response_mask.size(1), device=response_mask.device)
    masked_pos = pos[None, :].masked_fill(~response_mask.bool(), -1)
    last = masked_pos.max(dim=1).values
    if (last < 0).any():
        raise ValueError("response_mask contains an empty response")
    return last

Mean pooling over response tokens is also possible:

\[ h_{\text{resp}} = \frac{\sum_t m_t h_t}{\sum_t m_t}. \]

Reward model outputs are meaningful only as differences. The Bradley-Terry loss is insensitive to adding the same constant to every reward:

\[ (r_w+c)-(r_l+c)=r_w-r_l. \]

Reward scale and offset therefore need separate calibration, especially when the reward is later fed into PPO/GRPO. Common practices include reward normalization, per-batch whitening, and a fixed or adaptively tuned KL coefficient.
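The offset invariance is easy to check numerically. The sketch below is plain Python with hypothetical helper names (log_sigmoid, rm_pairwise_loss); it evaluates the pairwise logistic loss on one pair and on the same pair with a constant added to both rewards.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def rm_pairwise_loss(r_w: float, r_l: float) -> float:
    """Bradley-Terry negative log-likelihood for one (winner, loser) pair."""
    return -log_sigmoid(r_w - r_l)


loss = rm_pairwise_loss(2.0, 0.5)

# Adding the same constant to both rewards leaves the loss unchanged.
shifted = rm_pairwise_loss(2.0 + 100.0, 0.5 + 100.0)

# Swapping winner and loser gives a larger loss, as expected.
swapped = rm_pairwise_loss(0.5, 2.0)

print(loss, shifted, swapped)
```

Because only the difference enters the loss, two reward models with identical training loss can report very different raw scores, which is one reason to normalize before using the scores in RL.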

Pitfall: Reward Scores Are Not Absolute Truth

Pairwise reward training identifies relative preferences more directly than calibrated absolute utilities. Treat raw reward magnitudes as training signals that need monitoring and normalization.

RLHF with KL Regularization

RLHF does not optimize the raw reward; it optimizes the reward minus a KL penalty:

\[ \max_\pi \mathbb{E}_{y\sim\pi(\cdot\mid x)} \left[ r_\phi(x,y) - \beta \log\frac{\pi(y\mid x)}{\pi_{\text{ref}}(y\mid x)} \right]. \]

The second term keeps the policy from drifting too far from the reference model. Without it, the policy will exploit weaknesses in the reward model.

Definition: KL-Controlled Policy Objective

KL-controlled policy optimization maximizes reward while penalizing divergence from a reference policy: \[ J(\pi)=\mathbb{E}_{y\sim\pi}[r(y)]-\beta\operatorname{KL}(\pi\|\pi_{\text{ref}}). \]

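For a language model, the sequence-level log-ratio, whose expectation under \(\pi_\theta\) is the KL term, decomposes into per-token logprob differences: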

\[ \log\frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)} = \sum_t \left[ \log\pi_\theta(y_t\mid x,y_{<t}) - \log\pi_{\text{ref}}(y_t\mid x,y_{<t}) \right]. \]

This is why RLHF implementations keep both policy logprobs and reference logprobs.

PPO Objective

PPO samples with the old policy and then optimizes the new policy with a clipped surrogate objective. Let

\[ \rho_t(\theta) = \frac{\pi_\theta(a_t\mid s_t)} {\pi_{\text{old}}(a_t\mid s_t)}. \]

PPO clipped objective:

\[ L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min \left( \rho_t(\theta)A_t, \operatorname{clip}(\rho_t(\theta),1-\epsilon,1+\epsilon)A_t \right) \right]. \]

Definition: Advantage

The advantage \(A_t\) estimates how much better an action is than the baseline value at state \(s_t\): \[ A_t\approx R_t-V(s_t). \]
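A small sketch of the per-token term inside the clipped objective may help; ppo_clip_term is a hypothetical helper and eps=0.2 is only an illustrative default. It shows the asymmetry of the min: gains from moving the ratio far from 1 are capped, while losses are not.

```python
def ppo_clip_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-token PPO surrogate: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage, ratio already above 1 + eps: the gain is capped.
capped_gain = ppo_clip_term(ratio=1.5, advantage=1.0)

# Positive advantage, ratio below 1 - eps: the unclipped (smaller) value is kept.
uncapped_low = ppo_clip_term(ratio=0.5, advantage=1.0)

# Negative advantage, ratio below 1 - eps: the clipped value is kept.
capped_neg = ppo_clip_term(ratio=0.5, advantage=-1.0)

# Negative advantage, ratio above 1 + eps: the full penalty is kept.
uncapped_neg = ppo_clip_term(ratio=1.5, advantage=-1.0)

print(capped_gain, uncapped_low, capped_neg, uncapped_neg)
```

In the two capped cases the term no longer depends on the ratio, so those tokens contribute no gradient; the share of such tokens is what a clip-fraction metric measures.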

In LLM RLHF, the state is the prompt plus the partial response, and the action is the next token. PPO can sample online and optimize reward, but the engineering is complex:

rollout generation;
reward scoring;
reference KL;
value model;
advantage estimation;
PPO epochs/minibatches;
KL/reward/entropy logging.

When data quality is sufficient, many settings try offline preference optimization such as DPO first.

PPO Rollout and Advantage Construction

LLM PPO does not treat a whole response as one ordinary classification example; it treats it as a token-level trajectory:

state s_t: prompt + generated tokens before t
action a_t: generated token y_t
reward: usually sequence reward plus token-level KL penalties

In practice the reward is often split into a per-token KL term:

\[ r_t = -\beta \left( \log\pi_\theta(y_t\mid s_t) - \log\pi_{\text{ref}}(y_t\mid s_t) \right), \]

with the reward model score added at the last token:

\[ r_T \leftarrow r_T + r_\phi(x,y). \]

Every token thus bears a KL cost, and the final answer bears the preference reward. With a value model \(V_\psi(s_t)\), GAE can be used:

\[ \delta_t = r_t+\gamma V_\psi(s_{t+1})-V_\psi(s_t), \]

\[ A_t = \sum_{l=0}^{T-t} (\gamma\lambda)^l\delta_{t+l}. \]
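The GAE sum is usually computed with a backward recursion, \(A_t=\delta_t+\gamma\lambda A_{t+1}\). The sketch below assumes 0-indexed per-token rewards and values for a single response and a value of zero after the final token; the function name gae and the toy numbers are illustrative.

```python
from typing import List


def gae(
    rewards: List[float],
    values: List[float],
    gamma: float = 1.0,
    lam: float = 0.95,
) -> List[float]:
    """Generalized advantage estimation for one finished trajectory."""
    horizon = len(rewards)
    advantages = [0.0] * horizon
    running = 0.0
    for t in reversed(range(horizon)):
        # Assumption: the episode ends after the last token, so V(s_T) = 0.
        next_value = values[t + 1] if t + 1 < horizon else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Sequence reward of 1.0 on the last token, constant value estimate of 0.5.
advantages = gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=1.0)

# With lam = 0 each advantage collapses to its one-step TD error.
td_only = gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=0.0)
```

With gamma and lam both 1 the estimate reduces to return minus value, so every token inherits the terminal reward; lowering lam trades that variance for more reliance on the value model.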

A PPO batch usually stores:

Tensor	Meaning
`input_ids`	prompt + sampled response
`response_mask`	generated token positions
`old_logprobs`	rollout policy logprobs
`ref_logprobs`	frozen reference logprobs
`rewards`	reward model + KL-shaped token rewards
`values`	value model predictions
`advantages`	normalized advantage estimates

Pitfall: PPO Needs the Rollout Logprobs

The PPO ratio compares the updated policy to the policy that generated the tokens. Recomputing only current logprobs is not enough; old_logprobs must be stored with the rollout batch.

PPO Token Masks and KL Rewards

Every token in LLM PPO is an action, but not every token should enter the PPO loss. Prompt tokens are conditioning; only response tokens are policy rollout. With response mask \(m_t\in\{0,1\}\), the token-level ratio is

\[ \rho_t(\theta) = \exp\left[ \log\pi_\theta(y_t\mid s_t) - \log\pi_{\text{old}}(y_t\mid s_t) \right], \qquad t\in\mathcal{A}. \]

The implementation first gathers the logprob of each sampled token:

def gather_token_logprobs(logits, labels, response_mask):
    # logits: [B, T, V], labels: [B, T]
    logp = torch.log_softmax(logits[:, :-1], dim=-1)
    target = labels[:, 1:].clamp_min(0)
    mask = response_mask[:, 1:].bool()
    tok_logp = logp.gather(-1, target[..., None]).squeeze(-1)
    return tok_logp, mask

Note the same label shift as in SFT: the logits at position \(t-1\) predict the token at position \(t\). response_mask[:, 1:] must align with the shifted labels.

KL shaping commonly uses a sampled-action estimator:

\[ k_t = \log\pi_\theta(y_t\mid s_t) - \log\pi_{\text{ref}}(y_t\mid s_t). \]

This is not the exact KL over the full vocabulary but the log-ratio on the rollout token. It is cheap, and it yields exactly the per-token penalty needed:

\[ r_t^{\text{KL}}=-\beta k_t. \]

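A concrete construction:

def build_ppo_rewards(scores, logp, ref_logp, response_mask, beta):
    # scores: [B], scalar reward model score for whole response
    # logp, ref_logp: [B, T-1], already shifted; response_mask: [B, T]
    mask = response_mask[:, 1:].bool()
    rewards = -beta * (logp - ref_logp)
    rewards = rewards * mask

    # Last response position in full shifted-sequence coordinates (-1 if empty).
    # A response-local index (length - 1) would land inside the prompt.
    pos = torch.arange(mask.size(1), device=mask.device)
    last = pos[None, :].masked_fill(~mask, -1).max(dim=1).values
    if last.lt(0).any():
        raise ValueError("PPO reward construction received an empty response")

    batch = torch.arange(mask.size(0), device=mask.device)
    rewards[batch, last] += scores
    return rewards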

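This guard matters: for an empty response, last = -1 would add the reward to the final position of the row, which is typically a padding token. The rollout stage should therefore reject empty responses, or reward construction should raise explicitly.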

Pitfall: KL Estimator Scope Must Be Logged

Token log-ratio on sampled actions, exact categorical KL over the vocabulary, and sequence-level KL are different quantities. Log which one your PPO loop uses.
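To make the distinction concrete, the sketch below compares the exact categorical KL with the sampled-token log-ratio on a toy three-token vocabulary. The distributions and function names are made up for illustration.

```python
import math
from typing import List


def exact_kl(p: List[float], q: List[float]) -> float:
    """Exact KL(p || q) over a full categorical distribution."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sampled_log_ratio(p: List[float], q: List[float], token: int) -> float:
    """Single-sample estimator: log p(token) - log q(token)."""
    return math.log(p[token]) - math.log(q[token])


policy = [0.7, 0.2, 0.1]
reference = [0.5, 0.3, 0.2]

kl = exact_kl(policy, reference)

# The estimator is unbiased when the token is sampled from the policy ...
expected = sum(
    policy[t] * sampled_log_ratio(policy, reference, t) for t in range(len(policy))
)

# ... but a single sample can be negative, even though the exact KL never is.
one_sample = sampled_log_ratio(policy, reference, token=1)

print(kl, expected, one_sample)
```

A dashboard that averages the sampled log-ratio can therefore dip below zero on small batches; that is expected for this estimator and would be a bug for an exact KL.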

A minimal training loop:

sample responses with policy_old
compute old_logprobs, ref_logprobs, rewards, values
compute advantages and returns
for ppo_epoch:
    for minibatch:
        recompute policy logprobs and values
        optimize clipped policy loss + value loss - entropy bonus
monitor KL, reward, length, clip fraction

DPO Derivation

DPO starts from the KL-regularized reward objective. For a fixed prompt \(x\), the optimal policy satisfies:

\[ \pi^\star(y\mid x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y\mid x) \exp\left(\frac{1}{\beta}r(x,y)\right). \]

Here \(Z(x)=\sum_y \pi_{\text{ref}}(y\mid x)\exp(r(x,y)/\beta)\) is the normalizer. Taking logs and rearranging gives the relation between reward and policy ratio:

\[ r(x,y) = \beta \log \frac{\pi^\star(y\mid x)} {\pi_{\text{ref}}(y\mid x)} + \beta\log Z(x). \]

Taking the difference between winner and loser for the same prompt cancels \(Z(x)\):

\[ r(x,y_w)-r(x,y_l) = \beta \left[ \log\frac{\pi^\star(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \log\frac{\pi^\star(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)} \right]. \]

Parameterize \(\pi^\star\) with the current policy \(\pi_\theta\) and substitute into the Bradley-Terry preference model:

\[ P_\theta(y_w\succ y_l\mid x) = \sigma \left( \beta \left[ \log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)} \right] \right). \]

DPO loss:

\[ \mathcal{L}_{\text{DPO}} = - \log \sigma \left( \beta \left[ \Delta\log\pi_\theta - \Delta\log\pi_{\text{ref}} \right] \right), \]

where

\[ \Delta\log\pi_\theta = \log\pi_\theta(y_w\mid x)-\log\pi_\theta(y_l\mid x), \]

and \(\Delta\log\pi_{\text{ref}}\) is the same difference computed under \(\pi_{\text{ref}}\).

DPO Gradient Intuition

Let

\[ z = \beta \left[ (\log\pi_\theta(y_w\mid x)-\log\pi_\theta(y_l\mid x)) - (\log\pi_{\text{ref}}(y_w\mid x)-\log\pi_{\text{ref}}(y_l\mid x)) \right]. \]

The DPO loss is

\[ \ell(z)=-\log\sigma(z). \]

Its derivative is

\[ \frac{\partial \ell}{\partial z} = \sigma(z)-1. \]

Therefore

\[ \frac{\partial \ell}{\partial \log\pi_\theta(y_w\mid x)} = \beta(\sigma(z)-1), \]

\[ \frac{\partial \ell}{\partial \log\pi_\theta(y_l\mid x)} = \beta(1-\sigma(z)). \]

Gradient descent therefore raises the chosen logprob and lowers the rejected logprob. When \(z\) is already large, \(\sigma(z)\approx 1\) and the gradient approaches 0: the model has already separated the pair. \(\beta\) sets the slope of this margin: if \(\beta\) is too small the signal is weak; if it is too large the loss saturates quickly and becomes more sensitive to noisy pairs.

Proof

Since

\[ \ell(z)=-\log\sigma(z), \]

and

\[ \frac{d}{dz}\log\sigma(z)=1-\sigma(z), \]

it follows that

\[ \frac{\partial\ell}{\partial z}=\sigma(z)-1. \]

Moreover,

\[ \frac{\partial z}{\partial \log\pi_\theta(y_w\mid x)}=\beta, \qquad \frac{\partial z}{\partial \log\pi_\theta(y_l\mid x)}=-\beta, \]

so the chain rule gives the two gradients above.
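The closed-form gradients can be sanity-checked against finite differences. The sketch below works on scalar sequence logprobs with made-up values; dpo_loss and dpo_grads are hypothetical helper names, and the plain sigmoid is adequate only for the moderate \(z\) used here.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def dpo_loss(pi_w: float, pi_l: float, ref_w: float, ref_l: float, beta: float) -> float:
    z = beta * ((pi_w - pi_l) - (ref_w - ref_l))
    return -math.log(sigmoid(z))


def dpo_grads(pi_w: float, pi_l: float, ref_w: float, ref_l: float, beta: float):
    """Analytic d(loss)/d(pi_w) and d(loss)/d(pi_l)."""
    z = beta * ((pi_w - pi_l) - (ref_w - ref_l))
    grad_w = beta * (sigmoid(z) - 1.0)
    return grad_w, -grad_w


# Illustrative sequence logprobs; here z = 0.1 * ((3.0) - (1.0)) = 0.2.
args = dict(pi_w=-12.0, pi_l=-15.0, ref_w=-13.0, ref_l=-14.0, beta=0.1)
grad_w, grad_l = dpo_grads(**args)

# Central finite difference with respect to the chosen logprob.
h = 1e-6
plus = dpo_loss(**{**args, "pi_w": args["pi_w"] + h})
minus = dpo_loss(**{**args, "pi_w": args["pi_w"] - h})
numeric_grad_w = (plus - minus) / (2 * h)

print(grad_w, numeric_grad_w, grad_l)
```

The chosen gradient is negative and the rejected gradient is its exact mirror, so a descent step raises one logprob and lowers the other by the same weight.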

Proof Sketch

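For each prompt, the KL-regularized objective is:

\[ \max_\pi \sum_y\pi(y) \left[ r(y)-\beta\log\frac{\pi(y)}{\pi_{\text{ref}}(y)} \right]. \]

Add the constraint \(\sum_y\pi(y)=1\) with multiplier \(\lambda\) and take the first-order condition with respect to \(\pi(y)\):

\[ r(y)-\beta\left(\log\frac{\pi(y)}{\pi_{\text{ref}}(y)}+1\right)+\lambda=0. \]

Rearranging, the constants \(-\beta\) and \(\lambda\) do not depend on \(y\) and are absorbed into the normalizer:

\[ \pi(y)\propto \pi_{\text{ref}}(y)\exp(r(y)/\beta). \]

Substituting the resulting reward difference into the Bradley-Terry preference probability then yields the DPO loss.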
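The closed form can be checked on a toy problem with three candidate responses. Everything below (the reference probabilities, the rewards, \(\beta=0.5\), and the function names) is an illustrative assumption. The check confirms that the implicit reward \(\beta\log(\pi^\star/\pi_{\text{ref}})\) reproduces reward differences, and that the optimum attains the value \(\beta\log Z\).

```python
import math
from typing import List


def optimal_policy(ref: List[float], rewards: List[float], beta: float) -> List[float]:
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    weights = [q * math.exp(r / beta) for q, r in zip(ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights]


def objective(pi: List[float], ref: List[float], rewards: List[float], beta: float) -> float:
    """Expected reward minus beta * KL(pi || pi_ref)."""
    return sum(
        p * (r - beta * math.log(p / q))
        for p, q, r in zip(pi, ref, rewards)
        if p > 0
    )


ref = [0.5, 0.3, 0.2]
rewards = [1.0, 0.0, 2.0]
beta = 0.5

pi_star = optimal_policy(ref, rewards, beta)
implicit = [beta * math.log(p / q) for p, q in zip(pi_star, ref)]

# Implicit reward differences equal true reward differences; Z cancels.
implicit_gap = implicit[0] - implicit[2]
true_gap = rewards[0] - rewards[2]

# The optimum scores at least as well as other distributions.
best = objective(pi_star, ref, rewards, beta)
at_reference = objective(ref, ref, rewards, beta)
at_other = objective([0.2, 0.2, 0.6], ref, rewards, beta)
log_z = math.log(sum(q * math.exp(r / beta) for q, r in zip(ref, rewards)))
```

The per-response implicit rewards differ from the true rewards by the constant \(\beta\log Z\), which is exactly why DPO can only be derived from reward differences.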

DPO Batch Implementation

The core step is computing the sequence logprob of the chosen and rejected responses:

def sequence_logprob(model, batch):
    out = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
    )
    logits = out.logits[:, :-1, :]
    labels = batch["labels"][:, 1:]
    mask = labels.ne(-100)
    labels = labels.clamp_min(0)

    logp = torch.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return (tok_logp * mask).sum(dim=-1)

DPO loss:

pi_w = sequence_logprob(policy, chosen)
pi_l = sequence_logprob(policy, rejected)

with torch.no_grad():
    ref_w = sequence_logprob(reference, chosen)
    ref_l = sequence_logprob(reference, rejected)

logit = beta * ((pi_w - pi_l) - (ref_w - ref_l))
loss = -torch.nn.functional.logsigmoid(logit).mean()

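Implementation pitfalls:

the reference model must be frozen;
prompt tokens and pad tokens are usually set to -100;
chosen/rejected logprobs must be computed under the same template;
whether the sequence logprob is a sum or length-normalized must match the recipe;
\(\beta\) sets the strength of the implicit KL penalty: larger values keep the policy closer to the reference;
chosen/rejected must stay pair-aligned and never be compared across prompts;
the label shift, the response_mask shift, and attention_mask must be consistent.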

Pitfall: DPO Is Sensitive to Formatting

If chosen and rejected responses are tokenized under different chat templates, the logprob comparison includes formatting artifacts, not only response quality.

Length Normalization and Label Masks

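The DPO sequence logprob can be a sum or a length-normalized average:

\[ \log\pi(y\mid x) = \sum_{t\in\mathcal{A}} \log\pi(y_t\mid x,y_{<t}) \]

or

\[ \bar{\ell}(y\mid x) = \frac{1}{|\mathcal{A}|} \sum_{t\in\mathcal{A}} \log\pi(y_t\mid x,y_{<t}). \]

The two define different objectives. For raw log-likelihood, a sum tends to penalize long responses, because adding more token logprobs usually makes the total more negative; an average focuses on per-token quality but can encourage longer outputs. In DPO the quantity that enters the loss is the log-ratio against the reference, so the net length effect also depends on how chosen and rejected lengths differ in the data. Whichever convention is used, chosen and rejected must follow the same response mask rule: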

mask = labels.ne(-100)
seq_logp = (tok_logp * mask).sum(-1)
seq_len = mask.sum(-1).clamp_min(1)
avg_logp = seq_logp / seq_len

Pitfall: DPO Can Learn Length Bias

If chosen responses are systematically longer or shorter than rejected responses, sequence logprob conventions can turn length into a shortcut feature.
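A toy comparison shows how the convention alone can flip a ranking. The token logprobs below are invented, and the example uses raw log-likelihood rather than the DPO log-ratio, so it illustrates the mechanism and not any particular recipe.

```python
from typing import List, Tuple


def sequence_scores(token_logprobs: List[float]) -> Tuple[float, float]:
    """Return (sum, length-normalized average) of response token logprobs."""
    total = sum(token_logprobs)
    return total, total / max(len(token_logprobs), 1)


# A short response with mediocre per-token logprobs ...
short_sum, short_avg = sequence_scores([-0.5, -0.5])

# ... and a longer response with better per-token logprobs.
long_sum, long_avg = sequence_scores([-0.3, -0.3, -0.3, -0.3, -0.3])

sum_prefers_short = short_sum > long_sum
avg_prefers_long = long_avg > short_avg

print(sum_prefers_short, avg_prefers_long)
```

Both flags are true for these numbers: the same pair is ordered one way by the sum and the other way by the average, which is why the convention has to be fixed and logged alongside response length.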

GRPO and Verifiable Rewards

GRPO-style training removes the value model by comparing multiple rollouts from the same prompt. For each prompt \(x\), sample a group:

\[ y_1,\ldots,y_G\sim\pi_{\theta_{\text{old}}}(\cdot\mid x). \]

Score each response with a verifier or reward:

\[ R_i=R(x,y_i). \]

Then define group-relative advantages:

\[ A_i = \frac{R_i-\operatorname{mean}(R_1,\ldots,R_G)} {\operatorname{std}(R_1,\ldots,R_G)+\epsilon}. \]

The policy ratio can be sequence-level or token-averaged depending on implementation:

\[ \rho_i(\theta) = \exp \left( \log\pi_\theta(y_i\mid x) - \log\pi_{\theta_{\text{old}}}(y_i\mid x) \right). \]

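A clipped objective mirrors PPO (here \(\epsilon\) is the clip range, distinct from the small stabilizing \(\epsilon\) in the advantage denominator above):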

\[ \mathcal{L}_{\text{GRPO}} = - \mathbb{E}_i \left[ \min \left( \rho_i A_i, \operatorname{clip}(\rho_i,1-\epsilon,1+\epsilon)A_i \right) \right] + \beta\,\operatorname{KL}(\pi_\theta\|\pi_{\text{ref}}). \]

GRPO is attractive for math/code/verifiable tasks because reward can come from exact answer checkers, unit tests, or symbolic verifiers. It avoids training a separate value head, but it still needs careful sampling, reward normalization, KL control, and pass-rate monitoring.

Definition: Verifiable Reward

A verifiable reward is produced by an external checker, such as exact answer matching, unit tests, formal validation, or a task-specific verifier, rather than only by human preference labels.

Pitfall: Group Advantage Depends on Sampling Temperature

If all samples in a group are nearly identical, group-relative advantages collapse. If sampling is too noisy, rewards become high variance. GRPO quality depends strongly on rollout diversity.
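The group-relative advantage and its collapse case can be sketched in a few lines. The helper name group_advantages, the use of the population standard deviation, and the eps value are assumptions; implementations differ on these details.

```python
import math
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one prompt's rollout group."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)  # population variance
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# A verifier that returns 1.0 for a correct answer and 0.0 otherwise.
mixed_group = group_advantages([1.0, 0.0, 0.0, 1.0])

# Every rollout solved the task: no rollout is better than the group.
collapsed_group = group_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed_group)
print(collapsed_group)
```

In the collapsed group every advantage is zero, so the prompt contributes no policy gradient regardless of how good the answers are. Tracking the fraction of such groups is a cheap proxy for rollout diversity.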

Offline Preference Variants

DPO is not the only offline preference objective. The family differs mainly in how it shapes the margin between chosen and rejected responses.

Method	Core signal	Intuition
DPO	logistic of policy-ratio margin	implicit reward model from KL-control
IPO	squared preference margin target	avoids some overconfident logistic saturation
ORPO	SFT-like likelihood plus odds-ratio penalty	no separate reference model in the same way
SimPO	length-normalized policy margin	simpler objective, explicit target margin

The practical question is less “which acronym is newest” and more:

Do we have a frozen reference model?
Are chosen/rejected lengths biased?
Is preference data on-policy or stale?
Does the task need exploration or only ranking correction?
Are evals sensitive to style, truthfulness, safety, or exact correctness?

This is why the same base model may use SFT for format, DPO-like training for preference style, and GRPO/RL for verifiable reasoning.

SFT vs. DPO vs. PPO

Method	Data	Objective	Strength	Risk
SFT	demonstrations	next-token CE on answer	stable, simple	imitates data quality
Reward model	preference pairs	pairwise logistic	learns evaluator	reward hacking if optimized too hard
PPO-RLHF	sampled responses + reward	KL-controlled RL	can optimize online reward	complex, high variance
DPO	preference pairs	logistic over policy ratios	simple offline training	depends on preference data coverage
GRPO	grouped sampled responses + verifier/reward	group-relative clipped policy objective	strong for verifiable tasks	rollout variance and reward hacking

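In practice, escalate in order:

if SFT solves the problem, start with SFT;
if preference alignment is needed without online RL, try DPO first;
if online exploration, tool feedback, or verifiable rewards are needed, then consider PPO/GRPO/RL.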

Preference Data Quality

Preference data is more than a triple format; what matters is candidate generation and annotation policy. The distribution of a preference pair can be written as:

\[ x\sim p_{\text{prompt}}, \qquad y_1,y_2\sim q_{\text{candidate}}(\cdot\mid x), \qquad y_w\succ y_l\sim h(\cdot\mid x,y_1,y_2). \]

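Here \(q_{\text{candidate}}\) determines how hard the comparison is. If one candidate is obviously bad and the other obviously good, the model quickly learns coarse preferences; if the two are close in quality, the signal is finer but annotation noise is higher.

Common data problems: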

Issue	Symptom	Consequence
trivial pairs	chosen always much better	weak fine-grained learning
length bias	chosen longer/more verbose	model learns verbosity
template artifacts	chosen/rejected formatting differs	objective learns formatting shortcut
stale negatives	rejected from weak old model	little pressure on current failures
annotator disagreement	inconsistent winners	noisy reward/preference signal
benchmark contamination	eval answers in training	inflated win rate

Audit the data before preference training instead of going straight to the loss.
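A first-pass audit can be very small. The sketch below assumes each pair is a dict with hypothetical chosen and rejected fields holding response token ids, and reports three numbers that relate to the length-bias and trivial-pair rows in the table above. The metric names are illustrative.

```python
from typing import Dict, List


def audit_pairs(pairs: List[Dict[str, List[int]]]) -> Dict[str, float]:
    """Summarize length bias and degenerate pairs in a preference dataset."""
    n = len(pairs)
    if n == 0:
        raise ValueError("no pairs to audit")
    chosen_longer = sum(len(p["chosen"]) > len(p["rejected"]) for p in pairs)
    identical = sum(p["chosen"] == p["rejected"] for p in pairs)
    mean_gap = sum(len(p["chosen"]) - len(p["rejected"]) for p in pairs) / n
    return {
        "chosen_longer_rate": chosen_longer / n,
        "identical_rate": identical / n,
        "mean_length_gap": mean_gap,
    }


# Toy token-id lists standing in for tokenized responses.
pairs = [
    {"chosen": [1, 2, 3, 4], "rejected": [1, 2]},
    {"chosen": [5, 6, 7], "rejected": [5]},
    {"chosen": [8], "rejected": [8]},
    {"chosen": [9, 9], "rejected": [9, 9, 9]},
]
report = audit_pairs(pairs)
print(report)
```

A chosen_longer_rate far from 0.5 or a large mean_length_gap suggests the objective may learn length as a shortcut; identical pairs carry no preference signal and are usually worth removing.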

What to Log

Post-training cannot be judged by loss alone:

Metric	Meaning
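chosen reward / rejected reward	whether the reward model separates preferences
DPO accuracy	whether the logit ranks chosen above rejected
KL to reference	whether the policy has drifted too far
response length	whether the model is gaming the objective by getting longer
win rate	human/model evaluator preference
refusal rate	whether the safety policy over-refuses
format violation rate	whether chat template/tool-call output is stable
reward-model margin	whether the RM is overconfident or collapsing
PPO/GRPO clip fraction	whether policy updates are too aggressive
entropy	whether sampling narrows too early
pass@k / verifier score	whether verifiable tasks actually improve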

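The danger of preference optimization is that the reward or preference signal is only a proxy objective. The model may learn to please the evaluator, pad its answers, imitate formatting, or dodge hard questions instead of actually getting better.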

Implementation Checklist

Before post-training, confirm at least each of the following:

the chat template is consistent across tokenizer, training data, and inference serving;
SFT labels cover only the intended assistant tokens;
packed conversations have block boundaries;
the reward model pooling position falls on response tokens;
the reward scale has been normalized or calibrated;
PPO rollouts store old logprobs, ref logprobs, values, and response masks;
DPO chosen/rejected use the same template and the same logprob convention;
the sequence logprob convention (sum or length-normalized) is fixed and documented;
the reference model is frozen;
KL, length, entropy, format violations, and task evals are monitored together;
preference data has been audited for length, template, and stale-negative bias;
the DPO/RM pair collator checks that chosen/rejected prompt prefixes are identical;
PPO old/ref/current logprobs all use the same shifted response mask;
the PPO KL metric is identified as sampled log-ratio, exact token KL, or sequence KL;
online RL has an explicit verifier/reward and a stopping condition.

References

Training language models to follow instructions with human feedback, Ouyang et al.
Proximal Policy Optimization Algorithms, Schulman et al.
Direct Preference Optimization: Your Language Model is Secretly a Reward Model, Rafailov et al.