2.8 Discrete Modeling and Training Paradigms

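Earlier sections introduced architectures such as CNNs, RNNs, GNNs, GANs, and diffusion models. Architecture is only part of what matters in deep learning; the training paradigm matters just as much. Much of a modern model's capability comes from the combination of objective function, data construction, supervision signal, and sampling procedure.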

Start from One Sentence

Consider one sentence:

the cat sat on the mat

The same data can yield several training tasks:

Paradigm	Constructed input	Target
supervised classification	sentence	label, e.g. positive/negative
autoregressive LM	`the cat sat`	next token `on`
masked modeling	`the [MASK] sat on the mat`	token `cat`
contrastive learning	sentence + image	matched pair
preference learning	two generated continuations	preferred one

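So a “training paradigm” is not an abstract term: it is how you construct input-target pairs from raw data. The architecture can stay the same, yet a different task construction produces different learned behavior.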

A Training Paradigm Is a Data-Objective Pipeline

Definition: Training Paradigm

A training paradigm specifies how raw data are transformed into model inputs, prediction targets, losses, sampling distributions, and evaluation protocols. It is the bridge between a model architecture and the behavior learned by that model.

A training paradigm contains at least five components:

the raw data distribution \(p_{\text{data}}\);
the task construction \(q(c,y\mid x)\), which turns a raw example into a context \(c\) and a target \(y\);
the model distribution \(p_\theta(y\mid c)\);
a loss or reward;
a sampling/evaluation protocol.

These fit into one unified form:

\[ \min_\theta \mathbb{E}_{x\sim p_{\text{data}}} \mathbb{E}_{(c,y)\sim q(\cdot\mid x)} \left[ \ell_\theta(c,y) \right]. \]

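This expression is central. It shows that training paradigms often differ not in the neural network layers but in \(q(c,y\mid x)\): the same text can be cut into next-token pairs, randomly masked into denoising pairs, combined with an image into a contrastive pair, or answered twice by a model so that the two responses form a preference pair.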
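
As a minimal sketch of the idea that \(q(c,y\mid x)\) is where paradigms differ, the snippet below builds two different sets of (context, target) pairs from the same sentence. The function names `build_ar_pairs` and `build_masked_pair`, the whitespace tokenizer, and the mask probability are illustrative assumptions, not part of any library.

```python
import random


def build_ar_pairs(tokens):
    """Next-token construction: every prefix is a context, the following token is its target."""
    return [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]


def build_masked_pair(tokens, mask_prob, rng, mask_token="[MASK]"):
    """Denoising construction: hide tokens at random; hidden tokens become targets keyed by position."""
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(mask_token)
            targets[i] = tok
        else:
            corrupted.append(tok)
    return corrupted, targets


sentence = "the cat sat on the mat".split()
ar_pairs = build_ar_pairs(sentence)
corrupted, targets = build_masked_pair(sentence, mask_prob=0.3, rng=random.Random(0))

print(ar_pairs[2])          # context ['the', 'cat', 'sat'] with target 'on'
print(corrupted, targets)   # which positions are hidden depends on the seed
```

Both constructors consume the same raw record; only the rule that splits it into context and target changes.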

From a Record to a Batch

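Saying “feed in a sentence, predict a target” is still too abstract. When a training paradigm is actually implemented, a raw record passes through a four-stage pipeline: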

raw record -> example builder -> collator -> loss reducer

Take the text the cat sat on the mat and suppose the tokenizer produces:

token:  <bos> the cat sat on the mat <eos>
id:       1   42  95 317 18  42 510   2

One autoregressive LM training example can be written as:

input_ids: [1, 42, 95, 317, 18, 42, 510]
labels:    [42, 95,317, 18, 42,510,   2]
mask:      [1,  1,  1,  1,  1,  1,   1]

For a masked LM, the same record might become:

input_ids: [1, 42, MASK, 317, 18, 42, 510, 2]
labels:    [-100,-100,95,-100,-100,-100,-100,-100]
mask:      [0,   0,   1, 0,   0,   0,   0,   0]

For instruction tuning, a record might become:

input_ids: [<user>, explain, clipping, <assistant>, clipping, rescales, ...]
labels:    [-100,   -100,    -100,     -100,        clipping, rescales, ...]
mask:      [0,      0,       0,        0,           1,        1,        ...]

In all three cases the model forward pass may look identical:

logits = model(input_ids, attention_mask=attention_mask).logits
loss = cross_entropy(logits, labels, ignore_index=-100)

But they are not the same task. The pattern of labels == -100, the attention_mask, position ids, document-boundary masks, and sample weights jointly define the empirical objective:

\[ \widehat{\mathcal{L}}(\theta) = \frac{ \sum_{b=1}^{B}\sum_{t=1}^{T} m_{b,t}\, \ell_\theta(c_{b,t},y_{b,t}) }{ \sum_{b=1}^{B}\sum_{t=1}^{T}m_{b,t} }. \]

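The denominator matters too. Averaging per sample or per batch instead of over valid tokens gives a token in a short sample more weight than a token in a long one. Likewise, if each DDP rank takes its own mean before the means are averaged, the overall objective changes whenever ranks hold different numbers of valid tokens.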
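
To make the role of the denominator concrete, here is a small sketch comparing a global token mean with a mean of per-sample means. The helper names and the toy loss values are assumptions chosen for illustration; the second reduction is also what results when each rank (here, each row) averages locally before the averages are combined.

```python
def token_mean(losses, masks):
    """Global mean over valid tokens: sum(m * l) / sum(m)."""
    num = sum(l * m for row_l, row_m in zip(losses, masks) for l, m in zip(row_l, row_m))
    den = sum(m for row_m in masks for m in row_m)
    return num / max(den, 1)


def sample_mean(losses, masks):
    """Mean of per-sample means: every sample counts equally, whatever its length."""
    per_sample = [
        sum(l * m for l, m in zip(row_l, row_m)) / max(sum(row_m), 1)
        for row_l, row_m in zip(losses, masks)
    ]
    return sum(per_sample) / len(per_sample)


# One long sample with four valid tokens, one short sample with a single valid token.
losses = [[2.0, 2.0, 2.0, 2.0], [4.0, 0.0, 0.0, 0.0]]
masks = [[1, 1, 1, 1], [1, 0, 0, 0]]

print(token_mean(losses, masks))   # (8 + 4) / 5
print(sample_mean(losses, masks))  # (2 + 4) / 2
```

The two numbers differ because the single token of the short sample carries one fifth of the weight under the token mean but one half under the per-sample mean.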

Contract: Collator Defines the Empirical Objective

For sequence models, input_ids, labels, loss masks, attention masks, position ids, and sample weights together define the empirical training objective. The model class only consumes tensors; the collator decides what event is being modeled.

Pitfall: Objective and Data Construction Cannot Be Separated

Saying “we train with cross entropy” is incomplete. Cross entropy over next tokens, masked tokens, class labels, or preference-derived targets creates different learned behavior.

Continuous vs. Discrete Modeling

Definition: Discrete Modeling

Discrete modeling handles variables whose values are drawn from a finite or countable set, such as token ids, class labels, graph edges, code symbols, actions, or masked categorical states.

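Image pixels are often treated either as continuous values or as discrete 8-bit values; text tokens are discrete categorical variables; edge existence in a graph is a discrete Bernoulli variable; RL actions are also often discrete choices.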

This affects how training is done. Continuous variables commonly use MSE, Gaussian likelihood, or score matching; discrete variables commonly use cross entropy, categorical likelihood, mask prediction, or contrastive objectives.

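The core difficulty of discrete modeling is that the target variable cannot be nudged continuously. A wrong token id is simply a different category; whether an edge exists is a Bernoulli outcome; a sampled action is usually not differentiable. The common remedy is to have the model output a distribution: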

\[ p_\theta(y=k\mid c) = \operatorname{softmax}_k(z_\theta(c)). \]

Training uses the log-likelihood, which supplies gradients to the logits of every category; inference then applies argmax, sampling, beam search, or constrained decoding.

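Training on discrete variables usually does not backpropagate through a sampled token; it backpropagates through the full categorical distribution. For a single example, let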

\[ p_k = \frac{\exp z_k}{\sum_j\exp z_j}, \qquad \ell = -\log p_y. \]

Then

\[ \frac{\partial \ell}{\partial z_k} = p_k-\mathbb{1}[k=y]. \]

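This gradient pushes down the incorrect categories and pushes up the correct one at the same time. It explains why next-token prediction, with only one target token per position, still adjusts the logit geometry of the whole vocabulary on every update. With a large vocabulary this gradient is expensive to compute, which is why engineering techniques such as sampled softmax, vocab-parallel CE, and chunked CE appear later.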

Derivation: Cross-Entropy Gradient for a Categorical Target

Let

\[ \ell = -z_y+\log\sum_j\exp z_j. \]

Differentiating with respect to any \(k\):

\[ \frac{\partial \ell}{\partial z_k} = -\mathbb{1}[k=y] + \frac{\exp z_k}{\sum_j\exp z_j} = p_k-\mathbb{1}[k=y]. \]

So the CE gradient does not act only on the correct category; it pulls the model distribution \(p\) toward the one-hot target.
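
The derivation above can be sanity-checked numerically. The sketch below compares the analytic gradient \(p_k-\mathbb{1}[k=y]\) with a central finite difference of the loss; the logits, target index, and step size are arbitrary illustrative choices.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def ce_loss(z, y):
    """l = -z_y + log sum_j exp(z_j), computed with the usual max-shift for stability."""
    m = max(z)
    return -z[y] + m + math.log(sum(math.exp(v - m) for v in z))


def ce_grad(z, y):
    """Analytic gradient: p_k - 1[k == y]."""
    p = softmax(z)
    return [p[k] - (1.0 if k == y else 0.0) for k in range(len(z))]


def numeric_grad(z, y, eps=1e-6):
    """Central finite difference of ce_loss with respect to each logit."""
    grads = []
    for k in range(len(z)):
        zp, zm = list(z), list(z)
        zp[k] += eps
        zm[k] -= eps
        grads.append((ce_loss(zp, y) - ce_loss(zm, y)) / (2 * eps))
    return grads


z, y = [2.0, -1.0, 0.5, 0.0], 2
analytic = ce_grad(z, y)
numeric = numeric_grad(z, y)
print(analytic)
print(numeric)
```

Only the entry for the target class is negative; every other logit receives a positive gradient proportional to its predicted probability, and the entries sum to zero.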

Definition: Teacher Signal

A teacher signal is the target information used to train a model. It may come from human labels, future tokens, masked tokens, another model, rewards, preferences, or environment feedback.

A practical way to classify training paradigms is by where the teacher signal comes from:

Teacher signal	Example	Typical objective
human label	image class	supervised CE
future token	language modeling	autoregressive CE
hidden part of input	masked token	denoising CE
paired data	image-text pair	contrastive
another model	distillation	KL / CE
human preference	chosen vs rejected	pairwise preference
reward/environment	RL task	policy gradient

Supervised Learning

The most basic paradigm is

\[ \min_\theta \mathbb{E}_{(x,y)} \ell(f_\theta(x),y). \]

Classification uses cross entropy:

\[ \ell = -\sum_{k=1}^{K}y_k\log p_\theta(k\mid x). \]

With a one-hot \(y\), this is exactly maximizing the log-likelihood of the correct discrete label.

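Supervised learning is the easiest paradigm to understand and also the easiest to misuse. It assumes that the training label \(y\) is the variable you actually want to predict. If the labels are noisy, ambiguous, or produced by a rule system, the model learns that labeling process, not necessarily the real-world truth.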

For single-label classification, the objective is

\[ \min_\theta \mathbb{E}_{(x,y)} [-\log p_\theta(y\mid x)]. \]

For multilabel classification, the objective usually becomes a sum of independent Bernoulli terms:

\[ \mathcal{L} = - \sum_{k} \left[ y_k\log p_{\theta,k} + (1-y_k)\log(1-p_{\theta,k}) \right]. \]

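Here the target construction changes the probabilistic assumption: single-label CE assumes mutually exclusive classes, while multilabel BCE assumes each label can occur independently.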

Pitfall: Multiclass and Multilabel Are Different Paradigms

Using softmax cross entropy for a multilabel task forces probability mass to compete across labels. Using independent BCE for a mutually exclusive task loses the simplex constraint.

Supervision Is Often a Distribution

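Labels in real projects are often not clean one-hot vectors. Weak labeling, multiple annotators, teacher models, and rule systems can each yield a target distribution \(q(y\mid x)\). The supervised objective is then more naturally written as: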

\[ \mathcal{L}_{\text{sup}} = \mathbb{E}_{x} \left[ H(q(\cdot\mid x),p_\theta(\cdot\mid x)) \right] = \mathbb{E}_{x} \left[ H(q)+ \operatorname{KL}(q\Vert p_\theta) \right]. \]

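If \(q\) is one-hot, this reduces to ordinary CE; if \(q\) is an annotator vote distribution, the model learns the uncertainty of the annotator population; if \(q\) comes from teacher logits, the model learns the teacher's dark knowledge along with its biases.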
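
The decomposition \(H(q,p_\theta)=H(q)+\operatorname{KL}(q\Vert p_\theta)\) is easy to verify on a toy case. The sketch below uses a hypothetical three-class example where the soft target comes from annotator votes; the vote counts and model probabilities are made up for illustration.

```python
import math


def cross_entropy(q, p):
    """H(q, p) = -sum_k q_k log p_k, skipping classes with q_k = 0."""
    return -sum(qk * math.log(pk) for qk, pk in zip(q, p) if qk > 0)


def entropy(q):
    return cross_entropy(q, q)


def kl(q, p):
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0)


votes = [3, 1, 0]                          # hypothetical annotator votes per class
q = [v / sum(votes) for v in votes]        # soft target distribution
p = [0.6, 0.3, 0.1]                        # hypothetical model prediction

print(cross_entropy(q, p), entropy(q) + kl(q, p))   # the two values agree
print(cross_entropy([1.0, 0.0, 0.0], p))            # one-hot target: plain -log p_y
```

Because \(H(q)\) does not depend on the model, minimizing the soft-target cross entropy is the same as minimizing the KL term.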

Definition: Label Semantics

Label semantics specify what the target represents: ground truth, annotator consensus, rule output, future token, hidden token, reward preference, or teacher belief. Changing label semantics changes the learned behavior even when the loss formula is unchanged.

A concrete example is medical multilabel classification. Suppose an image can show both pneumonia and edema; then the target is multi-hot:

labels = [0, 1, 1, 0, 0]

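Using softmax here forces the five conditions to share a single unit of probability mass: to raise the probability of pneumonia the model must lower that of edema, which contradicts the facts of the task. Conversely, for MNIST digit classification, independent BCE allows an image to be both 3 and 8 with high probability, discarding the mutual-exclusion structure.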
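
A quick numerical illustration of this conflict, as a sketch with made-up logits: when two labels both have strong evidence, softmax cannot give either of them a probability above one half, while independent sigmoids can.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


# Hypothetical logits for five findings; labels 1 and 2 are both strongly supported.
logits = [-3.0, 4.0, 4.0, -3.0, -3.0]

soft = softmax(logits)                 # competes for one unit of probability mass
sig = [sigmoid(v) for v in logits]     # one independent Bernoulli per label

print([round(v, 3) for v in soft])
print([round(v, 3) for v in sig])
```

With equal logits for the two positive labels, softmax splits the mass between them, so each stays below 0.5 no matter how large the logits grow.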

The engineering check for a supervised paradigm is not “is the loss going down” but:

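whether the label semantics match the probabilistic assumption;
whether a crude majority vote over multiple annotators hides their uncertainty;
whether class frequencies let the majority classes dominate the empirical risk;
whether label smoothing wrongly weakens rare but certain labels;
whether sample weights keep their total-weight semantics after distributed reduction.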

Autoregressive Modeling

An AR model writes the joint distribution with the chain rule:

\[ p_\theta(x_{1:T}) = \prod_{t=1}^{T}p_\theta(x_t\mid x_{<t}). \]

The training objective is next-token prediction:

\[ \mathcal{L}_{\text{AR}} = -\sum_{t=1}^{T}\log p_\theta(x_t\mid x_{<t}). \]

RNNs and GPT share this training objective at this level; the difference is that the context representation changes from a recurrent state to self-attention.

Going from the chain rule to batch tensors involves a shift convention that is easy to get wrong. Two common implementations are:

external shift:
input_ids = x[0:T-1]
labels    = x[1:T]

internal shift:
input_ids = x[0:T]
labels    = x[0:T]
model/loss internally compares logits[:, :-1] with labels[:, 1:]

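Either works, but they must not be stacked. If the dataset has already shifted the labels and the model forward shifts again internally, each position is supervised with the token two steps ahead instead of the next token; the loss may still go down, but the model is learning the wrong conditional event.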

Pitfall: Double Shift

Check whether label shifting happens in the dataset, collator, model forward, or loss function. A double shift can train smoothly while supervising the wrong token.
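
The double-shift failure can be reproduced on the token ids used earlier. The sketch below is an illustration with hypothetical helper names: `external_shift` mimics a dataset that pre-shifts labels, and `internal_shift_pairs` lists the (input token, supervised label) pairs a model would form when it compares logits[:-1] with labels[1:].

```python
def external_shift(x):
    """Dataset-side shift: inputs drop the last token, labels drop the first."""
    return x[:-1], x[1:]


def internal_shift_pairs(input_ids, labels):
    """Model-side shift: the logit at position t is compared with labels[t + 1]."""
    return list(zip(input_ids[:-1], labels[1:]))


x = [1, 42, 95, 317, 18, 42, 510, 2]     # <bos> the cat sat on the mat <eos>

correct = internal_shift_pairs(x, x)      # shift applied once
inp, lab = external_shift(x)
double = internal_shift_pairs(inp, lab)   # shift applied twice

print(correct[:3])   # each input token is paired with the next token
print(double[:3])    # each input token is paired with the token two steps ahead
```

Both variants produce tensors of plausible shape, which is why this bug usually has to be caught by inspecting pairs like these and not by watching the loss curve.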

Another detail of the AR objective is prefix length. A model with a finite context length \(L\) actually optimizes

\[ \prod_{t=1}^{T} p_\theta(x_t\mid x_{\max(1,t-L):t-1}), \]

rather than conditioning on the full history \(x_{<t}\). Long-context pretraining, sliding window attention, Transformer-XL recurrence, and retrieval augmentation can therefore all be viewed as changing the available context \(c_t\), and with it \(p_\theta(y\mid c)\).

Teacher Forcing and Exposure Bias

Autoregressive training usually uses teacher forcing: at step \(t\) the model is conditioned on the true prefix \(x_{<t}\), not on a prefix it sampled itself.

Training objective:

\[ \mathcal{L}_{\text{TF}} = - \sum_t \log p_\theta(x_t\mid x_{<t}^{\text{gold}}). \]

Inference procedure:

\[ \hat{x}_t \sim p_\theta(\cdot\mid \hat{x}_{<t}). \]

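The two conditioning distributions differ. During training the model always sees a clean history; during generation it sees a history that contains its own mistakes. This mismatch is called exposure bias.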

Definition: Exposure Bias

Exposure bias is the mismatch between teacher-forced training, where the model conditions on ground-truth prefixes, and autoregressive generation, where it conditions on its own sampled prefixes.

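Mitigations include scheduled sampling, sequence-level training, RL fine-tuning, minimum risk training, and giving the model feedback during post-training on states it generated itself. The SFT + preference/RL pipeline of modern LLMs can also be read as correcting the mismatch between pure next-token pretraining and the real interaction distribution.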

The idea of scheduled sampling is to occasionally feed the model's own prediction as the next input during training:

\[ c_t = \begin{cases} x_{<t}^{\text{gold}}, & \text{with probability } \alpha,\\ \hat{x}_{<t}, & \text{with probability } 1-\alpha. \end{cases} \]

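Intuitively this exposes the model to its own erroneous histories, but it also changes the training distribution: the target is still the gold token while the context may come from the model's sampled path. The objective is no longer pure maximum likelihood under the data distribution but a mixed objective that depends on the current policy. Whether it helps reliably in practice depends on the task, the sampling strategy, and the \(\alpha\) schedule.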

Teacher Forcing Is a Parallelization Contract

Teacher forcing is not only a modeling choice; it enables parallel training because all gold prefixes are known. Replacing gold prefixes with sampled prefixes makes the computation sequential or partially sequential.

Packing, Prompt Loss, and Completion Loss

In language model training, “cross entropy” alone does not describe the task. You must also state which tokens count toward the loss.

For instruction tuning, an example might be:

<user> Explain gradient clipping.
<assistant> Gradient clipping rescales ...

A common choice is to compute the loss only on the assistant completion:

\[ L = - \sum_{t\in \mathcal{A}} \log p_\theta(x_t\mid x_{<t}), \]

where \(\mathcal{A}\) is the set of assistant token positions. User prompt tokens participate as attention context but not as loss targets.

Pitfall: Prompt Tokens and Completion Tokens Have Different Roles

Prompt tokens are conditioning context; completion tokens are supervised targets. Accidentally training on prompt tokens can teach the model to imitate user text rather than answer it.

When packing multiple examples into one sequence, you must also decide whether cross-example attention is allowed. If it is allowed, then in

sample A <eos> sample B <eos>

sample B can read sample A. If it is not allowed, a block-diagonal attention mask is required. This is not purely an efficiency detail; it changes the conditional distribution.
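
A block-diagonal causal mask for packed sequences can be sketched in a few lines. The function name `packed_causal_mask` and the `doc_ids` representation (one document index per position) are assumptions for illustration; real training code would typically build the same structure as a tensor or pass sequence boundaries to an attention kernel.

```python
def packed_causal_mask(doc_ids):
    """mask[i][j] is True when position i may attend to position j:
    j must not lie in the future and must belong to the same document as i."""
    n = len(doc_ids)
    return [[j <= i and doc_ids[i] == doc_ids[j] for j in range(n)] for i in range(n)]


# Two packed samples: positions 0-2 belong to sample A, positions 3-4 to sample B.
doc_ids = [0, 0, 0, 1, 1]
mask = packed_causal_mask(doc_ids)

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Dropping the `doc_ids[i] == doc_ids[j]` condition yields the ordinary causal mask, in which sample B conditions on sample A.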

The padding side also changes the position contract. Right padding is common in training:

ids:  [BOS, a, b, EOS, PAD, PAD]
pos:  [0,   1, 2, 3,   0,   0  ]
mask: [1,   1, 1, 1,   0,   0  ]

Left padding is common in batched decoding, because right-aligning prompts of different lengths puts every prompt's last token in the same column:

ids:  [PAD, PAD, BOS, a, b, EOS]
pos:  [0,   0,   0,   1, 2, 3  ]
mask: [0,   0,   1,   1, 1, 1  ]

If position ids are simply arange(T), left padding gives the real tokens of the same sentence larger position ids; both RoPE and absolute position embeddings are affected. The usual fix is to build compact positions from the attention mask:

position_ids = attention_mask.long().cumsum(dim=-1) - 1
position_ids = position_ids.clamp_min(0)
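
The same cumulative-sum rule can be traced by hand in plain Python. This is a sketch with a hypothetical helper name; it shows that real tokens receive positions 0..3 under either padding side. Pad positions receive whatever value the clamp leaves behind, which is harmless only as long as those positions are masked out of attention and loss.

```python
from itertools import accumulate


def compact_position_ids(attention_mask):
    """Pure-Python analogue of attention_mask.cumsum(-1) - 1, clamped at 0."""
    return [max(c - 1, 0) for c in accumulate(attention_mask)]


right_padded = [1, 1, 1, 1, 0, 0]
left_padded = [0, 0, 1, 1, 1, 1]

pos_right = compact_position_ids(right_padded)
pos_left = compact_position_ids(left_padded)

print(pos_right)
print(pos_left)

# Positions of the real tokens only.
print([p for p, m in zip(pos_right, right_padded) if m])
print([p for p, m in zip(pos_left, left_padded) if m])
```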

Pitfall: Padding Is Not Just Cosmetic

Left padding and right padding can produce different position ids, cache offsets, and loss masks. For decoder-only models, padding policy must be consistent across training, evaluation, and generation.

Denoising and Masked Modeling

BERT, masked autoencoders, and masked diffusion language models can all be viewed as denoising:

\[ \tilde{x}\sim q(\tilde{x}\mid x), \qquad \max_\theta \log p_\theta(x_{\text{masked}}\mid \tilde{x}). \]

For discrete tokens, corruption can be [MASK] replacement, random replacement, deletion, or permutation. For continuous data, corruption is usually Gaussian noise.

Definition: Denoising Objective

A denoising objective corrupts clean data \(x\) into \(\tilde{x}\) and trains a model to reconstruct \(x\) or a target derived from \(x\): \[ \mathcal{L}_{\text{denoise}} = \mathbb{E}_{q(\tilde{x}\mid x)} [-\log p_\theta(x\mid \tilde{x})]. \]

A denoising paradigm is determined by its corruption process. For text,

\[ q(\tilde{x}\mid x) \]

can apply random masking, span masking, deletion, replacement, or shuffling. Different corruptions teach different abilities:

Corruption	Learned behavior
token mask	local lexical recovery
span mask	phrase/sentence infilling
deletion	robustness to missing content
permutation	order reconstruction
high mask ratio	global semantic planning

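For a masked diffusion LM, the mask ratio additionally corresponds to a noise level \(t\). Training does not fix a single mask ratio; the model learns to recover data across a range of noise strengths.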

More formally, discrete corruption can be written as a transition matrix:

\[ Q_t(\tilde{x}_i=a\mid x_i=b). \]

The simplest absorbing-mask corruption is:

\[ Q_t(\tilde{x}_i=[\mathrm{MASK}]\mid x_i=b)=\alpha_t, \qquad Q_t(\tilde{x}_i=b\mid x_i=b)=1-\alpha_t. \]

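This mirrors the noise strength in continuous diffusion: the larger \(t\) is, the higher \(\alpha_t\) and the less information remains visible. The training objective is often written as CE computed only on masked positions: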

\[ \mathcal{L} = \mathbb{E}_{t,x,\tilde{x}} \left[ \frac{1}{|M_t|} \sum_{i\in M_t} -\log p_\theta(x_i\mid \tilde{x},t) \right], \]

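where \(M_t=\{i:\tilde{x}_i=[\mathrm{MASK}]\}\). Without the division by \(|M_t|\), examples with a high mask ratio get more weight because they have more targets; with the division, per-example weights are more uniform. This is not an inconsequential reduction detail; it is a time-weighting policy.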
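
A minimal sketch of absorbing-mask corruption and the two reductions discussed above. `MASK_ID`, the helper names, and the constant per-position loss are illustrative assumptions; a real model would supply the per-position negative log-likelihoods.

```python
import random

MASK_ID = -1  # hypothetical sentinel id; a real tokenizer would reserve a vocabulary id


def absorbing_corrupt(token_ids, alpha_t, rng):
    """Replace each token with MASK_ID independently with probability alpha_t."""
    corrupted = [MASK_ID if rng.random() < alpha_t else tok for tok in token_ids]
    masked = [i for i, tok in enumerate(corrupted) if tok == MASK_ID]
    return corrupted, masked


def masked_nll(per_position_nll, masked, normalize):
    """Sum the per-position NLL over masked positions, optionally dividing by |M_t|."""
    total = sum(per_position_nll[i] for i in masked)
    return total / max(len(masked), 1) if normalize else total


rng = random.Random(0)
token_ids = [42, 95, 317, 18, 42, 510]
nll = [1.0] * len(token_ids)              # pretend every position costs 1 nat
for alpha_t in (0.2, 0.8):
    corrupted, masked = absorbing_corrupt(token_ids, alpha_t, rng)
    print(alpha_t, len(masked), masked_nll(nll, masked, False), masked_nll(nll, masked, True))
```

With a constant per-position cost, the unnormalized loss grows with the number of masked positions while the normalized loss does not, which is the time-weighting effect in its simplest form.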

BERT-Style Corruption Is More Than `[MASK]`

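BERT's classic masked LM does not replace every selected token with [MASK]. The common recipe selects 15% of tokens; of those, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged. In all three cases the target is the original token.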

clean:      the cat sat on the mat
corrupted:  the [MASK] sat on apple mat
labels:    -100 cat -100 -100 the -100

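Keeping some original tokens reduces pretrain-finetune mismatch: downstream fine-tuning and inference usually contain no [MASK] token. Random replacement forces the model not to treat [MASK] positions as the only ones that need prediction, but to learn that any position may have been corrupted.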
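
The 80/10/10 recipe can be sketched as follows. The function name, the use of integer token ids, and the uniform random replacement over `range(vocab_size)` are simplifying assumptions; production implementations usually also exclude special tokens from selection.

```python
import random

IGNORE = -100


def bert_corrupt(token_ids, vocab_size, mask_id, rng, select_prob=0.15):
    """Return (inputs, labels) following the 80/10/10 masked-LM recipe."""
    inputs, labels = list(token_ids), [IGNORE] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= select_prob:
            continue                                  # not selected: unchanged input, no loss
        labels[i] = tok                               # the target is always the clean token
        r = rng.random()
        if r < 0.8:
            inputs[i] = mask_id                       # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.randrange(vocab_size)     # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return inputs, labels


ids = [42, 95, 317, 18, 42, 510]
inputs, labels = bert_corrupt(ids, vocab_size=1000, mask_id=103, rng=random.Random(0), select_prob=0.5)
print(inputs)
print(labels)
```

The example raises `select_prob` to 0.5 only so that a six-token sequence is likely to show some corrupted positions.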

Span corruption further changes the target granularity. T5-style denoising replaces contiguous spans with sentinel tokens:

clean:   the cat sat on the mat
input:   the <extra_id_0> on the mat
target:  <extra_id_0> cat sat <extra_id_1>

Here the model is no longer doing only local classification; it is generating the missing span. The objective moves from token-level masked CE to conditional sequence modeling, close to encoder-decoder text infilling.
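
The sentinel construction can be sketched directly on word tokens. The helper `span_corrupt` and its `spans` argument (sorted, non-overlapping half-open index pairs) are assumptions for illustration; in real pipelines the spans are sampled at random.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel in the input and
    emit sentinel + hidden tokens in the target, closing with a final sentinel."""
    inputs, target = [], []
    cursor = 0
    for n, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{n}>"
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        target.append(sentinel)
        target.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    target.append(f"<extra_id_{len(spans)}>")
    return inputs, target


tokens = "the cat sat on the mat".split()
inputs, target = span_corrupt(tokens, spans=[(1, 3)])
print(" ".join(inputs))   # the <extra_id_0> on the mat
print(" ".join(target))   # <extra_id_0> cat sat <extra_id_1>
```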

Definition: Corruption Process

The corruption process is the stochastic rule that maps clean data to corrupted inputs and reconstruction targets. In denoising models, it is part of the model specification, not a preprocessing footnote.

Theorem: Denoising Learns Conditional Structure

If a model is trained to predict randomly hidden parts of data from visible parts, then the optimal predictor recovers the true conditional distribution of hidden variables given visible variables under the corruption process.

Proof

Let the corruption process produce a visible context \(c\) and a hidden target \(y\). The denoising CE is

\[ \mathbb{E}_{(c,y)} [-\log p_\theta(y\mid c)]. \]

For each fixed \(c\), this expectation is

\[ \mathbb{E}_{y\sim p_{\text{data}}(\cdot\mid c)} [-\log p_\theta(y\mid c)] = H(p_{\text{data}}(\cdot\mid c)) + \operatorname{KL} (p_{\text{data}}(\cdot\mid c)\Vert p_\theta(\cdot\mid c)). \]

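The first term does not depend on \(\theta\), and the KL term is nonnegative and zero only when the two distributions coincide, so the minimum is attained at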

\[ p_\theta(y\mid c)=p_{\text{data}}(y\mid c). \]

Contrastive Learning

Contrastive learning learns representations by comparing positive and negative examples. The InfoNCE loss is:

\[ \mathcal{L} = - \log \frac{\exp(\operatorname{sim}(q,k^+)/\tau)} {\exp(\operatorname{sim}(q,k^+)/\tau)+ \sum_{j}\exp(\operatorname{sim}(q,k_j^-)/\tau)}. \]

It is important in vision, graphs, text retrieval, and multimodal alignment. CLIP is the representative example of image-text contrastive learning.

Negative Sampling Is the Task

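The behavior of contrastive learning depends heavily on negative sampling. If negatives are too easy, the model learns only coarse features; if they are too hard and include false negatives, training penalizes examples that should be similar.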

With in-batch negatives and batch size \(B\), each query has \(B-1\) negatives:

\[ \mathcal{L}_i = - \log \frac{\exp(q_i^\top k_i/\tau)} {\sum_{j=1}^{B}\exp(q_i^\top k_j/\tau)}. \]

This effectively turns the batch into a \(B\)-way classification problem: the correct class for the \(i\)-th query is the \(i\)-th key.

Definition: False Negative

A false negative is a sample treated as a negative example even though it shares the same semantic class or should be considered compatible with the query.

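False negatives are common in multimodal retrieval: two different images may both match the same caption, and two different texts may have the same meaning. The training paradigm has to handle this fact about the data, not just write down the InfoNCE formula.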

The InfoNCE gradient also shows why temperature and batch negatives matter. Let

\[ s_{ij}=q_i^\top k_j/\tau, \qquad p_{ij} = \frac{\exp s_{ij}}{\sum_l\exp s_{il}}, \qquad \mathcal{L}_i=-\log p_{ii}. \]

Then

\[ \frac{\partial \mathcal{L}_i}{\partial s_{ij}} = p_{ij}-\mathbb{1}[j=i]. \]

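If a negative \(j\) is very similar to the query, \(p_{ij}\) is large and it receives a stronger repulsive gradient; if a negative is easy, \(p_{ij}\approx 0\) and it contributes almost no training signal. The smaller the temperature \(\tau\), the sharper the softmax and the more the gradient concentrates on hard negatives; a temperature that is too small also lets false negatives produce strong erroneous gradients.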

Derivation: InfoNCE as Cross Entropy over Batch Indices

For each query \(q_i\), treat all keys \(k_1,\dots,k_B\) as \(B\) classes, with correct class \(i\). The logits are \(s_{ij}=q_i^\top k_j/\tau\). InfoNCE is then

\[ \mathcal{L}_i = -\sum_j\mathbb{1}[j=i]\log p_{ij}. \]

so its logit gradient is exactly that of ordinary CE:

\[ \partial \mathcal{L}_i/\partial s_{ij} = p_{ij}-\mathbb{1}[j=i]. \]
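
The effect of temperature on how the gradient is distributed over negatives can be seen in a one-query sketch. The similarity values and the two temperatures are made-up illustrative numbers; index 0 plays the positive, index 1 a hard negative.

```python
import math


def info_nce_row(sims, pos_index, tau):
    """Loss and logit gradient for one query, given similarities to all keys."""
    logits = [s / tau for s in sims]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    loss = -math.log(probs[pos_index])
    grad = [p - (1.0 if j == pos_index else 0.0) for j, p in enumerate(probs)]
    return loss, grad


def hard_negative_share(grad, hard_index, pos_index):
    """Fraction of the total negative gradient mass that falls on one negative."""
    negative_mass = sum(g for j, g in enumerate(grad) if j != pos_index)
    return grad[hard_index] / negative_mass


sims = [0.9, 0.8, 0.1, -0.3]   # cosine similarities: positive, hard negative, two easy negatives
loss_warm, grad_warm = info_nce_row(sims, 0, tau=0.5)
loss_cold, grad_cold = info_nce_row(sims, 0, tau=0.05)

print(hard_negative_share(grad_warm, 1, 0))
print(hard_negative_share(grad_cold, 1, 0))
```

At the lower temperature almost all of the repulsive gradient lands on the hard negative, which is helpful when it is a true negative and harmful when it is a false one.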

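Distributed training also changes the negatives. If each GPU has batch size \(B_{\text{local}}\) and the world size is \(R\), then after all_gather each query has \(RB_{\text{local}}\) candidate keys. The effective task becomes harder, and both the loss value and the gradient scale change. The code must be explicit about three things: whether embeddings are normalized, whether the gather keeps gradients, and how target indices are offset by rank: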

q = F.normalize(q, dim=-1)
k = F.normalize(k, dim=-1)
k_all = all_gather_with_grad(k)  # or gather without grad depending on design
logits = q @ k_all.T / tau
target = rank * local_bsz + torch.arange(local_bsz, device=q.device)
loss = F.cross_entropy(logits, target)

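Whether all_gather keeps gradients is a paradigm choice. If each rank's queries update only the local encoder and the gathered negatives are stop-gradient, training still works but is not identical to the fully symmetric contrastive objective. CLIP-style image-text training usually also computes both directions, image-to-text and text-to-image, with \(S\) the \(B\times B\) matrix of logits \(s_{ij}\):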

\[ \mathcal{L} = \frac{1}{2} \left( \operatorname{CE}(S,\operatorname{arange}(B)) + \operatorname{CE}(S^\top,\operatorname{arange}(B)) \right). \]
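
The symmetric loss and the rank-offset targets can be sketched without any framework. The helper names are assumptions; `S` is a plain list-of-lists logit matrix whose diagonal holds the matched pairs.

```python
import math


def global_targets(rank, local_bsz):
    """Index of each local query's positive key inside the gathered key list."""
    return [rank * local_bsz + i for i in range(local_bsz)]


def row_cross_entropy(logits, target):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def symmetric_contrastive_loss(S):
    """0.5 * (CE over rows of S + CE over rows of S transposed), targets on the diagonal."""
    B = len(S)
    S_t = [list(col) for col in zip(*S)]
    row_dir = sum(row_cross_entropy(S[i], i) for i in range(B)) / B
    col_dir = sum(row_cross_entropy(S_t[i], i) for i in range(B)) / B
    return 0.5 * (row_dir + col_dir)


print(global_targets(rank=2, local_bsz=4))                    # [8, 9, 10, 11]
print(symmetric_contrastive_loss([[5.0, 0.0], [0.0, 5.0]]))   # matched pairs score high
print(symmetric_contrastive_loss([[0.0, 5.0], [5.0, 0.0]]))   # matched pairs score low
```

A smoke test along these lines catches the common mistake of using `arange(local_bsz)` as targets after a gather, which would only be correct on rank 0.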

Distillation

Distillation supervises a student with the outputs of a teacher model. If the teacher gives a distribution \(q_T(y\mid x)\) and the student gives \(p_\theta(y\mid x)\), a common choice is KL:

\[ \mathcal{L}_{\text{KD}} = T^2 \operatorname{KL} \left( q_T^{(T)}(\cdot\mid x) \Vert p_\theta^{(T)}(\cdot\mid x) \right), \]

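where the temperature-softened distribution is defined below (the superscript \((T)\) denotes the temperature, while the subscript in \(q_T\) marks the teacher):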

\[ p^{(T)}(y\mid x) = \operatorname{softmax}_y(z/T). \]

The higher the temperature, the softer the distribution, so the student sees the similarity structure among classes and not just the hard label.

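The leading \(T^2\) is not decorative. Let the student logits be \(z\) and the tempered distribution \(p^{(T)}=\operatorname{softmax}(z/T)\). The gradient of the KL with respect to the logits carries a factor of \(1/T\), and once the softmax is softened the difference \(p^{(T)}-q_T^{(T)}\) itself typically shrinks like \(1/T\), so the gradient magnitude falls roughly as \(1/T^2\). Multiplying by \(T^2\) keeps the gradient scale of the distillation loss comparable across temperatures.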
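
The scaling argument can be checked numerically. The sketch below uses the fact that, for fixed teacher probabilities, the gradient of the tempered KL with respect to the student logits is \((p^{(T)}-q_T^{(T)})/T\); the logits are arbitrary illustrative values, and the approximation is expected to hold only when \(T\) is large relative to the logit range.

```python
import math


def softmax(z, T=1.0):
    m = max(v / T for v in z)
    e = [math.exp(v / T - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def kd_grad_norm(student_logits, teacher_logits, T, scale_by_T2):
    """Norm of d/dz KL(q^(T) || p^(T)) with p^(T) = softmax(z / T).
    The gradient is (p^(T) - q^(T)) / T; optionally multiply by T^2."""
    p = softmax(student_logits, T)
    q = softmax(teacher_logits, T)
    factor = T * T if scale_by_T2 else 1.0
    grad = [factor * (pk - qk) / T for pk, qk in zip(p, q)]
    return math.sqrt(sum(g * g for g in grad))


student, teacher = [1.0, 0.0, -1.0], [2.0, 0.0, -2.0]
for T in (2.0, 4.0, 8.0, 16.0):
    print(T, kd_grad_norm(student, teacher, T, False), kd_grad_norm(student, teacher, T, True))
```

In this toy case, doubling \(T\) from 8 to 16 shrinks the unscaled gradient norm by roughly a factor of four, while the \(T^2\)-scaled norm stays nearly constant.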

Definition: Token-Level vs. Sequence-Level Distillation

Token-level distillation matches teacher distributions at each position. Sequence-level distillation trains on complete teacher-generated outputs. The former transfers local uncertainty; the latter transfers decoding behavior and style.

Definition: Knowledge Distillation

Knowledge distillation trains a student model to match a teacher model’s outputs, intermediate representations, preferences, or generated data, transferring behavior rather than only ground-truth labels.

LLM distillation takes many forms:

Signal	Example
logits	match teacher token distribution
final answer	SFT on teacher-generated responses
chain-of-thought	imitate reasoning traces
preference	teacher judges chosen/rejected
process feedback	teacher grades intermediate steps

Different distillation signals copy different capabilities, and also different biases.

LLM distillation in particular should distinguish three kinds of data:

Distillation data	What is copied	Main risk
teacher logits on human text	token uncertainty	expensive logits, tokenizer mismatch
teacher-generated answers	answer style and format	hallucination copied as label
teacher preference/verifier score	ranking behavior	judge bias and reward hacking

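If the teacher and student use different tokenizers, token-level KL does not even have a shared event space. A common compromise is sequence-level distillation: have the teacher generate complete answers and treat them as SFT data. This is simple to implement, but the student sees only the tokens the teacher finally chose, not the dark knowledge about which alternative tokens were also plausible.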

Adversarial Learning

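GANs already illustrated adversarial learning: the training signal comes from another model. It suits settings where an exact likelihood cannot be written down but an evaluator can be trained.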

A unified form of adversarial learning is:

\[ \min_{\theta} \max_{\phi} \mathbb{E}_{x\sim p_{\text{data}}}A_\phi(x) + \mathbb{E}_{z\sim p(z)}B_\phi(G_\theta(z)). \]

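Here the evaluator (discriminator or critic) provides the training signal. The risk is that the evaluator itself gets exploited: the generator learns to fool the current evaluator, which is not necessarily the same as achieving real quality.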

This is the same structural problem as LLM reward hacking: once an evaluator becomes the optimization target, the model can find loopholes in it.

Preference and Policy Optimization

A common chain in LLM post-training is:

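SFT: supervised fine-tuning on human or synthetic demonstrations;
preference modeling: learn a reward or preference model;
DPO/IPO/ORPO: optimize the policy directly on preference pairs;
PPO/GRPO/RL: online optimization with sampling, rewards, and KL control;
OPD (on-policy distillation): the student receives teacher supervision on the states it visits itself.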

Pitfall: More Advanced Training Is Not Always Better

If SFT solves the behavior, do not rush to RL. If offline preference optimization is enough, do not rush to online PPO-style training. More interactive objectives bring more variance, reward hacking risk, and infrastructure cost.

Preference Data as Constructed Supervision

Preference data are usually triples:

\[ (x,y^+,y^-), \]

where \(y^+\) is preferred over \(y^-\). The Bradley-Terry model is written as:

\[ P(y^+\succ y^-\mid x) = \sigma(r_\phi(x,y^+)-r_\phi(x,y^-)). \]

reward model loss:

\[ \mathcal{L}_{\text{RM}} = - \log \sigma(r_\phi(x,y^+)-r_\phi(x,y^-)). \]

DPO does not train an explicit reward model; it builds the preference objective from the log-probability ratios of the policy and a reference policy:

\[ \mathcal{L}_{\text{DPO}} = - \log\sigma \left( \beta \left[ \log\frac{\pi_\theta(y^+\mid x)}{\pi_{\text{ref}}(y^+\mid x)} - \log\frac{\pi_\theta(y^-\mid x)}{\pi_{\text{ref}}(y^-\mid x)} \right] \right). \]

This shows that the target of preference optimization is not “the correct answer token” but “the relatively better complete response”.

In implementation, DPO first aggregates response-token log-probs into a sequence log-prob:

\[ \log\pi_\theta(y\mid x) = \sum_{t\in \mathcal{R}(y)} \log\pi_\theta(y_t\mid x,y_{<t}), \]

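where \(\mathcal{R}(y)\) contains only response tokens, not prompt tokens. If chosen and rejected responses differ greatly in length, the raw sum introduces length bias; switching to the mean log-prob changes the probabilistic meaning of the original DPO event. The rigorous approach is to record response lengths, audit the chosen/rejected length distributions, and state explicitly that the objective has changed whenever length normalization is used.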
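
The two steps, masking to response tokens and then forming the DPO margin, can be sketched as below. The per-token log-probs are made-up numbers standing in for model outputs, and the helper names are assumptions for illustration.

```python
import math


def sequence_logprob(token_logprobs, response_mask):
    """Sum per-token log-probs over response positions only (mask 1 = response token)."""
    return sum(lp for lp, m in zip(token_logprobs, response_mask) if m)


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen, ref_chosen, policy_rejected, ref_rejected, beta):
    """-log sigmoid(beta * [(log-ratio of chosen) - (log-ratio of rejected)])."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return softplus(-margin)   # equals -log(sigmoid(margin))


# Hypothetical per-token log-probs; the first two positions are prompt tokens.
mask = [0, 0, 1, 1, 1]
pol_c = sequence_logprob([-0.1, -0.2, -0.5, -0.4, -0.3], mask)
ref_c = sequence_logprob([-0.1, -0.2, -0.9, -0.8, -0.7], mask)
pol_r = sequence_logprob([-0.1, -0.2, -1.5, -1.4, -1.3], mask)
ref_r = sequence_logprob([-0.1, -0.2, -1.0, -1.0, -1.0], mask)

print(dpo_loss(pol_c, ref_c, pol_r, ref_r, beta=0.1))
```

When the policy equals the reference the margin is zero and the loss is \(\log 2\); the loss falls as the policy raises the chosen response relative to the rejected one.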

A preference record should store at least:

prompt
chosen
rejected
generator_policy
sampling_temperature
sampling_top_p
judge_source
template_version
tokenizer_version

Without these fields, it is hard to tell later whether DPO is correcting the model's preferences or fitting the sampling bias of some old generator.
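
One way to enforce the field list is a small typed record plus a completeness check at ingestion time. This is a sketch: the class name `PreferenceRecord`, the field types, and the helper `missing_fields` are assumptions, while the field names follow the list above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PreferenceRecord:
    prompt: str
    chosen: str
    rejected: str
    generator_policy: str
    sampling_temperature: float
    sampling_top_p: float
    judge_source: str
    template_version: str
    tokenizer_version: str


def missing_fields(raw):
    """Names of required fields that are absent, None, or empty strings in a raw dict."""
    required = [f.name for f in fields(PreferenceRecord)]
    return sorted(name for name in required if raw.get(name) is None or raw.get(name) == "")


raw = {"prompt": "Explain gradient clipping.", "chosen": "A", "rejected": "B"}
print(missing_fields(raw))   # provenance fields that still need to be filled in
```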

Pitfall: Preference Learning Depends on Candidate Generation

Preference data are not only labels; they depend on how candidate responses were sampled. Changing the generator changes the distribution of comparisons.

On-Policy vs. Offline Training

Training paradigms can also be divided into offline and on-policy according to whether the data come from the current model.

Paradigm	Data source	Risk
SFT	fixed demonstrations	distribution mismatch
DPO/IPO	fixed preference pairs	stale negative samples
PPO/GRPO	current policy samples	high variance, reward hacking
Rejection sampling	model samples filtered by reward	mode narrowing
OPD-like distillation	student states + teacher feedback	teacher quality and coverage

Offline training is stable, cheap, and reproducible; on-policy training stays closer to the states the model actually visits, at higher cost and variance.

Definition: On-Policy Data

On-policy data are generated by the current model or policy being optimized. This makes training distribution track the model’s actual behavior, but introduces sampling cost and instability.

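The distinction is very practical in LLM post-training: SFT data usually come from humans or a strong model; DPO's rejected samples often come from some older policy; RL samples come from the current policy. They optimize over different state distributions.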

Curriculum and Multi-Stage Training

Modern large models are not trained in one pass but in multiple stages:

pretraining -> continued pretraining -> SFT -> preference optimization -> evaluation/repair

Each stage changes the data distribution and the objective:

Stage	Objective	What changes
pretraining	next-token CE	broad world/modeling ability
continued pretraining	next-token CE on domain data	domain knowledge/style
SFT	completion CE	instruction-following format
preference optimization	pairwise or RL objective	helpfulness/safety/style
repair	targeted SFT/evals	specific failure modes

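If the stage order or configuration is wrong, the consequences are visible: too large an SFT learning rate damages the base model; preference data that are too narrow make the model over-optimize style; continued pretraining without a subsequent SFT pass can hurt instruction following.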

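Multi-stage training also often uses mixtures. Suppose there are \(K\) data sources, where source \(k\) has distribution \(p_k(x)\) and sampling weight \(\alpha_k\); then pretraining actually optimizes: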

\[ \mathcal{L}(\theta) = \sum_{k=1}^{K} \alpha_k \mathbb{E}_{x\sim p_k} [\ell_\theta(x)]. \]

Changing \(\alpha_k\) changes the objective itself, not merely “which batch of data is used”. The common temperature sampling scheme turns corpus sizes \(n_k\) into:

\[ \alpha_k = \frac{n_k^\gamma}{\sum_j n_j^\gamma}. \]

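With \(\gamma=1\), sampling is proportional to token count and large corpora dominate training; with \(\gamma<1\), small corpora are upsampled. This can strengthen low-resource languages or domain data, but it also raises the risk of repeated sampling and overfitting.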
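
The effect of \(\gamma\) is easy to see on a two-corpus example. The corpus names and sizes below are invented for illustration, and `mixture_weights` is a hypothetical helper implementing the formula above.

```python
def mixture_weights(sizes, gamma):
    """alpha_k = n_k**gamma / sum_j n_j**gamma for a dict of corpus sizes."""
    powered = {name: n ** gamma for name, n in sizes.items()}
    total = sum(powered.values())
    return {name: v / total for name, v in powered.items()}


sizes = {"high_resource": 1_000_000, "low_resource": 10_000}   # token counts (hypothetical)

for gamma in (1.0, 0.5, 0.0):
    print(gamma, mixture_weights(sizes, gamma))
```

With \(\gamma=1\) the small corpus gets about 1% of the samples, with \(\gamma=0.5\) about 9%, and with \(\gamma=0\) the two corpora are sampled equally, which means the small corpus is repeated far more often per epoch of the large one.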

Pitfall: Sampling Weights Are Loss Weights

In expectation, changing dataset sampling probabilities changes the weighted training objective. Mixture weights must be versioned like hyperparameters, not treated as dataloader trivia.

Implementation: Dataset Is Part of the Objective

In code, a training paradigm usually lives in the dataset/collator, not in the model class:

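import random


def build_ar_example(tokens, block_size):
    # External shift: labels are the inputs moved one step to the left.
    # Requires len(tokens) >= block_size + 1 so that x and y have equal length.
    x = tokens[:block_size]
    y = tokens[1:block_size + 1]
    return {"input_ids": x, "labels": y}


def build_masked_example(tokens, mask_id, p):
    # Mask each position independently with probability p. Only masked
    # positions carry a label; all others are ignored via -100.
    input_ids = list(tokens)
    labels = [-100] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < p:
            input_ids[i] = mask_id
            labels[i] = tok
    return {"input_ids": input_ids, "labels": labels}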

The same Transformer may receive fields that are all named input_ids and labels, but the label mask, attention mask, position ids, and sample weights have already defined different training problems.

A collator closer to real training code returns every contract tensor explicitly:

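import torch


def pad(seqs, pad_value):
    # Right-pad every sequence to the batch maximum; also return a 0/1 validity mask.
    width = max(len(s) for s in seqs)
    data = torch.tensor([s + [pad_value] * (width - len(s)) for s in seqs])
    mask = torch.tensor([[1] * len(s) + [0] * (width - len(s)) for s in seqs])
    return data, mask


def collate_completion(batch, tokenizer, max_len):
    # Assumes a Hugging Face-style tokenizer whose chat template returns a list of token ids.
    input_ids = []
    labels = []
    loss_mask = []

    for ex in batch:
        # Prompt = all turns before the final assistant answer, plus the assistant header.
        prompt = tokenizer.apply_chat_template(
            ex["messages"][:-1], tokenize=True, add_generation_prompt=True
        )
        # No extra special tokens inside the answer; EOS is appended explicitly.
        answer = tokenizer.encode(
            ex["messages"][-1]["content"], add_special_tokens=False
        ) + [tokenizer.eos_token_id]
        ids = (prompt + answer)[:max_len]

        # Labels stay aligned with input_ids (no shift here); prompt positions are ignored.
        y = [-100] * len(ids)
        start = min(len(prompt), len(ids))
        for i in range(start, len(ids)):
            y[i] = ids[i]

        m = [int(v != -100) for v in y]
        input_ids.append(ids)
        labels.append(y)
        loss_mask.append(m)

    input_ids, attention_mask = pad(input_ids, tokenizer.pad_token_id)
    labels, _ = pad(labels, -100)
    loss_mask, _ = pad(loss_mask, 0)
    # Compact positions derived from the mask, valid for either padding side.
    position_ids = attention_mask.long().cumsum(dim=-1).sub(1).clamp_min(0)
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "position_ids": position_ids,
        "labels": labels,
        "loss_mask": loss_mask,
    }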

Every field in this function has a meaning:

Tensor	Meaning
`input_ids`	conditioning sequence seen by the model
`attention_mask`	which tokens can be read as real context
`position_ids`	positional coordinates of real tokens
`labels`	target ids or `-100` ignore positions
`loss_mask`	auditable view of effective supervised positions

The training step should also keep the denominator explicit instead of relying on the framework's default mean:

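import torch.nn.functional as F
# Pass only model inputs; labels and loss_mask stay outside so the reduction is explicit.
logits = model(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],
    position_ids=batch["position_ids"],
).logits
batch_size, _, vocab_size = logits.shape
targets = batch["labels"][:, 1:]
losses = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    targets.reshape(-1),
    ignore_index=-100,
    reduction="none",
).view(batch_size, -1)
mask = targets.ne(-100)
loss = (losses * mask).sum() / mask.sum().clamp_min(1)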

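If the model forward already shifts internally, this code should not apply [:, :-1] and [:, 1:] again. A good engineering practice is a small golden test for the collator and model: hand-build 5 tokens and confirm, position by position, which logit is matched with which label.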

Before training, check:

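the effective target count of every batch;
whether ignore_index ignores only the positions that should be ignored;
whether the prompt/completion loss mask is correct;
whether positive/negative pairs leak;
whether preference pairs come from the intended policy;
whether sampling temperature/top-p are recorded;
whether the data proportions of each curriculum stage are reproducible.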

Pitfall: A Silent Collator Bug Can Change the Paradigm

If a collator shifts labels incorrectly, fails to mask prompt tokens, leaks future tokens, or samples wrong negatives, the model may train smoothly while learning the wrong task.

Minimal Smoke Tests

Tests for a training paradigm should be more specific than “one batch runs”:

Paradigm	Smoke test
AR LM	one-token target manually matches shifted label
SFT	prompt labels are all `-100`, assistant labels are not
packed LM	later document cannot attend to earlier document unless intended
masked LM	unmasked tokens have `-100` labels; masked positions target clean ids
contrastive	diagonal pair is target after distributed gather
distillation	teacher/student vocab and temperature match
preference	chosen/rejected use identical prompt/template and response mask
curriculum	sampled mixture proportions match configured weights

A tiny check can save a lot of training time:

def assert_has_targets(batch):
    n = batch["labels"].ne(-100).sum().item()
    if n == 0:
        raise ValueError("batch has no supervised target tokens")


def assert_prompt_masked(labels, prompt_len):
    if labels[:prompt_len].ne(-100).any():
        raise ValueError("prompt tokens contribute to completion loss")

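These tests look plain, but they verify that the objective is implemented correctly. The most insidious errors in deep learning systems are often not shape errors but training a task different from the one you thought you were training.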

A Useful Map

Data type	Common objective	Representative model
image pixels	denoising / score matching	DDPM, flow matching
text tokens	autoregressive CE	GPT
masked tokens	denoising CE	BERT, LLaDA-style MDM
graph edges/nodes	message passing + CE/link loss	GNN
paired modalities	contrastive loss	CLIP
human preferences	pairwise preference loss	DPO/RLHF
generated samples	adversarial loss	GAN

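This table matters more than a list of neural network architectures: the same Transformer can be trained with AR, masked denoising, contrastive, SFT, DPO, RL, and other paradigms. Architecture determines the expressible function family; the training paradigm determines what behavior the model learns.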