3.4 Regularization

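The goal of regularization is not to drive the training loss lower, but to make the model learn a more robust function. It changes the solution that optimization finds by constraining parameters, perturbing the training process, limiting effective capacity, or injecting priors.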

Bias-Variance and Capacity

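A model that is too small has high bias; a model that is too large may have high variance. The modern deep learning picture is subtler: overparameterized models do not necessarily generalize poorly, and what matters is whether the training dynamics and implicit regularization lead the model to a good solution.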

Definition: Regularization

Regularization is any mechanism that biases learning toward solutions expected to generalize better, by modifying the objective, the model class, the data distribution, or the optimization process.

Writing the empirical risk as

\[ \hat R(\theta) = \frac1n\sum_{i=1}^{n}\ell(f_\theta(x_i),y_i), \]

regularization typically has four entry points:

entry point	mathematical form	example
objective	\(\hat R(\theta)+\lambda\Omega(\theta)\)	L2, L1, KL penalty
data distribution	\(\mathbb{E}_{\tilde{x},\tilde{y}}\ell(f_\theta(\tilde{x}),\tilde{y})\)	augmentation, mixup
model computation	stochastic or constrained forward pass	dropout, stochastic depth
optimization path	restrict which solution training reaches	early stopping, SGD noise

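These four categories often overlap. For example, LoRA is both a constraint on the model parameterization and a form of fine-tuning regularization; weight decay in AdamW sits on the boundary between objective bias and optimizer implementation; dropout is a stochastic computation and also approximates an ensemble.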

Definition: Explicit and Implicit Regularization

Explicit regularization changes the training objective or computation directly, such as adding \(\lambda\Omega(\theta)\) or applying dropout. Implicit regularization comes from the optimizer, architecture, data order, or stopping rule even when the written loss is unchanged.

Weight Decay

L2 regularization:

\[ \min_\theta \hat{R}(\theta)+\frac{\lambda}{2}\|\theta\|_2^2. \]

Under plain SGD (no momentum or adaptive scaling), this is equivalent to

\[ \theta_{t+1} = (1-\eta\lambda)\theta_t -\eta\nabla \hat{R}(\theta_t). \]

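This shows that weight decay is not only a penalty: at every step it also shrinks the parameters toward the origin by a factor of \(1-\eta\lambda\).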

Proof

The objective with an L2 penalty is

\[ J(\theta)=\hat{R}(\theta)+\frac{\lambda}{2}\|\theta\|_2^2. \]

Its gradient is

\[ \nabla J(\theta) = \nabla\hat{R}(\theta)+\lambda\theta. \]

The SGD update is

\[ \theta_{t+1} = \theta_t-\eta(\nabla\hat{R}(\theta_t)+\lambda\theta_t) = (1-\eta\lambda)\theta_t-\eta\nabla\hat{R}(\theta_t). \]
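The algebra above can be checked numerically. The following is a minimal sketch in plain Python; the function names and the toy parameter and gradient values are illustrative assumptions, not part of the derivation.

```python
def sgd_step_with_l2_gradient(theta, grad, lr, lam):
    """SGD on J = R + (lam/2)*||theta||^2: the penalty gradient lam*theta is added to grad."""
    return [w - lr * (g + lam * w) for w, g in zip(theta, grad)]


def sgd_step_with_shrinkage(theta, grad, lr, lam):
    """Equivalent form: shrink the weights by (1 - lr*lam), then take the plain gradient step."""
    return [(1.0 - lr * lam) * w - lr * g for w, g in zip(theta, grad)]


theta = [1.5, -0.3, 0.0]   # toy parameters (assumed values)
grad = [0.2, -1.0, 0.5]    # toy gradient of the unregularized risk (assumed values)

a = sgd_step_with_l2_gradient(theta, grad, lr=0.1, lam=0.01)
b = sgd_step_with_shrinkage(theta, grad, lr=0.1, lam=0.01)

# Largest coordinate-wise difference between the two forms; only rounding error remains.
max_gap = max(abs(x - y) for x, y in zip(a, b))
```

The two updates agree up to floating-point rounding, which is the content of the proof.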

In AdamW, weight decay is usually decoupled:

\[ \theta_{t+1} = (1-\eta\lambda)\theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}. \]

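This differs from adding \(\lambda\theta\) to the gradient that Adam sees, because in that case the penalty term is rescaled per coordinate by the adaptive denominator. Intuitively, decoupled weight decay applies a uniform shrinkage to the parameters themselves rather than mixing the regularizer into each coordinate's adaptive gradient statistics.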
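To make the difference concrete, here is a minimal sketch of a single first Adam step in plain Python, comparing a coupled L2 term with decoupled decay. The function name, the toy weights, and the choice of a zero loss gradient (to isolate the regularizer) are assumptions for illustration; this is not the implementation of any particular library.

```python
import math


def adam_first_step(theta, grad, lr=1e-2, eps=1e-8, l2=0.0, decoupled_wd=0.0):
    """First Adam step (t = 1, zero initial moments) on a list of floats.

    At t = 1 the bias-corrected moments reduce to m_hat = g and v_hat = g*g,
    so the beta coefficients cancel and are omitted here.
    """
    new_theta = []
    for w, g in zip(theta, grad):
        g = g + l2 * w                      # coupled L2: penalty enters the adaptive statistics
        m_hat, v_hat = g, g * g
        w = (1.0 - lr * decoupled_wd) * w   # decoupled decay: uniform multiplicative shrink
        new_theta.append(w - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_theta


theta = [1.0, 0.01]   # one large and one small weight (toy values)
grad = [0.0, 0.0]     # zero loss gradient isolates the effect of the regularizer

coupled = adam_first_step(theta, grad, l2=0.1)
decoupled = adam_first_step(theta, grad, decoupled_wd=0.1)

coupled_moves = [w0 - w1 for w0, w1 in zip(theta, coupled)]       # absolute change per weight
decoupled_ratios = [w1 / w0 for w0, w1 in zip(theta, decoupled)]  # shrink factor per weight
```

In this toy setting the coupled variant moves both weights by roughly `lr` regardless of their size, because the adaptive denominator normalizes the penalty gradient, while the decoupled variant shrinks both weights by the same factor `1 - lr * decoupled_wd`.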

Pitfall: Do Not Decay Every Parameter Blindly

In Transformer training, biases and LayerNorm scale parameters are often excluded from weight decay. Decaying normalization parameters can harm stability because they control activation scale rather than represent ordinary feature weights.

Bayesian View of L2

The L2 penalty can also be understood through MAP estimation. Suppose the likelihood of the supervised data is

\[ p(\mathcal{D}\mid \theta) = \prod_{i=1}^{n}p(y_i\mid x_i,\theta), \]

and place an isotropic Gaussian prior on the parameters:

\[ p(\theta) \propto \exp\left(-\frac{1}{2\sigma^2}\|\theta\|_2^2\right). \]

Maximizing the posterior is equivalent to minimizing the negative log posterior:

\[ -\log p(\theta\mid \mathcal{D}) = -\sum_i\log p(y_i\mid x_i,\theta) + \frac{1}{2\sigma^2}\|\theta\|_2^2 + \text{const}. \]

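So \(\lambda\) can be read as the prior strength: it is proportional to \(1/\sigma^2\). A large \(\lambda\) expresses a strong belief that parameters should be near 0; a small \(\lambda\) lets the data determine the parameters more freely.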
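As a short worked step, assume the per-example loss is the negative log-likelihood, \(\ell(f_\theta(x_i),y_i)=-\log p(y_i\mid x_i,\theta)\) (an assumption made here to connect the two notations). Dividing the negative log posterior by \(n\) gives

\[ \frac{1}{n}\left(-\log p(\theta\mid\mathcal{D})\right) = \hat R(\theta)+\frac{1}{2n\sigma^2}\|\theta\|_2^2+\text{const}. \]

Matching this with \(\hat R(\theta)+\frac{\lambda}{2}\|\theta\|_2^2\) yields \(\lambda=1/(n\sigma^2)\). If the loss is summed rather than averaged over examples, the same argument gives \(\lambda=1/\sigma^2\).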

Definition: MAP Estimation

Maximum a posteriori estimation chooses \[ \theta_{\text{MAP}}=\arg\max_\theta p(\theta\mid\mathcal{D}) = \arg\max_\theta p(\mathcal{D}\mid\theta)p(\theta). \] Regularizers can often be interpreted as negative log priors.

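This view is useful, but it should not be applied mechanically to every deep learning setting. Neural networks have scale symmetries: in a ReLU network, for example, multiplying one layer by \(c>0\) and dividing the next by \(c\) can represent the same function, so the parameter norm is not a perfect measure of function complexity. Weight decay is therefore a very practical bias, but not a complete explanation of why a model generalizes well or poorly.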

L1, Sparsity, and Proximal Updates

L1 regularization is written as

\[ \min_\theta \hat R(\theta)+\lambda\|\theta\|_1. \]

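It tends to produce sparse parameters. Intuitively, the L2 gradient is \(\lambda\theta\), so the shrinkage weakens as a parameter gets smaller; the L1 subgradient has constant magnitude \(\lambda\) away from zero, so a parameter near 0 is still pushed toward 0 with the same force.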

Definition: Subgradient of L1

For one coordinate, \[ \partial |w|= \begin{cases} \{1\},&w>0,\\ [-1,1],&w=0,\\ \{-1\},&w<0. \end{cases} \]

In proximal gradient descent, first take a gradient step on the smooth loss:

\[ u=w-\eta\nabla \hat R(w), \]

then solve

\[ w^+ = \arg\min_v \frac{1}{2\eta}(v-u)^2+\lambda |v|. \]

The solution is soft-thresholding:

\[ w^+ = \operatorname{sign}(u)\max(|u|-\eta\lambda,0). \]

Proof

For \(v>0\), setting the derivative of the objective to zero gives

\[ \frac{1}{\eta}(v-u)+\lambda=0, \]

so \(v=u-\eta\lambda\), which requires \(u>\eta\lambda\). For \(v<0\),

\[ \frac{1}{\eta}(v-u)-\lambda=0, \]

so \(v=u+\eta\lambda\), which requires \(u<-\eta\lambda\). If \(|u|\le\eta\lambda\), the optimum lies at the non-differentiable point \(v=0\), because \(0\in \frac{1}{\eta}(0-u)+\lambda[-1,1]\).
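The closed form can be compared against a brute-force minimization of the one-dimensional proximal objective. This is a minimal sketch; the helper names, the grid range, and the toy values of `eta` and `lam` are assumptions chosen for illustration.

```python
def soft_threshold(u, eta, lam):
    """Closed-form minimizer of (v - u)^2 / (2*eta) + lam*|v| over v."""
    shrink = eta * lam
    if u > shrink:
        return u - shrink
    if u < -shrink:
        return u + shrink
    return 0.0


def prox_objective(v, u, eta, lam):
    """The one-dimensional proximal objective being minimized."""
    return (v - u) ** 2 / (2.0 * eta) + lam * abs(v)


def brute_force_minimizer(u, eta, lam, lo=-3.0, hi=3.0, steps=6000):
    """Grid search over [lo, hi]; accurate to about one grid spacing (0.001 here)."""
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return min(grid, key=lambda v: prox_objective(v, u, eta, lam))


eta, lam = 0.5, 0.8   # toy values, so the threshold eta*lam is 0.4

# Each entry: (input u, closed-form answer, grid-search answer).
checks = [(u, soft_threshold(u, eta, lam), brute_force_minimizer(u, eta, lam))
          for u in (-2.0, -0.4, -0.1, 0.0, 0.3, 1.0)]
```

Inputs with \(|u|\le\eta\lambda\) map exactly to zero, which is where the sparsity comes from; larger inputs are shifted toward zero by \(\eta\lambda\).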

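In deep networks, L1 is rarely applied directly to all weights, because sparse parameters do not necessarily yield real inference speedups, and unstructured sparsity needs specialized kernels. But the L1/proximal idea matters in feature selection, adapter pruning, gating, and structured sparsity.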

Dropout

During training, dropout randomly masks hidden activations:

\[ \tilde{h} = \frac{m\odot h}{1-p}, \qquad m_i\sim \operatorname{Bernoulli}(1-p). \]

The scaling factor \(1/(1-p)\) keeps the expectation unchanged:

\[ \mathbb{E}[\tilde{h}]=h. \]

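Dropout can be viewed as an approximate ensemble: each forward pass during training samples a subnetwork to train, and the full network used at test time approximates their average.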

Dropout Variance

Although inverted dropout keeps the expectation unchanged, it increases variance. For a single activation:

\[ \tilde{h}=\frac{m h}{1-p}, \qquad m\sim\operatorname{Bernoulli}(1-p). \]

Then

\[ \mathbb{E}[\tilde{h}]=h. \]

The second moment is (using \(m^2=m\) for a binary mask)

\[ \mathbb{E}[\tilde{h}^2] = \frac{h^2}{(1-p)^2}\mathbb{E}[m] = \frac{h^2}{1-p}. \]

So

\[ \operatorname{Var}(\tilde{h}) = \frac{h^2}{1-p}-h^2 = \frac{p}{1-p}h^2. \]
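A quick Monte Carlo check of the mean and variance formulas, as a minimal sketch in plain Python. The function name, seed, and the toy values of `h`, `p`, and the sample count are assumptions for illustration.

```python
import random


def inverted_dropout(h, p, rng):
    """Keep h with probability 1 - p and rescale by 1/(1 - p); otherwise output 0."""
    return h / (1.0 - p) if rng.random() >= p else 0.0


rng = random.Random(0)         # fixed seed so the run is reproducible
h, p, n = 2.0, 0.25, 200_000   # toy activation, drop rate, and sample count

samples = [inverted_dropout(h, p, rng) for _ in range(n)]
mean = sum(samples) / n
var = sum((s - mean) ** 2 for s in samples) / n

predicted_var = p / (1.0 - p) * h ** 2   # variance formula derived above
```

With these values the predicted variance is \(4/3\), and the empirical mean and variance should land close to \(h\) and `predicted_var` respectively.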

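Dropout injects multiplicative noise. It regularizes, but it also makes the training signal noisier. In modern LLM pretraining, dropout is often very small or 0; a common explanation is that large-scale data already acts as a strong regularizer. Dropout is more common when fine-tuning on small datasets.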

Pitfall: Dropout Changes Train/Eval Semantics

Forgetting model.eval() during validation keeps dropout active and makes metrics noisy. Forgetting model.train() after validation disables dropout in training.

Dropout as Noise Regularization

This is easier to see for a linear model. Let the prediction be

\[ \hat y=w^\top \tilde{x}, \qquad \tilde{x}_j=\frac{m_j x_j}{1-p}. \]

The expected squared loss under dropout is

\[ \mathbb{E}_m[(y-w^\top\tilde{x})^2]. \]

Because \(\mathbb{E}[\tilde{x}]=x\), the cross term vanishes in expectation, and the extra term comes from the noise variance:

\[ \mathbb{E}_m[(y-w^\top\tilde{x})^2] = (y-w^\top x)^2 + \sum_j w_j^2\operatorname{Var}(\tilde{x}_j). \]

where

\[ \operatorname{Var}(\tilde{x}_j)=\frac{p}{1-p}x_j^2. \]

So for this linear model with squared loss, dropout is equivalent in expectation to adding a data-dependent L2 penalty:

\[ \frac{p}{1-p}\sum_j w_j^2x_j^2. \]

Proof

Let \(\epsilon=\tilde{x}-x\), so that \(\mathbb{E}[\epsilon]=0\) and

\[ y-w^\top\tilde{x} = y-w^\top x-w^\top\epsilon. \]

Square and take the expectation:

\[ \mathbb{E}[(a-w^\top\epsilon)^2] = a^2 -2a\,\mathbb{E}[w^\top\epsilon] + \mathbb{E}[(w^\top\epsilon)^2], \]

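where \(a=y-w^\top x\). The middle term vanishes because \(\mathbb{E}[\epsilon]=0\), and since the masks are independent across coordinates, the last term is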

\[ \sum_jw_j^2\operatorname{Var}(\tilde{x}_j). \]
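For a small number of features the identity can be verified exactly by enumerating every dropout mask. This is a minimal sketch in plain Python; the function names and the toy values of `w`, `x`, `y`, and `p` are assumptions for illustration.

```python
from itertools import product


def expected_dropout_loss(w, x, y, p):
    """Exact expectation of (y - w.x_tilde)^2 over all 2^d independent keep/drop masks."""
    total = 0.0
    for mask in product([0, 1], repeat=len(x)):
        prob = 1.0
        pred = 0.0
        for m, wj, xj in zip(mask, w, x):
            prob *= (1.0 - p) if m == 1 else p   # m = 1 means the feature is kept
            pred += wj * m * xj / (1.0 - p)      # inverted dropout on the input feature
        total += prob * (y - pred) ** 2
    return total


def clean_loss_plus_penalty(w, x, y, p):
    """Squared loss on the clean input plus the data-dependent L2 term."""
    clean = (y - sum(wj * xj for wj, xj in zip(w, x))) ** 2
    penalty = p / (1.0 - p) * sum((wj * xj) ** 2 for wj, xj in zip(w, x))
    return clean + penalty


w, x, y, p = [0.5, -1.2, 2.0], [1.0, 0.3, -0.7], 0.4, 0.2   # toy values
lhs = expected_dropout_loss(w, x, y, p)
rhs = clean_loss_plus_penalty(w, x, y, p)
```

The two quantities agree up to rounding, so in this linear squared-loss setting the penalty form is exact rather than approximate.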

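This shows that dropout is not a magic ensemble but a penalty on relying too heavily on individual features. The larger the activation and the higher the dropout rate, the stronger the noise. For the same reason, dropout applied to residual branches, attention probabilities, or embeddings has a different practical effect in each case.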

Common variants:

variant	where noise is applied	typical reason
activation dropout	hidden activations	reduce co-adaptation
attention dropout	attention probabilities	prevent brittle token links
embedding dropout	token embeddings	robust lexical features
stochastic depth	residual blocks	regularize very deep nets
variational dropout	same mask across time	stable RNN regularization

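In an RNN, resampling the dropout mask at every time step injects strong noise along the time dimension; variational dropout typically fixes the mask within a sequence so that the model sees a consistent subnetwork.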

Data Augmentation

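For images, augmentation includes crop, flip, color jitter, mixup, and cutmix. For text, augmentation is subtler, because changing tokens can change the meaning. For speech and multimodal data, common perturbations include time masking, frequency masking, and resolution jitter.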

Definition: Vicinal Risk Minimization

Instead of training only on empirical samples \((x_i,y_i)\), vicinal risk minimization trains on samples drawn from neighborhoods around training examples: \[ \min_\theta \frac{1}{n}\sum_i \mathbb{E}_{\tilde{x},\tilde{y}\sim \nu(\cdot\mid x_i,y_i)} \ell(f_\theta(\tilde{x}),\tilde{y}). \]

Mixup is a typical example:

\[ \tilde{x}=\lambda x_i+(1-\lambda)x_j, \qquad \tilde{y}=\lambda y_i+(1-\lambda)y_j. \]

where

\[ \lambda\sim\operatorname{Beta}(\alpha,\alpha). \]

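\(\alpha\) controls the mixing strength. When \(\alpha\) is small (below 1), \(\lambda\) is usually close to 0 or 1 and the mixed sample resembles one of the originals; when \(\alpha\) is larger, samples more often fall between the two classes and the regularization is stronger.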

The mixup loss is a soft-label cross entropy, which is linear in the target and therefore splits into two terms:

\[ \ell(f(\tilde{x}),\tilde{y}) = \lambda\ell(f(\tilde{x}),y_i) + (1-\lambda)\ell(f(\tilde{x}),y_j). \]
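The following minimal sketch mixes one pair of examples and checks that the soft-label cross entropy equals the two-term form. It uses plain Python lists; the function names, the toy inputs, and the fixed probability vector standing in for the model output are assumptions for illustration.

```python
import math
import random


def mixup_pair(x_i, y_i, x_j, y_j, alpha, rng):
    """Mix two examples; y_i and y_j are one-hot lists, x_i and x_j are feature lists."""
    lam = rng.betavariate(alpha, alpha)
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix, lam


def cross_entropy(target, probs):
    """Cross entropy between a (possibly soft) target and predicted probabilities."""
    return -sum(t * math.log(q) for t, q in zip(target, probs) if t > 0.0)


rng = random.Random(0)   # fixed seed for reproducibility
x_mix, y_mix, lam = mixup_pair([1.0, 0.0], [1, 0, 0], [0.0, 2.0], [0, 0, 1],
                               alpha=0.4, rng=rng)

probs = [0.7, 0.2, 0.1]   # stand-in for the model's softmax output on x_mix
soft_label_loss = cross_entropy(y_mix, probs)
two_term_loss = (lam * cross_entropy([1, 0, 0], probs)
                 + (1.0 - lam) * cross_entropy([0, 0, 1], probs))
```

In practice this means a mixup batch can reuse an ordinary hard-label cross entropy twice and combine the results with weights \(\lambda\) and \(1-\lambda\).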

CutMix instead replaces a region of the input rather than interpolating the whole input linearly:

\[ \tilde{x}=M\odot x_i+(1-M)\odot x_j, \]

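where \(M\) is a binary region mask. The labels are mixed in proportion to area, with \(\rho\) the fraction of the input taken from \(x_i\):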

\[ \tilde{y}=\rho y_i+(1-\rho)y_j. \]

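For images, CutMix preserves local texture better than mixup. For text, mixing tokens directly tends to break syntax or semantics, so span corruption, back-translation, prompt paraphrase, or data filtering are more common.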

Augmentation as Invariance

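In essence, augmentation tells the model that certain transformations should not change the label. If \(g\) is a label-preserving transformation, such as a horizontal flip of an image, an ideal model satisfies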

\[ f_\theta(gx)\approx f_\theta(x). \]

Training on augmented samples amounts to putting a distribution over transformations into the empirical risk:

\[ \hat{R}_{\text{aug}}(\theta) = \frac1n \sum_i \mathbb{E}_{g\sim\mathcal{G}} \ell(f_\theta(gx_i),y_i). \]

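This is more meaningful than simply enlarging the dataset: it writes a prior symmetry into the training distribution. A wrong augmentation writes in a wrong prior. For example, a left-right flip can change the meaning of a medical image, and synonym replacement can break fine-grained labels in text.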

Pitfall: Augmentation Must Preserve the Label

An augmentation is regularization only if it preserves the target semantics. Otherwise it injects label noise.

If the model should be locally smooth, and not merely correct on a finite set of augmented samples, one can add consistency regularization:

\[ \mathcal{L}_{\text{cons}} = \mathbb{E}_{x,g_1,g_2} D\left( p_\theta(\cdot\mid g_1x), p_\theta(\cdot\mid g_2x) \right), \]

where \(D\) can be a KL divergence or a mean squared distance. Semi-supervised methods in the family of FixMatch and Mean Teacher essentially turn augmentation invariance into an explicit loss.

Label Smoothing

For \(K\)-class classification, the one-hot label \(y\) is replaced by

\[ y_k^{\text{smooth}} = (1-\epsilon)y_k+\frac{\epsilon}{K}. \]

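This prevents the model from pushing all probability mass onto a single class and reduces overconfidence. Label smoothing used to be common for language models, but modern LLM pretraining more often uses plain cross entropy; one common rationale is that the token distribution is already large and long-tailed.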

Let the logits be \(z\) and the softmax output be \(p\). The cross entropy is

\[ L=-\sum_k y_k^{\text{smooth}}\log p_k. \]

Its gradient with respect to the logits is

\[ \frac{\partial L}{\partial z_k} = p_k-y_k^{\text{smooth}}. \]

With a one-hot target, the gradient on the true class is \(p_y-1\); with label smoothing it becomes

\[ p_y-\left(1-\epsilon+\frac{\epsilon}{K}\right). \]

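The true class is no longer pushed toward probability 1: the gradient vanishes at \(p=y^{\text{smooth}}\), so the optimal prediction keeps a small amount of entropy.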

Definition: Confidence Penalty

A confidence penalty regularizes a classifier by discouraging low-entropy output distributions. Label smoothing is one practical way to impose such a bias through softened targets.

Label smoothing can also be written as a mixture with the uniform distribution:

\[ y^{\text{smooth}} = (1-\epsilon)y+\epsilon u, \qquad u_k=\frac1K. \]

The cross entropy therefore decomposes as

\[ H(y^{\text{smooth}},p) = (1-\epsilon)H(y,p)+\epsilon H(u,p). \]

The second term penalizes a large cross entropy against the uniform distribution, which keeps the output distribution from becoming too peaked.

The same term can be read as a KL divergence:

\[ H(u,p) = H(u)+\operatorname{KL}(u\Vert p). \]

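Because \(H(u)\) is a constant, label smoothing is equivalent, up to that constant and the \((1-\epsilon)\) rescaling of the original loss, to adding \(\epsilon\operatorname{KL}(u\Vert p)\) to the loss. Note the direction \(\operatorname{KL}(u\Vert p)\): it strongly penalizes any class probability that approaches 0.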
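Both identities can be checked on a small example. This is a minimal sketch in plain Python; the helper names and the toy logits are assumptions for illustration.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def cross_entropy(target, probs):
    """Cross entropy between a (possibly soft) target and predicted probabilities."""
    return -sum(t * math.log(q) for t, q in zip(target, probs) if t > 0.0)


def smooth(one_hot, eps):
    """Label smoothing: (1 - eps) * y_k + eps / K."""
    k = len(one_hot)
    return [(1.0 - eps) * t + eps / k for t in one_hot]


z = [2.0, -1.0, 0.5, 0.0]   # toy logits
y = [1.0, 0.0, 0.0, 0.0]    # one-hot target
eps, k = 0.1, len(z)
p, u, y_s = softmax(z), [1.0 / k] * k, smooth(y, eps)

# Decomposition H(y_smooth, p) = (1 - eps) * H(y, p) + eps * H(u, p)
lhs = cross_entropy(y_s, p)
rhs = (1.0 - eps) * cross_entropy(y, p) + eps * cross_entropy(u, p)

# KL view: H(u, p) = H(u) + KL(u || p), with H(u) = log K
entropy_u = math.log(k)
kl_u_p = sum(ui * math.log(ui / pi) for ui, pi in zip(u, p))

# Gradient with respect to the logits: p_k - y_k^smooth
grad = [pi - ti for pi, ti in zip(p, y_s)]
```

The gradient entries sum to zero because both \(p\) and \(y^{\text{smooth}}\) are probability vectors, and they all vanish only when \(p=y^{\text{smooth}}\).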

Pitfall: Label Smoothing Can Hurt Distillation Signals

If labels already contain informative soft probabilities, extra smoothing can erase dark knowledge. In distillation or preference data, check whether the target distribution is already calibrated before smoothing again.

In an implementation, pay attention to padding:

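import torch.nn.functional as F

# Assumed shapes: logits is [batch, seq, vocab], targets is [batch, seq] (integer token ids).
# cross_entropy expects the class dimension second, hence the transpose to [batch, vocab, seq].
loss = F.cross_entropy(
    logits.transpose(1, 2),
    targets,
    ignore_index=pad_id,
    label_smoothing=0.1,
)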

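Confirm that tokens matching ignore_index are excluded from the smoothed loss and from the averaging denominator; otherwise PAD positions silently contribute a uniform target. This is easy to get wrong across frameworks and in custom cross-entropy implementations.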
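One way to test a library or custom loss is to compare it against a small framework-independent reference. The following is a minimal sketch in plain Python of a label-smoothed cross entropy that skips PAD positions in both the sum and the count. The function name, the padding id, and the toy log-probabilities are assumptions; it does not describe how any particular library implements the loss.

```python
import math


def smoothed_ce_ignore_pad(log_probs, targets, pad_id, eps):
    """Mean label-smoothed cross entropy over non-PAD positions only.

    log_probs: list of per-position log-probability lists (length K each).
    targets:   list of integer class ids, with pad_id marking padding.
    """
    total, count = 0.0, 0
    for lp, t in zip(log_probs, targets):
        if t == pad_id:
            continue                 # PAD adds nothing to the numerator or the denominator
        k = len(lp)
        nll = -lp[t]                 # one-hot part of the target
        uniform = -sum(lp) / k       # cross entropy against the uniform distribution
        total += (1.0 - eps) * nll + eps * uniform
        count += 1
    return total / max(count, 1)


PAD = 0   # hypothetical padding id
row = [math.log(q) for q in (0.1, 0.6, 0.3)]

with_pad = smoothed_ce_ignore_pad([row, row, row], [1, 2, PAD], PAD, eps=0.1)
without_pad = smoothed_ce_ignore_pad([row, row], [1, 2], PAD, eps=0.1)
```

Appending padded positions should leave the loss unchanged, which is the property to assert against the implementation under test.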

Early Stopping

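Early stopping halts training based on validation loss. It can be viewed as regularization on training time: the longer training runs, the more training-set detail the model can fit; stopping before validation performance starts to degrade effectively prevents the model from using its full capacity.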

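For gradient descent on a linear model, early stopping has an effect similar to L2 regularization. Along a Hessian eigen-direction with eigenvalue \(\lambda_i\) (an eigenvalue here, not the penalty strength \(\lambda\)), starting from zero initialization and assuming \(0<\eta\lambda_i<1\), the GD iterates act as a filter that gradually approaches the least-squares solution: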

\[ \theta_{t,i} = \left(1-(1-\eta\lambda_i)^t\right)\theta_i^\star. \]

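Directions with small eigenvalues converge slowly, so early stopping suppresses them; these directions often correspond to high-variance, unstable fits.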

Proof Sketch

For a quadratic objective, each eigen-direction evolves independently:

\[ \theta_{t+1,i} = \theta_{t,i} -\eta\lambda_i(\theta_{t,i}-\theta_i^\star). \]

Let the error be \(e_{t,i}=\theta_{t,i}-\theta_i^\star\). Then

\[ e_{t+1,i}=(1-\eta\lambda_i)e_{t,i}. \]

If \(\theta_{0,i}=0\), then \(e_{0,i}=-\theta_i^\star\), so

\[ \theta_{t,i} = \theta_i^\star+e_{t,i} = \left(1-(1-\eta\lambda_i)^t\right)\theta_i^\star. \]

Ridge regression with penalty \(\alpha\) applies the filter factor \(\lambda_i/(\lambda_i+\alpha)\):

\[ \theta_i^{\text{ridge}} = \frac{\lambda_i}{\lambda_i+\alpha}\theta_i^\star. \]

The filter factor of early stopping is

\[ 1-(1-\eta\lambda_i)^t. \]

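Both retain large-\(\lambda_i\) directions and suppress small-\(\lambda_i\) directions, so they produce similar regularization effects. They are not exactly equivalent, however: the strength of early stopping depends on the optimizer, learning rate, batch noise, and validation protocol.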
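The two filter factors can be tabulated side by side. This is a minimal sketch in plain Python; the function names, the eigenvalues, and especially the matching \(\alpha\approx 1/(\eta t)\) are assumptions of this example. That matching is only a small-\(\eta\lambda_i t\) heuristic (where both factors are close to \(\eta\lambda_i t\)), not an exact equivalence.

```python
def early_stopping_factor(eig, lr, steps):
    """Fraction of the least-squares solution recovered by GD from zero init after `steps` steps."""
    return 1.0 - (1.0 - lr * eig) ** steps


def ridge_factor(eig, alpha):
    """Fraction of the least-squares solution kept by ridge regression with penalty alpha."""
    return eig / (eig + alpha)


lr, steps = 0.1, 50
alpha = 1.0 / (lr * steps)   # heuristic matching alpha ~ 1/(lr*steps); an assumption of this sketch
eigs = [0.001, 0.01, 0.1, 1.0, 5.0]   # toy eigenvalues, all satisfying lr*eig < 1

# Each row: (eigenvalue, early-stopping factor, ridge factor).
table = [(e, early_stopping_factor(e, lr, steps), ridge_factor(e, alpha)) for e in eigs]
```

Both columns increase with the eigenvalue and nearly coincide for the smallest ones; for large eigenvalues early stopping approaches 1 faster than ridge does, which is one concrete way the two differ.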

A practical early stopping setup should define at least:

choice	why it matters
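monitored metric	loss, accuracy, FID, and reward may not move in sync
patience	avoids stopping too early because of validation noise
min delta	ignores small fluctuations within statistical noise
checkpoint rule	save the best-validation checkpoint, not the last step
validation frequency	too frequent wastes compute; too sparse misses the overfitting point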

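# Assumed state before the training loop:
#   best_loss = float("inf"); bad_steps = 0; stop_training = False
# save_checkpoint is a user-supplied helper, not a library function.

# Run after each validation pass:
if val_loss < best_loss - min_delta:
    best_loss = val_loss
    bad_steps = 0
    save_checkpoint(model, optimizer, scheduler)  # keep the best-validation state
else:
    bad_steps += 1

if bad_steps >= patience:
    stop_training = True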
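The same logic can be packaged as a small self-contained helper and exercised on a made-up validation curve. This is a minimal sketch; the class name `EarlyStopper`, its attributes, and the loss values are hypothetical, and a real training loop would save a checkpoint where the comment indicates.

```python
class EarlyStopper:
    """Tracks the best validation loss and signals when patience is exhausted."""

    def __init__(self, patience, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_step = None
        self.bad_steps = 0

    def update(self, step, val_loss):
        """Return True if training should stop after this validation result."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_step = step   # a real loop would save a checkpoint here
            self.bad_steps = 0
        else:
            self.bad_steps += 1
        return self.bad_steps >= self.patience


val_losses = [1.00, 0.80, 0.70, 0.69, 0.71, 0.72, 0.75, 0.60]   # made-up validation curve
stopper = EarlyStopper(patience=3, min_delta=0.02)

stopped_at = None
for step, loss in enumerate(val_losses):
    if stopper.update(step, loss):
        stopped_at = step
        break
```

On this curve the improvement from 0.70 to 0.69 is smaller than `min_delta`, so it counts as a bad step, and training stops at step 5 before the later drop to 0.60 is ever seen. That is the trade-off patience and min delta control.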

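For large-model training, early stopping is usually not the main mechanism, because pretraining typically runs to a fixed token budget. It remains critical in small-data fine-tuning, reward models, classifier heads, and adapter tuning.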

Implicit Regularization

SGD, batch size, learning rate, and architecture all induce implicit bias. Even without an explicit penalty, training tends toward particular solutions.

Mechanism	Regularization effect
SGD noise	favors flatter basins under some regimes
Weight sharing	reduces degrees of freedom
Convolution	encodes translation equivariance
Attention mask	encodes autoregressive factorization
Low-rank adapters	restrict update subspace

In modern large models, regularization is usually not a single trick but a systematic bias formed jointly by data scale, architecture, optimizer, LR schedule, and post-training.

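A classic example is logistic regression on linearly separable data. Even without an explicit L2 penalty, the parameter norm under gradient descent keeps growing, while the direction converges toward the max-margin solution.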

Theorem: Gradient Descent Bias Toward Max-Margin Direction

For separable linear classification with exponential-type losses, gradient descent can drive parameter norms to infinity while the normalized direction converges toward a max-margin separator under suitable conditions.

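The full proof is omitted here, but the intuition is as follows: once the data are classified correctly, the loss is dominated by the samples with the smallest margin. As optimization keeps lowering the loss, the direction is driven by these support-like samples and the minimum margin gradually widens. This explains why a training loss that keeps approaching 0 does not necessarily mean generalization is getting worse.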
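The effect can be observed on a tiny example. This is a minimal sketch in plain Python, assuming a toy two-dimensional dataset with all labels \(+1\) and no bias term; for these points the hard-margin solution is \(w=(1,0)\), with the first two points as support vectors. The data, step size, and step counts are illustrative assumptions.

```python
import math

# Toy separable data with all labels +1 and no bias term (assumed for illustration).
# The hard-margin solution is w = (1, 0): the first two points are the support vectors.
points = [(1.0, 0.5), (1.0, -0.5), (3.0, 2.0)]


def logistic_grad(w):
    """Gradient of sum_i log(1 + exp(-w.x_i)) with respect to w."""
    g0 = g1 = 0.0
    for x0, x1 in points:
        margin = w[0] * x0 + w[1] * x1
        s = 1.0 / (1.0 + math.exp(margin))   # sigmoid(-margin)
        g0 -= s * x0
        g1 -= s * x1
    return g0, g1


def run_gd(steps, lr=0.1):
    """Plain gradient descent from zero initialization, with no explicit penalty."""
    w = [0.0, 0.0]
    for _ in range(steps):
        g0, g1 = logistic_grad(w)
        w = [w[0] - lr * g0, w[1] - lr * g1]
    return w


def norm_and_cosine(w):
    """Norm of w and its cosine similarity with the max-margin direction (1, 0)."""
    n = math.hypot(w[0], w[1])
    return n, w[0] / n


early_norm, early_cos = norm_and_cosine(run_gd(100))
late_norm, late_cos = norm_and_cosine(run_gd(5000))
```

Between the two checkpoints the norm keeps growing while the direction moves closer to \((1,0)\). The third point initially tilts the weights, but its influence fades as its margin becomes large. Convergence of the direction is slow, so the alignment improves gradually rather than abruptly.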

It is also a reminder that regularization cannot be judged only by whether the objective has a penalty; one also has to look at which solution the optimizer finds.

Regularization in Fine-Tuning

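In fine-tuning, the goal of regularization is usually not to prevent the overfitting seen when training from scratch, but to avoid destroying the pretrained distribution. Common mechanisms: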

Mechanism	What it protects
small LR	pretrained weights
weight decay only on matrix weights	representation scale
dropout or data augmentation	small task generalization
LoRA rank	update subspace
KL penalty	behavior close to reference model
early stopping	overfitting small instruction set

In preference optimization, the KL-regularized objective to maximize is commonly written as

\[ \mathbb{E}_{x,y} \left[ r_\phi(x,y) - \beta \operatorname{KL} (\pi_\theta(\cdot\mid x)\Vert\pi_{\text{ref}}(\cdot\mid x)) \right]. \]

Here the regularization does not come from the parameter norm but from keeping the policy distribution close to the reference model.

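The KL can be written as an expectation over responses sampled from the policy; since \(\log\pi(y\mid x)=\sum_t\log\pi(y_t\mid x,y_{<t})\), it decomposes into per-token terms: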

\[ \operatorname{KL} (\pi_\theta(\cdot\mid x)\Vert\pi_{\text{ref}}(\cdot\mid x)) = \mathbb{E}_{y\sim\pi_\theta} \left[ \log\pi_\theta(y\mid x)-\log\pi_{\text{ref}}(y\mid x) \right]. \]

In RLHF/PPO, this term is typically estimated from the sampled response and applied as a reward penalty:

\[ r_{\text{total}}(x,y) = r_\phi(x,y) - \beta\left( \log\pi_\theta(y\mid x)-\log\pi_{\text{ref}}(y\mid x) \right). \]

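It penalizes how much more the current policy favors this output than the reference does. If \(\beta\) is too large, the model barely moves; if \(\beta\) is too small, the model is prone to reward hacking or drift in language quality.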
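A minimal sketch of the sequence-level penalty in plain Python follows. The function name and the per-token log-probabilities are made up for illustration; in a real pipeline they would come from scoring the sampled response under the policy and the reference model.

```python
def kl_penalized_reward(reward, policy_token_logps, ref_token_logps, beta):
    """Sequence reward minus beta times the sampled log-ratio log pi_theta(y|x) - log pi_ref(y|x).

    The sequence log-probability is the sum of per-token log-probabilities.
    """
    log_ratio = sum(policy_token_logps) - sum(ref_token_logps)
    return reward - beta * log_ratio, log_ratio


# Made-up per-token log-probabilities for one sampled response of three tokens.
policy_logps = [-0.2, -0.5, -0.1]
ref_logps = [-0.9, -0.6, -0.4]

total, log_ratio = kl_penalized_reward(reward=1.0,
                                       policy_token_logps=policy_logps,
                                       ref_token_logps=ref_logps,
                                       beta=0.1)
```

Here the policy assigns the response a higher log-probability than the reference does (a log-ratio of about 1.1), so the reward of 1.0 is reduced to about 0.89. Some implementations apply the penalty per token instead of once per sequence; this sketch follows the sequence-level form written above.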

Fine-tuning also commonly uses several structural forms of regularization:

method	regularization mechanism	failure if too strong
freeze backbone	only head/adapters move	underfit task
LoRA low rank	update lies in low-rank subspace	insufficient capacity
small LR	stay near pretrained basin	slow/no adaptation
KL to reference	preserve behavior distribution	refuses task shift
replay data	prevent forgetting	extra data/compute cost

Practical Parameter Groups

Transformer training commonly uses two parameter groups:

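import torch

# model: an existing torch.nn.Module, assumed to be defined elsewhere.
decay = []
no_decay = []

# Name-based heuristic: biases and any parameter whose name contains "norm" skip weight decay.
for name, p in model.named_parameters():
    if not p.requires_grad:
        continue  # frozen parameters are not passed to the optimizer
    if name.endswith(".bias") or "norm" in name.lower():
        no_decay.append(p)
    else:
        decay.append(p)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.1},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=3e-4,
)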

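The principle behind this code: matrix weights carry most of the function complexity and are suitable for decay; biases and norm scales act more like calibration quantities and are usually not decayed.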

A more robust version checks both parameter dimensionality and module name, and verifies that no parameter is missed:

decay = set()
no_decay = set()

for name, p in model.named_parameters():
    if not p.requires_grad:
        continue
    if p.ndim < 2 or name.endswith(".bias") or "norm" in name.lower():
        no_decay.add(name)
    else:
        decay.add(name)

param_dict = {name: p for name, p in model.named_parameters() if p.requires_grad}
assert not (decay & no_decay)
assert decay | no_decay == set(param_dict)

optimizer = torch.optim.AdamW(
    [
        {"params": [param_dict[n] for n in sorted(decay)], "weight_decay": 0.1},
        {"params": [param_dict[n] for n in sorted(no_decay)], "weight_decay": 0.0},
    ],
    lr=3e-4,
)

Choosing Regularization Strength

More regularization is not always better. When tuning, treat each method as a knob:

knob	too small	too large
weight decay	overfit, large norms	underfit, slow adaptation
dropout	co-adaptation	noisy/unstable training
augmentation	memorization	label corruption
label smoothing	overconfidence	under-confident predictions
KL penalty	behavior drift	no task learning
early stopping patience	overfit	stop before convergence

A practical workflow:

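Start with an unregularized or weakly regularized baseline;
Observe the train/validation gap, calibration, sample quality, or reward drift;
Add only one strong regularizer at a time, and confirm that it improves the targeted problem;
Record the effective batch size, LR schedule, weight decay, dropout, and augmentation seed;
For fine-tuning, compare full fine-tuning, a frozen backbone, LoRA, and KL/replay strategies separately.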

Implementation Checklist

When implementing regularization, check each of the following:

AdamW uses decoupled weight decay;
biases, LayerNorm/RMSNorm, and embeddings are decayed (or not) as designed;
dropout is active only in training mode;
model.eval() and model.train() are switched correctly before and after validation;
the augmentation is truly label-preserving;
the target is a soft label after mixup/cutmix;
label smoothing ignores PAD/ignore_index;
early stopping saves the best checkpoint;
fine-tuning monitors forgetting or KL drift;
the LoRA/adapter rank and target modules match the task capacity;
the recorded regularization strengths are sufficient to reproduce the experiment;
regularization improves validation/generalization, and does not merely slow down training.

Two smoke tests:

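import torch

torch.manual_seed(0)
x = torch.ones(1024)  # nonzero input; with 1024 elements two random masks almost surely differ

# 1. dropout should be stochastic in train mode and deterministic in eval mode
drop = torch.nn.Dropout(p=0.5)
drop.train()
y1 = drop(x)
y2 = drop(x)
assert not torch.allclose(y1, y2)

drop.eval()
y3 = drop(x)
y4 = drop(x)
assert torch.allclose(y3, y4)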

# 2. parameter groups should partition trainable parameters
grouped = {id(p) for g in optimizer.param_groups for p in g["params"]}
trainable = {id(p) for p in model.parameters() if p.requires_grad}
assert grouped == trainable

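These checks are small, but they catch the most common regularization illusions: believing dropout is enabled when the eval/train mode is wrong, or believing weight decay is set when the parameter groups miss parameters or decay parameters that should not be decayed.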