2.9 Loss Functions and Objectives

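A loss function is not just another argument in the training code. It defines which probabilistic assumption the model learns under, which errors count as more severe, how gradients flow back to the logits, how samples are weighted, and whether the training objective agrees with the evaluation metric.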

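The same network architecture can learn completely different behavior under a different objective. A Transformer, for example, can be trained as an autoregressive LM, a masked LM, a contrastive retriever, a classifier, a ranker, or with preference optimization; the difference lies mainly in how the loss is constructed, not in the block.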

Empirical Risk

Definition: Empirical Risk

Given examples \(\{(x_i,y_i)\}_{i=1}^{n}\) and loss \(\ell\), empirical risk minimization solves \[ \min_\theta \hat{R}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\ell(f_\theta(x_i),y_i). \]

An implementation also has to choose a reduction:

loss = criterion(logits, labels)        # mean by default for many PyTorch losses
loss = per_token_loss.mean()            # average over valid tokens
loss = per_sample_loss.sum() / n_tokens # token-normalized

Whether mean averages over samples, tokens, non-pad tokens, or batch elements directly changes the effective learning rate.

Pitfall: Reduction Changes the Effective Learning Rate

If one implementation averages over sequences and another averages over tokens, the same nominal learning rate can produce different gradient magnitudes.

Empirical risk is a sample approximation of the population risk. The quantity we actually want to minimize is

\[ R(\theta) = \mathbb{E}_{(x,y)\sim p_{\text{data}}} \ell(f_\theta(x),y), \]

but only a finite training set is available, so we use

\[ \hat R(\theta) = \frac1n\sum_i \ell(f_\theta(x_i),y_i). \]

This raises two important questions:

Does the loss encode the statistical assumption we actually want?
Has the finite-sample estimate been distorted by masking, sampling, or class weighting?

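In deep learning code, the second question goes wrong more often than the first. In language modeling, for example, the number of valid tokens can differ across the shards of a single batch; if each shard simply calls mean(), every GPU and every micro-batch may end up with a different denominator.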

Definition: Loss Denominator

The loss denominator is the quantity used to normalize summed per-example or per-token losses. It determines gradient scale and should match the intended unit of optimization: examples, sequences, tokens, or valid masked tokens.

In distributed training, a token-normalized loss is better written as a global numerator over a global denominator:

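import torch.distributed as dist

# Sum of per-token losses and number of valid tokens on this rank.
local_loss_sum = (loss_per_token * mask).sum()
local_count = mask.sum()

# Total valid-token count across all ranks (no gradient needed).
global_count = local_count.detach().clone()
dist.all_reduce(global_count, op=dist.ReduceOp.SUM)

# DDP averages gradients across ranks, so multiply by world size.
loss = local_loss_sum * dist.get_world_size() / global_count.clamp_min(1)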

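If each GPU first divides by its local token count and the gradients are then averaged, long and short samples end up with different weights depending on which GPU they land on. This kind of bug does not necessarily make the loss blow up, but it silently changes the objective.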
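To make the difference concrete, here is a small framework-free sketch that simulates two ranks with plain Python lists. The per-token losses and the two-rank split are made-up illustrative values, not output from a real run.

```python
# Simulated per-token losses on two "ranks" (illustrative values only).
rank_losses = [
    [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],  # rank 0: six valid tokens
    [8.0, 8.0],                      # rank 1: two valid tokens
]

# Option A: each rank divides by its local token count, then ranks are averaged.
# This is what a per-rank mean followed by gradient averaging amounts to.
local_means = [sum(tokens) / len(tokens) for tokens in rank_losses]
mean_of_local_means = sum(local_means) / len(local_means)

# Option B: global numerator / global denominator.
global_sum = sum(sum(tokens) for tokens in rank_losses)
global_count = sum(len(tokens) for tokens in rank_losses)
global_token_mean = global_sum / global_count

print(mean_of_local_means)  # 5.0
print(global_token_mean)    # 3.5
```

Under option A the two tokens on rank 1 carry as much total weight as the six tokens on rank 0; under option B every valid token has the same weight.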

Negative Log-Likelihood

Many losses come from maximum likelihood. If the model outputs a conditional distribution \(p_\theta(y\mid x)\), maximizing the log-likelihood is equivalent to minimizing

\[ \mathcal{L}_{\text{NLL}} = -\log p_\theta(y\mid x). \]

Losses often differ only in the distribution that \(y\) is assumed to follow:

Target type	Distribution	Common loss
real value	Gaussian	MSE
binary label	Bernoulli	BCE
single class	categorical	cross entropy
multi-label	independent Bernoulli	BCE per class
pair preference	Bradley-Terry	pairwise logistic
embedding match	categorical over negatives	InfoNCE

Mean Squared Error

Assume

\[ y=f_\theta(x)+\epsilon, \qquad \epsilon\sim\mathcal{N}(0,\sigma^2I), \]

Then

\[ p(y\mid x) \propto \exp\left( -\frac{1}{2\sigma^2}\|y-f_\theta(x)\|^2 \right). \]

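After dropping additive constants and the positive scale factor \(1/(2\sigma^2)\), the negative log-likelihood is the squared error: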

\[ \mathcal{L}_{\text{MSE}} = \|y-f_\theta(x)\|^2. \]

Proof

Gaussian log likelihood:

\[ \log p(y\mid x) = -\frac{d}{2}\log(2\pi\sigma^2) -\frac{1}{2\sigma^2}\|y-f_\theta(x)\|^2. \]

With \(\sigma\) fixed, maximizing this is equivalent to minimizing the squared error.

MSE is sensitive to outliers because doubling the error quadruples the loss. If the regression noise is heavy-tailed, consider Huber loss or MAE.

MAE, Huber, and Robust Regression

If the noise is not Gaussian but closer to a Laplace distribution,

\[ p(y\mid x) \propto \exp\left(-\frac{|y-f_\theta(x)|}{b}\right), \]

then the negative log-likelihood corresponds to MAE:

\[ \mathcal{L}_{\text{MAE}} = |y-f_\theta(x)|. \]

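MAE is more robust to outliers, but it is not differentiable at \(0\), and its gradient magnitude does not depend on the size of the error. Huber loss splices MSE and MAE together: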

\[ \ell_\delta(r) = \begin{cases} \frac12 r^2,& |r|\le\delta,\\ \delta(|r|-\frac12\delta),& |r|>\delta, \end{cases} \]

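where \(r=y-\hat y\). For small errors it behaves like MSE and gives a smooth gradient; for large errors it behaves like MAE and limits the influence of outliers.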

Definition: Robust Loss

A robust loss reduces the influence of large residuals compared with squared error, often by making the gradient grow sublinearly or saturate for outliers.

The gradient of the Huber loss is

\[ \frac{\partial \ell_\delta}{\partial \hat y} = \begin{cases} -r,& |r|\le\delta,\\ -\delta\,\operatorname{sign}(r),& |r|>\delta. \end{cases} \]

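so the gradient magnitude is capped at \(\delta\), and an extreme outlier cannot produce an arbitrarily large gradient. The same idea is common in bounding-box regression for object detection, value-function regression, and regression with noisy labels.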
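The piecewise definition and its gradient can be sketched in a few lines of plain Python. The function names and the choice \(\delta=1\) are illustrative assumptions; the squared-error gradient is shown next to the Huber gradient for comparison.

```python
def huber(r, delta=1.0):
    """Huber loss for a residual r = y - y_hat."""
    if abs(r) <= delta:
        return 0.5 * r * r
    return delta * (abs(r) - 0.5 * delta)

def huber_grad_wrt_pred(r, delta=1.0):
    """Derivative of the Huber loss with respect to y_hat (note r = y - y_hat)."""
    if abs(r) <= delta:
        return -r
    return -delta if r > 0 else delta

for r in (0.5, 1.0, 10.0, 1000.0):
    squared_error_grad = -2.0 * r  # derivative of r**2 with respect to y_hat
    print(r, huber(r), huber_grad_wrt_pred(r), squared_error_grad)
```

For the residual of 1000 the squared-error gradient is -2000, while the Huber gradient stays at -1.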

Binary Cross Entropy

In binary classification the model outputs a logit \(z\), with probability

\[ p=\sigma(z)=\frac{1}{1+e^{-z}}. \]

The Bernoulli negative log-likelihood is

\[ \mathcal{L}_{\text{BCE}} = -y\log p-(1-y)\log(1-p). \]

A more stable implementation is BCEWithLogitsLoss, which takes logits directly instead of a manually applied sigmoid.

loss = torch.nn.functional.binary_cross_entropy_with_logits(
    logits,   # [B] or [B, C]
    targets,  # float tensor, same shape
)

Pitfall: Do Not Apply Sigmoid Before BCEWithLogitsLoss

BCEWithLogitsLoss already contains a numerically stable sigmoid + BCE computation. Applying sigmoid first means the sigmoid is applied twice, which trains a different objective, changes the gradients, and discards the stability benefit.

The numerically stable form of BCE with logits is

\[ \ell(z,y) = \max(z,0)-zy+\log(1+\exp(-|z|)). \]

It avoids the overflow and underflow that occur when \(\log\sigma(z)\) or \(\log(1-\sigma(z))\) is evaluated directly.

Proof Sketch

The Bernoulli NLL is

\[ -y\log\sigma(z)-(1-y)\log(1-\sigma(z)). \]

Using

\[ \log\sigma(z)=z-\log(1+e^z), \qquad \log(1-\sigma(z))=-\log(1+e^z), \]

we obtain

\[ \ell(z,y)=\log(1+e^z)-yz. \]

Then rewrite \(\log(1+e^z)\) in the numerically stable form

\[ \max(z,0)+\log(1+\exp(-|z|)). \]

The gradient with respect to the logit is

\[ \frac{\partial \ell}{\partial z} = \sigma(z)-y. \]

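This has exactly the same \(p-y\) form as multiclass CE. The difference is that BCE assumes each label is an independent Bernoulli variable, which suits multi-label problems, whereas softmax CE assumes mutually exclusive classes, which suits single-label problems.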
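The stable form and the gradient \(\sigma(z)-y\) can be checked numerically without any framework. The sketch below uses only the standard library; the helper names are illustrative, and the naive version is included only to show where it breaks.

```python
import math

def sigmoid(z):
    # Branch on the sign so exp() never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def bce_with_logits(z, y):
    """Stable form: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

def bce_naive(z, y):
    """Direct Bernoulli NLL on p = sigmoid(z); breaks once p rounds to 0 or 1."""
    p = sigmoid(z)
    return -y * math.log(p) - (1.0 - y) * math.log(1.0 - p)

# The two forms agree for moderate logits.
assert abs(bce_with_logits(2.0, 1.0) - bce_naive(2.0, 1.0)) < 1e-12

# For an extreme logit the stable form stays finite; the naive form hits log(0).
print(bce_with_logits(100.0, 0.0))  # 100.0
try:
    bce_naive(100.0, 0.0)
except ValueError as err:
    print("naive form failed:", err)

# Finite-difference check of d loss / d z = sigmoid(z) - y.
z, y, h = 0.7, 1.0, 1e-6
numeric = (bce_with_logits(z + h, y) - bce_with_logits(z - h, y)) / (2 * h)
assert abs(numeric - (sigmoid(z) - y)) < 1e-6
```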

Pitfall: Multi-Class and Multi-Label Are Different

Use softmax cross entropy when exactly one class is correct. Use BCE over classes when multiple classes can be true simultaneously. Treating multi-label data as softmax classes forces false competition between labels.

Under class imbalance, pos_weight scales up the positive term:

loss = F.binary_cross_entropy_with_logits(
    logits,
    targets.float(),
    pos_weight=pos_weight,  # [C], roughly neg_count / pos_count
)

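pos_weight changes the objective itself, not just the calibration of a metric. If the sigmoid outputs are to be used as probabilities after training, threshold tuning or calibration is usually still needed.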
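One way to see that pos_weight changes the objective is to ask which constant probability minimizes the expected weighted loss. The sketch below assumes a positive rate of 0.1 and does a simple grid search; setting the derivative to zero gives the closed form \(p^\*=w\pi/(w\pi+1-\pi)\) for weight \(w\) and positive rate \(\pi\). The function names are illustrative.

```python
import math

def expected_weighted_bce(p, pos_rate, pos_weight):
    """Expected loss when a fraction pos_rate of labels is positive
    and the positive term is multiplied by pos_weight."""
    return (-pos_weight * pos_rate * math.log(p)
            - (1.0 - pos_rate) * math.log(1.0 - p))

def best_constant_prediction(pos_rate, pos_weight, steps=9999):
    """Grid search for the probability that minimizes the expected loss."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return min(grid, key=lambda p: expected_weighted_bce(p, pos_rate, pos_weight))

pos_rate = 0.1
print(best_constant_prediction(pos_rate, pos_weight=1.0))  # close to 0.1
print(best_constant_prediction(pos_rate, pos_weight=9.0))  # close to 0.5
```

With pos_weight equal to neg_count / pos_count, the loss-minimizing output for a 10% positive rate sits near 0.5 rather than 0.1, which is why the raw sigmoid output no longer reads as a calibrated probability.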

Multiclass Cross Entropy

In single-label multiclass classification, with logits \(z\in\mathbb{R}^{K}\):

\[ p_k=\frac{e^{z_k}}{\sum_j e^{z_j}}. \]

With true class \(y\), the cross entropy is

\[ \mathcal{L}_{\text{CE}} = -\log p_y = -z_y+\log\sum_j e^{z_j}. \]

Gradient:

\[ \frac{\partial\mathcal{L}}{\partial z_k} = p_k-\mathbf{1}[k=y]. \]

Proof

From

\[ \mathcal{L}=-z_y+\log\sum_j e^{z_j}, \]

differentiate with respect to \(z_k\):

\[ \frac{\partial \mathcal{L}}{\partial z_k} = -\mathbf{1}[k=y] + \frac{e^{z_k}}{\sum_j e^{z_j}} = p_k-\mathbf{1}[k=y]. \]

PyTorch shape:

loss = torch.nn.functional.cross_entropy(
    logits,  # [B, K]
    labels,  # [B], dtype long, values 0..K-1
)

For a language model:

loss = torch.nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size),  # [B*T, V]
    labels.reshape(-1),              # [B*T]
    ignore_index=-100,
)

ignore_index=-100 is commonly used as the loss mask for pad tokens or prompt tokens.

A numerically stable CE implementation first subtracts the maximum logit:

\[ \log\sum_j e^{z_j} = m+\log\sum_j e^{z_j-m}, \qquad m=\max_j z_j. \]

This works because softmax is invariant to a uniform shift of the logits:

\[ \operatorname{softmax}(z+c\mathbf{1}) = \operatorname{softmax}(z). \]

Proof

For any class \(k\),

\[ \frac{e^{z_k+c}}{\sum_j e^{z_j+c}} = \frac{e^c e^{z_k}}{e^c\sum_j e^{z_j}} = \frac{e^{z_k}}{\sum_j e^{z_j}}. \]
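Both the max-subtraction trick and the gradient \(p_k-\mathbf{1}[k=y]\) can be verified with a short standard-library sketch. The helper names and the example logits are illustrative.

```python
import math

def log_softmax(z):
    """Stable log-softmax: subtract the max logit before exponentiating."""
    m = max(z)
    log_sum = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - log_sum for v in z]

def cross_entropy(z, y):
    """Loss for logits z and class index y."""
    return -log_softmax(z)[y]

def grad_logits(z, y):
    """Analytic gradient: softmax(z) minus the one-hot target."""
    return [math.exp(lp) - (1.0 if k == y else 0.0)
            for k, lp in enumerate(log_softmax(z))]

z, y = [1.0, -2.0, 0.5], 0

# Shift invariance, including a shift that would overflow a naive exp().
shifted = [v + 1000.0 for v in z]
assert abs(cross_entropy(z, y) - cross_entropy(shifted, y)) < 1e-9

# Finite-difference check of p_k - 1[k = y].
h = 1e-6
for k, analytic in enumerate(grad_logits(z, y)):
    up = [v + (h if j == k else 0.0) for j, v in enumerate(z)]
    down = [v - (h if j == k else 0.0) for j, v in enumerate(z)]
    numeric = (cross_entropy(up, y) - cross_entropy(down, y)) / (2 * h)
    assert abs(numeric - analytic) < 1e-6
```

The gradient entries sum to zero, which is another way of seeing that a uniform shift of the logits does not change the loss.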

Cross entropy can also be read as the KL divergence from the target distribution \(q\) to the model distribution \(p_\theta\), plus an entropy term that does not depend on the model:

\[ H(q,p_\theta) = H(q)+\operatorname{KL}(q\Vert p_\theta). \]

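When \(q\) is fixed, minimizing CE is equivalent to minimizing \(\operatorname{KL}(q\Vert p_\theta)\). This explains why soft labels, distillation, and label smoothing all fit into the same cross_entropy framework: the target need not be one-hot; it can be any distribution.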

Definition: Proper Scoring Rule

A scoring rule is proper if the expected score is optimized by predicting the true distribution. Cross entropy is proper: in expectation, it encourages calibrated probabilities rather than only correct argmax labels.

Label Smoothing

One-hot labels are very sharp and can encourage overconfidence. Label smoothing changes the target distribution to

\[ q_k = (1-\epsilon)\mathbf{1}[k=y]+\frac{\epsilon}{K}. \]

The loss is

\[ \mathcal{L} = -\sum_k q_k\log p_k. \]

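This lowers the target probability of the true class from \(1\) to \(1-\epsilon+\epsilon/K\) and gives every other class \(\epsilon/K\). The effect is less pressure to grow the logit margin and often better calibration, but it can hurt tasks that need extremely high confidence.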

Definition: Calibration

A classifier is calibrated if predictions with confidence \(p\) are correct approximately \(p\) fraction of the time.

The label-smoothed CE decomposes as follows, where \(H(y,p)\) is the cross entropy against the one-hot target:

\[ H(q,p) = (1-\epsilon)H(y,p)+\epsilon H(u,p), \qquad u_k=\frac1K. \]

Because

\[ H(u,p)=H(u)+\operatorname{KL}(u\Vert p), \]

the extra term penalizes the model for pushing any class probability too close to \(0\). This can improve calibration, but it also reduces the maximum logit margin.
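The decomposition is easy to confirm numerically. The sketch below uses made-up model probabilities and \(\epsilon=0.1\); the helper names are illustrative.

```python
import math

def smoothed_targets(y, num_classes, eps):
    """q_k = (1 - eps) * 1[k = y] + eps / K."""
    return [(1.0 - eps) * (1.0 if k == y else 0.0) + eps / num_classes
            for k in range(num_classes)]

def cross_entropy(q, p):
    """H(q, p) = -sum_k q_k log p_k for two probability vectors."""
    return -sum(qk * math.log(pk) for qk, pk in zip(q, p) if qk > 0.0)

p = [0.7, 0.2, 0.1]  # illustrative model probabilities
y, K, eps = 0, 3, 0.1

q = smoothed_targets(y, K, eps)
one_hot = smoothed_targets(y, K, 0.0)
uniform = [1.0 / K] * K

lhs = cross_entropy(q, p)
rhs = (1.0 - eps) * cross_entropy(one_hot, p) + eps * cross_entropy(uniform, p)
assert abs(lhs - rhs) < 1e-12
assert abs(q[y] - (1.0 - eps + eps / K)) < 1e-12
```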

In distillation, the teacher already provides a soft distribution:

\[ q_k^{T} = \operatorname{softmax}(z^{\text{teacher}}/T)_k. \]

Applying label smoothing on top of this can wash out the dark knowledge in the teacher distribution. In practice, distinguish three cases:

target	contains class similarity?	smoothing advice
one-hot label	no	smoothing can help calibration
teacher soft label	yes	avoid extra smoothing unless validated
noisy weak label	uncertain	smoothing may help but can hide noise

Class Imbalance and Focal Loss

Under extreme class imbalance, plain CE is dominated by the large number of easy negatives. Focal loss:

\[ \mathcal{L}_{\text{focal}} = -(1-p_y)^\gamma\log p_y. \]

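When a sample is already classified correctly and \(p_y\) is large, \((1-p_y)^\gamma\) is small and its loss is down-weighted, so hard samples receive larger relative weight.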
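A quick numeric sketch shows how strongly the modulating factor rescales the loss. The probabilities and \(\gamma=2\) are illustrative choices, and the helper names are not from any library.

```python
import math

def focal_loss(p_true, gamma):
    """Focal loss for the probability assigned to the true class."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

def cross_entropy(p_true):
    return -math.log(p_true)

for p_true in (0.9, 0.5, 0.1):
    ratio = focal_loss(p_true, gamma=2.0) / cross_entropy(p_true)
    print(p_true, round(ratio, 4))
# The ratio is just (1 - p_true) ** 2: 0.01, 0.25, 0.81.
```

An easy sample at \(p_y=0.9\) keeps about 1% of its CE loss, while a hard sample at \(p_y=0.1\) keeps about 81%.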

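A word of caution for implementation: focal loss is not a universal replacement for class weighting. When label noise is high it amplifies hard samples, and hard samples may be exactly the mislabeled ones.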

A more direct way to handle class imbalance is class-weighted CE.

\[ \mathcal{L} = -w_y\log p_y. \]

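Here \(w_y\) can be the inverse class frequency, a weight based on the effective number of samples, or a task-specified cost. It changes the class weights of the training distribution; it is not simply a way to "make the minority class more accurate".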

Definition: Cost-Sensitive Loss

A cost-sensitive loss assigns different costs to different labels or errors, so empirical risk reflects task utility rather than raw sample frequency.

The focal modulating factor

\[ (1-p_y)^\gamma \]

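down-weights easy samples. With \(\gamma=0\) it reduces to CE; the larger \(\gamma\) is, the more the model focuses on samples that currently have low confidence. A class weight \(\alpha_y\) is often added: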

\[ \mathcal{L}_{\text{focal}} = -\alpha_y(1-p_y)^\gamma\log p_y. \]

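In an implementation, do not use a detached \(p_y\) as the weight, because then the modulating factor itself carries no gradient. The exception is when you deliberately want reweighting rather than the original focal objective.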

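log_probs = F.log_softmax(logits, dim=-1)                             # [B, K]
log_pt = log_probs.gather(dim=-1, index=labels[:, None]).squeeze(-1)  # [B]
pt = log_pt.exp()  # stays in the graph so the modulating factor has a gradient
loss = -alpha[labels] * (1 - pt).pow(gamma) * log_pt  # per-sample loss; alpha is a [K] tensor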

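Class imbalance also affects evaluation: a falling training loss does not imply a rising macro-F1. For minority-class tasks, look at least at per-class precision/recall, the confusion matrix, and calibration together.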

Contrastive Loss and InfoNCE

Given a query \(q\), a positive key \(k^+\), and negative keys \(\{k_j^-\}\), InfoNCE is

\[ \mathcal{L} = - \log \frac{\exp(\operatorname{sim}(q,k^+)/\tau)} {\exp(\operatorname{sim}(q,k^+)/\tau) + \sum_j\exp(\operatorname{sim}(q,k_j^-)/\tau)}. \]

This is simply a cross entropy in which the class is "which key is the positive".

If every pair \((q_i,k_i)\) in the batch is a positive pair, the logits with in-batch negatives are

\[ S_{ij} = \frac{q_i^\top k_j}{\tau}. \]

and the labels are

\[ y_i=i. \]

PyTorch:

q = torch.nn.functional.normalize(q, dim=-1)
k = torch.nn.functional.normalize(k, dim=-1)
logits = q @ k.T / temperature  # [B, B]
labels = torch.arange(q.size(0), device=q.device)
loss = torch.nn.functional.cross_entropy(logits, labels)

Pitfall: In-Batch Negatives Assume Other Pairs Are Negative

In-batch contrastive learning treats other examples in the batch as negatives. This can be false when the batch contains semantically equivalent positives.

The temperature \(\tau\) controls the scale of the logits:

\[ S_{ij}=\frac{q_i^\top k_j}{\tau}. \]

A small \(\tau\) amplifies similarity differences and sharpens the softmax; a large \(\tau\) flattens the distribution. The gradient can be written as

\[ \frac{\partial \mathcal{L}_i}{\partial S_{ij}} = p_{ij}-\mathbf{1}[j=i]. \]

and with respect to the similarity \(s_{ij}=q_i^\top k_j\),

\[ \frac{\partial \mathcal{L}_i}{\partial s_{ij}} = \frac{1}{\tau} \left(p_{ij}-\mathbf{1}[j=i]\right). \]

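So the temperature changes not only the probabilities but also scales the gradient directly. A temperature that is too small can make gradients too peaked and training unstable; one that is too large leaves negatives poorly separated.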
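The effect of \(\tau\) on both the softmax and the gradient can be inspected with a small standard-library sketch for a single query. The similarity values are made up, index 0 is assumed to be the positive, and the helper names are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_wrt_similarity(sims, positive, tau):
    """d loss / d s_j = (p_j - 1[j = positive]) / tau for one query."""
    probs = softmax([s / tau for s in sims])
    return [(p - (1.0 if j == positive else 0.0)) / tau
            for j, p in enumerate(probs)]

sims = [0.8, 0.6, 0.1]  # illustrative cosine similarities; index 0 is the positive
for tau in (1.0, 0.1, 0.01):
    probs = softmax([s / tau for s in sims])
    grads = grad_wrt_similarity(sims, positive=0, tau=tau)
    print(tau, [round(p, 3) for p in probs], [round(g, 3) for g in grads])
```

Printing the three rows side by side shows the distribution sharpening as \(\tau\) shrinks, while the \(1/\tau\) factor rescales whatever probability error remains.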

In multi-GPU training, in-batch negatives usually have to be gathered across GPUs:

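# all_gather_with_grad is assumed to be a user-defined helper (not a PyTorch
# built-in) that gathers tensors from all ranks without cutting autograd.
# Every rank is assumed to hold the same local batch size B.
q = F.normalize(q, dim=-1)
k = F.normalize(k, dim=-1)
k_all = all_gather_with_grad(k)     # [world_size * B, D]
logits = q @ k_all.T / temperature  # [B, world_size * B]
rank = dist.get_rank()              # index of this process
labels = rank * q.size(0) + torch.arange(q.size(0), device=q.device)
loss = F.cross_entropy(logits, labels)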

There are two pitfalls here:

The labels must include the rank offset.
If the key encoder should also receive cross-GPU gradients, the gather must not cut the gradient.

Pairwise Ranking and Preference Loss

In many tasks the supervision is not an absolute label but a statement that "\(a\) is better than \(b\)". Let the model assign a score \(s_\theta(x,y)\); the pairwise logistic loss is

\[ \mathcal{L} = -\log\sigma(s_\theta(x,y^+)-s_\theta(x,y^-)). \]

This is the NLL of the Bradley-Terry preference model, in which the win probability between two candidates is

\[ P(y^+\succ y^-\mid x) = \sigma(s_\theta(x,y^+)-s_\theta(x,y^-)). \]

Definition: Pairwise Logistic Loss

Pairwise logistic loss trains a scoring function so preferred items receive higher scores than dispreferred items, using a sigmoid model over score differences.

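When \(s^+-s^-\) is already large the gradient is small; when the negative scores higher the gradient is strong. This loss is the foundation for understanding reranking, reward models, and preference optimization.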

A common implementation:

margin = chosen_scores - rejected_scores
loss = -F.logsigmoid(margin).mean()

If the scores come from sequence log-probs, first decide whether the sequence score is a sum or a mean:

\[ s_{\text{sum}}(y)=\sum_t \log p_\theta(y_t\mid x,y_{<t}), \qquad s_{\text{mean}}(y)=\frac1T\sum_t \log p_\theta(y_t\mid x,y_{<t}). \]

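The sum favors short answers because every token log-prob is at most zero, so each extra token can only lower the score; the mean weakens the length effect but changes the preference semantics. DPO, reward modeling, and reranking all need this choice to be explicit.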
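The length effect is visible even in a toy case. The sketch below assumes two hypothetical answers whose tokens all have the same log-prob of -0.5 and differ only in length.

```python
def sequence_scores(token_log_probs):
    """Return (sum score, mean score) for one sequence of token log-probs."""
    total = sum(token_log_probs)
    return total, total / len(token_log_probs)

# Two hypothetical answers with identical per-token quality but different lengths.
short_answer = [-0.5] * 4
long_answer = [-0.5] * 12

short_sum, short_mean = sequence_scores(short_answer)
long_sum, long_mean = sequence_scores(long_answer)

print(short_sum, long_sum)    # -2.0 -6.0
print(short_mean, long_mean)  # -0.5 -0.5
```

Under the sum score the short answer wins by 4 nats purely because of its length; under the mean score the two are tied.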

Sequence Loss: Token vs. Sequence Normalization

For language models, the common token-level loss is

\[ \mathcal{L} = \frac{1}{\sum_i T_i} \sum_i\sum_{t=1}^{T_i} -\log p_\theta(x_{i,t}\mid x_{i,<t}). \]

Alternatively, average within each sequence first and then over the batch:

\[ \mathcal{L} = \frac{1}{B} \sum_i \frac{1}{T_i} \sum_{t=1}^{T_i} -\log p_\theta(x_{i,t}\mid x_{i,<t}). \]

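The two are different. Token-normalized loss lets long samples contribute more gradient; sequence-normalized loss gives every sample equal weight. In SFT, DPO, summarization, and long-context training this affects what the model is biased toward.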
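A two-sequence toy batch makes the gap between the two formulas explicit. The per-token NLL values are illustrative only.

```python
# Per-token NLL for a batch of two sequences (illustrative values only).
batch = [
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],  # long sequence, 8 tokens
    [3.0, 3.0],                                # short sequence, 2 tokens
]

# Token-normalized: one global sum divided by the total token count.
token_mean = sum(sum(seq) for seq in batch) / sum(len(seq) for seq in batch)

# Sequence-normalized: average within each sequence, then across the batch.
sequence_mean = sum(sum(seq) / len(seq) for seq in batch) / len(batch)

print(token_mean)     # 1.4
print(sequence_mean)  # 2.0
```

In the token-normalized loss each token has weight 1/10; in the sequence-normalized loss a token of the long sequence has weight 1/16 and a token of the short sequence has weight 1/4.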

Loss Masking

In LLM SFT, prompt tokens usually do not contribute to the loss; only the assistant answer is trained:

input:  <user> ... <assistant> answer tokens
labels: -100  ... -100        answer tokens

After masking, the objective is

\[ \mathcal{L} = \frac{1}{|\mathcal{M}|} \sum_{t\in\mathcal{M}} -\log p_\theta(x_t\mid x_{<t}), \]

where \(\mathcal{M}\) is the set of positions to be trained.
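Building such labels can be sketched in plain Python. The token ids are hypothetical, and build_sft_labels is an illustrative helper rather than a library function. The labels here are aligned with input_ids; the one-position shift for next-token prediction is assumed to be applied elsewhere in the training stack.

```python
IGNORE_INDEX = -100  # the default ignore_index of PyTorch's cross_entropy

def build_sft_labels(prompt_ids, answer_ids):
    """Concatenate prompt and answer; supervise only the answer positions."""
    input_ids = list(prompt_ids) + list(answer_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels

# Hypothetical token ids, purely for illustration.
input_ids, labels = build_sft_labels(prompt_ids=[11, 12, 13], answer_ids=[21, 22])
trained_positions = [t for t, label in enumerate(labels) if label != IGNORE_INDEX]

print(input_ids)          # [11, 12, 13, 21, 22]
print(labels)             # [-100, -100, -100, 21, 22]
print(trained_positions)  # [3, 4]
```

The list trained_positions plays the role of \(\mathcal{M}\), and its length is the loss denominator.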

Pitfall: Attention Mask and Loss Mask Are Different

Attention mask controls what tokens can be read. Loss mask controls which positions contribute gradients. Padding, prompts, and packed examples often need both.

When implementing a masked token loss, an explicit numerator/denominator is recommended:

loss_flat = F.cross_entropy(
    logits.reshape(-1, vocab),
    labels.reshape(-1),
    ignore_index=ignore_index,  # masked positions get zero loss
    reduction="none",
).view_as(labels)

mask = labels.ne(ignore_index)
loss = (loss_flat * mask).sum() / mask.sum().clamp_min(1)

The ignore_index version is more concise, but a hand-written numerator/denominator is easier to audit under multi-GPU training, gradient accumulation, and packing.

CTC and Latent Alignment

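Some sequence tasks have no token-level alignment. In speech recognition, for example, the input acoustic frames are far more numerous than the output characters, and it is unknown which frames each character corresponds to. Connectionist Temporal Classification (CTC) treats the alignment as a latent variable.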

Given input \(x\) and target sequence \(y\), CTC maximizes the total probability of all paths that collapse to \(y\):

\[ p(y\mid x) = \sum_{\pi\in\mathcal{B}^{-1}(y)} \prod_{t=1}^{T}p_\theta(\pi_t\mid x). \]

where \(\pi_t\) is the frame-level label, which may be blank, and \(\mathcal{B}\) first merges consecutive repeated labels and then removes blanks. The loss is

\[ \mathcal{L}_{\text{CTC}} = -\log p(y\mid x). \]

Definition: Latent Alignment Objective

A latent alignment objective sums or maximizes over unknown alignments between inputs and targets instead of requiring one observed label per timestep.
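For a tiny example the sum over alignments can be computed by brute force, which makes the collapse map \(\mathcal{B}\) concrete. This is an illustrative sketch only: it is exponential in the number of frames, the uniform frame probabilities are made up, and blank is assumed to be class 0.

```python
import itertools
import math

BLANK = 0

def collapse(path):
    """The CTC map B: merge consecutive repeated labels, then drop blanks."""
    out = []
    prev = None
    for label in path:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return out

def ctc_prob_brute_force(frame_probs, target):
    """Sum the probability of every frame-level path that collapses to target.
    Exponential in the number of frames, so only usable for tiny examples."""
    num_classes = len(frame_probs[0])
    total = 0.0
    for path in itertools.product(range(num_classes), repeat=len(frame_probs)):
        if collapse(path) == list(target):
            total += math.prod(frame_probs[t][c] for t, c in enumerate(path))
    return total

# Three frames, classes {blank, 1, 2}; uniform probabilities for easy checking.
frame_probs = [[1 / 3] * 3 for _ in range(3)]
p = ctc_prob_brute_force(frame_probs, target=[1])
print(p, -math.log(p))
```

Six of the 27 possible paths collapse to the target [1] (for example 1-blank-blank, blank-1-1, and 1-1-1), so the probability is 6/27. The path 1-blank-1 is excluded because it collapses to [1, 1].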

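CTC is usually implemented with dynamic programming, since enumerating every \(\pi\) directly is infeasible. The PyTorch interface takes log-probs and lengths: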

log_probs = F.log_softmax(frame_logits, dim=-1)  # [T, B, C]
loss = F.ctc_loss(
    log_probs,
    targets,
    input_lengths,
    target_lengths,
    blank=blank_id,
    zero_infinity=True,
)

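CTC's pitfalls are concrete: log_probs has shape [T, B, C], not [B, T, C]; the input length must be at least the target length plus the number of adjacent repeated target tokens, since each repeat needs a separating blank; and the blank id must not collide with a real token.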

Objective vs. Metric

The training objective and the evaluation metric do not necessarily agree:

Task	Common training loss	Common metric
classification	CE	accuracy/F1/AUROC
language modeling	token NLL	perplexity/downstream eval
retrieval	InfoNCE	Recall@K/MRR
generation	CE/SFT/DPO	human preference/win rate
regression	MSE/MAE	RMSE/MAE/R2

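When the objective and the metric disagree, treat the loss as a surrogate objective. It is optimizable, differentiable, and stable, but it is not necessarily equal to the target you actually care about.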

Common mismatches:

mismatch	consequence	mitigation
CE vs F1	majority class dominates	class weights, threshold tuning
token NLL vs answer quality	high local token likelihood may not imply helpfulness	SFT data quality, preference tuning
MSE vs rank correlation	good average error but bad ordering	pairwise/ranking loss
InfoNCE vs Recall@K	softmax over batch not same as corpus retrieval	hard negatives, large/global negatives
reward loss vs human preference	reward model overfits annotation artifacts	held-out preference eval

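The benefit of a surrogate loss is that it is stable and differentiable; the drawback is that it can optimize toward behavior the metric does not want. Training logs should therefore record at least both the objective and the task metric; looking at the loss alone makes misjudgment easy.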
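A small constructed example shows that a lower loss does not guarantee a better metric. The two "models" below are hypothetical lists of the probability assigned to the correct class in a binary task.

```python
import math

def mean_nll(true_class_probs):
    """Average negative log-likelihood of the correct class."""
    return sum(-math.log(p) for p in true_class_probs) / len(true_class_probs)

def accuracy(true_class_probs, threshold=0.5):
    """Binary case: a prediction is correct if the true class gets more than 0.5."""
    return sum(p > threshold for p in true_class_probs) / len(true_class_probs)

# Probability each hypothetical model assigns to the correct class on 4 examples.
model_a = [0.6, 0.6, 0.6, 0.6]     # modest confidence, always right
model_b = [0.99, 0.99, 0.99, 0.4]  # very confident, wrong once

print(mean_nll(model_a), accuracy(model_a))
print(mean_nll(model_b), accuracy(model_b))
```

Model B has the lower NLL but also the lower accuracy, so ranking the two by loss alone would pick the model that is worse on the metric.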

Engineering Checklist

Check	Why
logits vs. probabilities	CE/BCEWithLogits expect logits
label dtype	CE labels are long, BCE targets are float
shape	`[B,K]` vs `[B,T,V]` is easy to get wrong
ignore index	whether pad/prompt tokens are masked
reduction	token mean or sequence mean
class weights	whether class imbalance needs handling
temperature	whether the contrastive logit scale is appropriate
distributed loss	whether the denominator is globally consistent across GPUs

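The loss function is the contract of model training. Write the contract wrong and the loss may still go down, but the model is not learning what you think it is.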

Implementation Checklist

When implementing or debugging an objective, check in the following order:

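whether the input to CE/BCE is logits or probabilities;
whether the label dtype is correct: long for CE, float for BCE;
whether the shapes of logits and labels match the task semantics exactly;
whether the language model applies the correct label shift;
whether padding, prompts, and packed document boundaries are covered by the loss mask;
whether the loss denominator is samples, sequences, tokens, or valid tokens;
whether gradient accumulation and DDP use a globally consistent denominator;
whether class weights, focal loss, or pos_weight have changed calibration;
whether contrastive labels include the rank offset and the gather keeps the gradients that are needed;
whether the preference/ranking score is a token sum or a token mean;
whether the input lengths, target lengths, and blank id for CTC/latent alignment are correct;
whether the training objective and the validation metric are both logged.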

Three smoke tests:

import torch
import torch.nn.functional as F

# 1. cross entropy takes raw logits (any real values) and integer class ids,
# and returns a scalar under the default mean reduction.
logits = torch.randn(4, 7)
labels = torch.tensor([0, 3, 2, 6])
loss = F.cross_entropy(logits, labels)
assert loss.ndim == 0

# 2. LM label shift preserves one fewer timestep
# (vocab and model are assumed to be defined; model(...).logits is [B, T, V])
idx = torch.randint(0, vocab, (2, 8))
inp, tgt = idx[:, :-1], idx[:, 1:]
logits = model(inp).logits
assert logits.shape[:2] == tgt.shape

# 3. masked denominator counts only valid tokens
labels = torch.tensor([[1, 2, -100], [3, -100, -100]])
mask = labels.ne(-100)
assert mask.sum().item() == 3

These tests are small, but they map directly onto the most common objective bugs: feeding probabilities to a logits loss, misaligned LM targets, and counting padding/prompt tokens in the denominator.