2.9 Loss Functions and Objectives
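A loss function is not just another argument in the training code. It defines which probabilistic assumption the model learns under, which errors cost more, how gradients flow back to the logits, how examples are weighted, and whether the training objective agrees with the evaluation metric.
The same network architecture can learn very different behavior under a different objective. A Transformer, for example, can be trained for autoregressive LM, masked LM, contrastive retrieval, classification, ranking, or preference optimization; the difference lies mainly in how the loss is constructed, not in the block.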
Empirical Risk
Given examples \(\{(x_i,y_i)\}_{i=1}^{n}\) and loss \(\ell\), empirical risk minimization solves \[ \min_\theta \hat{R}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\ell(f_\theta(x_i),y_i). \]
An implementation also has to choose a reduction:
loss = criterion(logits, labels) # mean by default for many PyTorch losses
loss = per_token_loss.mean() # average over valid tokens (assumes padding was already filtered out)
loss = per_sample_loss.sum() / n_tokens # token-normalized
Whether mean averages over samples, tokens, non-pad tokens, or batch elements directly changes the effective learning rate.
If one implementation averages over sequences and another averages over tokens, the same nominal learning rate can produce different gradient magnitudes.
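Empirical risk is a sample approximation of population risk. The quantity we actually want to minimize is
\[ R(\theta) = \mathbb{E}_{(x,y)\sim p_{\text{data}}} \ell(f_\theta(x),y), \]
but we only have a finite training set, so we use
\[ \hat R(\theta) = \frac1n\sum_i \ell(f_\theta(x_i),y_i). \]
This raises two important questions:
- whether the loss encodes the statistical assumption you actually want;
- whether masking, sampling, or class weighting has distorted the finite-sample estimate.
In deep learning code, the second is often easier to get wrong than the first. In language modeling, for example, the number of valid tokens varies within a batch, so calling mean() directly can give each GPU and each micro-batch a different denominator.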
The loss denominator is the quantity used to normalize summed per-example or per-token losses. It determines gradient scale and should match the intended unit of optimization: examples, sequences, tokens, or valid masked tokens.
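In distributed training, a token-normalized loss is better written as global numerator / global denominator:
import torch.distributed as dist
local_loss_sum = (loss_per_token * mask).sum()
local_count = mask.sum()
global_count = local_count.detach().clone()
dist.all_reduce(global_count, op=dist.ReduceOp.SUM)
# DDP averages gradients across ranks, so multiply by world size.
loss = local_loss_sum * dist.get_world_size() / global_count.clamp_min(1)
If each GPU first divides by its local token count and the gradients are then averaged, long and short examples end up weighted differently depending on which GPU they land on. This kind of bug does not necessarily make the loss blow up, but it silently changes the objective.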
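The effect of local versus global denominators can be checked with plain arithmetic. The following is a minimal framework-free sketch that simulates two ranks with made-up per-token losses; the numbers and variable names are illustrative assumptions, and no real process group is involved.

```python
# Framework-free simulation of two data-parallel ranks (hypothetical numbers).
# Each rank holds the per-token losses of the valid (unmasked) tokens it saw.
rank_token_losses = [
    [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],  # rank 0: six valid tokens
    [8.0, 8.0],                      # rank 1: two valid tokens
]
world_size = len(rank_token_losses)

# Target objective: one mean over every valid token in the global batch.
global_sum = sum(sum(losses) for losses in rank_token_losses)
global_count = sum(len(losses) for losses in rank_token_losses)
global_token_mean = global_sum / global_count

# Pitfall: each rank divides by its own count, then ranks are averaged
# (this is what DDP-style gradient averaging does to local means).
mean_of_local_means = sum(
    sum(losses) / len(losses) for losses in rank_token_losses
) / world_size

# Fix from the text: scale the local sum by world_size / global_count,
# so that averaging across ranks recovers the global token mean.
per_rank_loss = [
    sum(losses) * world_size / global_count for losses in rank_token_losses
]
ddp_averaged = sum(per_rank_loss) / world_size

print(global_token_mean, mean_of_local_means, ddp_averaged)
```

In this toy case the mean of local means is 5.0 while the global token mean is 3.5, so the two tokens on the short rank are heavily over-weighted unless the global denominator is used.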
Negative Log-Likelihood
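Many losses come from maximum likelihood. If the model defines a conditional distribution \(p_\theta(y\mid x)\), maximizing the log-likelihood is equivalent to minimizing
\[ \mathcal{L}_{\text{NLL}} = -\log p_\theta(y\mid x). \]
The difference between losses is often just which distribution you assume \(y\) comes from: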
| Target type | Distribution | Common loss |
|---|---|---|
| real value | Gaussian | MSE |
| binary label | Bernoulli | BCE |
| single class | categorical | cross entropy |
| multi-label | independent Bernoulli | BCE per class |
| pair preference | Bradley-Terry | pairwise logistic |
| embedding match | categorical over negatives | InfoNCE |
Mean Squared Error
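Suppose
\[ y=f_\theta(x)+\epsilon, \qquad \epsilon\sim\mathcal{N}(0,\sigma^2I), \]
then
\[ p(y\mid x) \propto \exp\left( -\frac{1}{2\sigma^2}\|y-f_\theta(x)\|^2 \right). \]
For fixed \(\sigma\), dropping additive constants and the positive factor \(1/(2\sigma^2)\) from the negative log-likelihood leaves the squared error, which becomes MSE once averaged over examples:
\[ \mathcal{L}_{\text{MSE}} = \|y-f_\theta(x)\|^2. \]
Written out in full for \(y\in\mathbb{R}^d\), the Gaussian log-likelihood is
\[ \log p(y\mid x) = -\frac{d}{2}\log(2\pi\sigma^2) -\frac{1}{2\sigma^2}\|y-f_\theta(x)\|^2. \]
Maximizing it is equivalent to minimizing the squared error.
MSE is sensitive to outliers: doubling the error quadruples the loss. If the regression noise is heavy-tailed, consider Huber loss or MAE.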
MAE, Huber, and Robust Regression
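If the noise is not Gaussian but closer to a Laplace distribution,
\[ p(y\mid x) \propto \exp\left(-\frac{|y-f_\theta(x)|}{b}\right), \]
then the negative log-likelihood corresponds to MAE:
\[ \mathcal{L}_{\text{MAE}} = |y-f_\theta(x)|. \]
MAE is more robust to outliers, but it is not differentiable at \(0\), and away from \(0\) its gradient magnitude is constant regardless of how large the error is. Huber loss stitches MSE and MAE together:
\[ \ell_\delta(r) = \begin{cases} \frac12 r^2,& |r|\le\delta,\\ \delta(|r|-\frac12\delta),& |r|>\delta, \end{cases} \]
where \(r=y-\hat y\). For small errors it behaves like MSE and gives smooth gradients; for large errors it behaves like MAE and reduces the influence of outliers.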
A robust loss reduces the influence of large residuals compared with squared error, often by making the gradient grow sublinearly or saturate for outliers.
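The gradient of the Huber loss is
\[ \frac{\partial \ell_\delta}{\partial \hat y} = \begin{cases} -r,& |r|\le\delta,\\ -\delta\,\operatorname{sign}(r),& |r|>\delta. \end{cases} \]
Its magnitude is therefore capped at \(\delta\), so even an extreme outlier cannot produce an unboundedly large gradient. The same idea is common in bounding-box regression for object detection, value-function regression, and regression with noisy labels.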
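The bounded-gradient property can be seen on scalar residuals. This is a small plain-Python sketch for a single prediction, with \(\delta=1\) chosen arbitrarily; the helper names are illustrative, not taken from any library.

```python
def huber(y, y_hat, delta=1.0):
    """Huber loss for a scalar residual r = y - y_hat."""
    r = y - y_hat
    if abs(r) <= delta:
        return 0.5 * r * r
    return delta * (abs(r) - 0.5 * delta)

def huber_grad(y, y_hat, delta=1.0):
    """Derivative of the Huber loss with respect to the prediction y_hat."""
    r = y - y_hat
    if abs(r) <= delta:
        return -r
    return -delta if r > 0 else delta

def mse_grad(y, y_hat):
    """Derivative of 0.5 * (y - y_hat)^2 with respect to y_hat."""
    return -(y - y_hat)

# Small, moderate, and extreme residuals against a prediction of 0.
for target in (0.5, 3.0, 100.0):
    print(target, huber(target, 0.0), huber_grad(target, 0.0), mse_grad(target, 0.0))
```

For the residual of 100 the squared-error gradient has magnitude 100, while the Huber gradient stays at \(\delta\).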
Binary Cross Entropy
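In binary classification the model outputs a logit \(z\), with probability
\[ p=\sigma(z)=\frac{1}{1+e^{-z}}. \]
The Bernoulli negative log-likelihood is
\[ \mathcal{L}_{\text{BCE}} = -y\log p-(1-y)\log(1-p). \]
A more stable implementation is BCEWithLogitsLoss, which takes logits directly instead of a manually applied sigmoid.
loss = torch.nn.functional.binary_cross_entropy_with_logits(
    logits,   # [B] or [B, C]
    targets,  # float tensor, same shape
)
BCEWithLogitsLoss already contains a numerically stable sigmoid + BCE computation. Passing sigmoid outputs into it applies the sigmoid twice, which changes the objective and its gradients; applying sigmoid and then a plain BCE keeps the objective but gives up the stable formulation.
The stable form of BCE with logits is
\[ \ell(z,y) = \max(z,0)-zy+\log(1+\exp(-|z|)). \]
It avoids the overflow/underflow that comes from computing \(\log\sigma(z)\) or \(\log(1-\sigma(z))\) directly.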
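To derive it, start from the Bernoulli NLL
\[ -y\log\sigma(z)-(1-y)\log(1-\sigma(z)). \]
Using
\[ \log\sigma(z)=z-\log(1+e^z), \qquad \log(1-\sigma(z))=-\log(1+e^z), \]
we obtain
\[ \ell(z,y)=\log(1+e^z)-yz. \]
Then rewrite \(\log(1+e^z)\) in the numerically stable form
\[ \max(z,0)+\log(1+\exp(-|z|)). \]
The gradient with respect to the logit is
\[ \frac{\partial \ell}{\partial z} = \sigma(z)-y. \]
This has exactly the same \(p-y\) form as multiclass CE. The difference is that BCE assumes each label is an independent Bernoulli, so it suits multi-label problems; softmax CE assumes mutually exclusive classes, so it suits single-label problems.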
Use softmax cross entropy when exactly one class is correct. Use BCE over classes when multiple classes can be true simultaneously. Treating multi-label data as softmax classes forces false competition between labels.
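The stable BCE-with-logits form and its gradient \(\sigma(z)-y\) can be verified numerically. The sketch below uses only the standard library and scalar inputs; `bce_with_logits` and `bce_naive` are illustrative names, not the PyTorch functions.

```python
import math

def sigmoid(z):
    # Branch on the sign so exp() never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def bce_with_logits(z, y):
    """Stable form: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

def bce_naive(z, y):
    """Textbook form; breaks once sigmoid(z) rounds to exactly 0 or 1."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -y * math.log(p) - (1.0 - y) * math.log(1.0 - p)

# The two forms agree for moderate logits.
for z, y in [(-3.0, 1.0), (0.0, 0.0), (2.5, 1.0), (2.5, 0.0)]:
    assert abs(bce_with_logits(z, y) - bce_naive(z, y)) < 1e-12

# The stable form stays finite at extreme logits, where the naive one fails.
print(bce_with_logits(1000.0, 0.0))   # a confident wrong prediction
print(bce_with_logits(-1000.0, 0.0))  # a confident correct prediction

# Central finite difference versus the analytic gradient sigmoid(z) - y.
z, y, h = 0.7, 1.0, 1e-6
numeric = (bce_with_logits(z + h, y) - bce_with_logits(z - h, y)) / (2 * h)
analytic = sigmoid(z) - y
print(numeric, analytic)
```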
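Under class imbalance, pos_weight scales up the positive term:
loss = F.binary_cross_entropy_with_logits(
    logits,
    targets.float(),
    pos_weight=pos_weight,  # [C], roughly neg_count / pos_count
)
pos_weight changes the objective itself, not just a metric or a calibration setting. If sigmoid(logit) is to be used as a probability after training, threshold tuning or calibration is usually still needed.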
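One way to see that pos_weight changes the objective is to ask which constant prediction minimizes the expected weighted loss. The sketch below assumes a hypothetical 10% positive rate and a model that outputs the same probability for every example; setting the derivative of the expected loss to zero gives \(p^\ast = w\pi/(w\pi+1-\pi)\) for positive rate \(\pi\) and weight \(w\), and a grid search confirms it.

```python
import math

def weighted_bce_risk(p, pos_rate, pos_weight):
    """Expected pos_weight-ed BCE when the true positive rate is pos_rate
    and the model predicts probability p for every example."""
    return -(pos_weight * pos_rate * math.log(p)
             + (1.0 - pos_rate) * math.log(1.0 - p))

def best_probability(pos_rate, pos_weight, steps=9999):
    """Grid search for the prediction that minimizes the expected loss."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return min(grid, key=lambda p: weighted_bce_risk(p, pos_rate, pos_weight))

pos_rate = 0.1                           # hypothetical: 10% positives
pos_weight = (1 - pos_rate) / pos_rate   # neg_count / pos_count, about 9

unweighted_optimum = best_probability(pos_rate, 1.0)
weighted_optimum = best_probability(pos_rate, pos_weight)
closed_form = pos_weight * pos_rate / (pos_weight * pos_rate + 1 - pos_rate)
print(unweighted_optimum, weighted_optimum, closed_form)
```

Without weighting the optimum matches the base rate of 0.1; with pos_weight near 9 it moves to about 0.5, which is why the raw sigmoid output is no longer a calibrated probability.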
Multiclass Cross Entropy
In single-label multiclass classification, with logits \(z\in\mathbb{R}^{K}\):
\[ p_k=\frac{e^{z_k}}{\sum_j e^{z_j}}. \]
With true class \(y\), the cross entropy is
\[ \mathcal{L}_{\text{CE}} = -\log p_y = -z_y+\log\sum_j e^{z_j}. \]
Its gradient is
\[ \frac{\partial\mathcal{L}}{\partial z_k} = p_k-\mathbf{1}[k=y]. \]
To see this, start from
\[ \mathcal{L}=-z_y+\log\sum_j e^{z_j}, \]
and differentiate with respect to \(z_k\):
\[ \frac{\partial \mathcal{L}}{\partial z_k} = -\mathbf{1}[k=y] + \frac{e^{z_k}}{\sum_j e^{z_j}} = p_k-\mathbf{1}[k=y]. \]
PyTorch shape:
loss = torch.nn.functional.cross_entropy(
    logits,  # [B, K]
    labels,  # [B], dtype long, values 0..K-1
)
For a language model:
loss = torch.nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size),  # [B*T, V]
    labels.reshape(-1),              # [B*T]
    ignore_index=-100,
)
ignore_index=-100 is commonly used as the loss mask for pad tokens or prompt tokens.
A numerically stable CE implementation first subtracts the maximum logit:
\[ \log\sum_j e^{z_j} = m+\log\sum_j e^{z_j-m}, \qquad m=\max_j z_j. \]
This is valid because softmax is invariant to a uniform shift:
\[ \operatorname{softmax}(z+c\mathbf{1}) = \operatorname{softmax}(z). \]
For any class \(k\),
\[ \frac{e^{z_k+c}}{\sum_j e^{z_j+c}} = \frac{e^c e^{z_k}}{e^c\sum_j e^{z_j}} = \frac{e^{z_k}}{\sum_j e^{z_j}}. \]
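The max-subtraction trick, the shift invariance, and the gradient \(p_k-\mathbf{1}[k=y]\) can all be checked on one small example. This is a plain-Python sketch for a single example with arbitrary logits; it is meant as a numeric check, not as a replacement for a framework implementation.

```python
import math

def log_softmax(logits):
    """Log-softmax using the max-subtraction trick from the text."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

def cross_entropy(logits, label):
    """-log p_label for a single example with an integer class id."""
    return -log_softmax(logits)[label]

logits, label = [2.0, -1.0, 0.5], 0
probs = [math.exp(lp) for lp in log_softmax(logits)]

# Analytic gradient from the text: p_k - 1[k == y].
analytic = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]

# Central finite differences on each logit.
h = 1e-6
numeric = []
for k in range(len(logits)):
    up = [z + (h if j == k else 0.0) for j, z in enumerate(logits)]
    down = [z - (h if j == k else 0.0) for j, z in enumerate(logits)]
    numeric.append((cross_entropy(up, label) - cross_entropy(down, label)) / (2 * h))

# Shift invariance: adding a constant to every logit leaves the loss unchanged,
# and huge logits do not overflow because exp() only sees values <= 0.
shifted_loss = cross_entropy([z + 1000.0 for z in logits], label)
print(analytic, numeric, cross_entropy(logits, label), shifted_loss)
```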
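Cross entropy can also be read in terms of the KL divergence from the target distribution \(q\) to the model distribution \(p_\theta\):
\[ H(q,p_\theta) = H(q)+\operatorname{KL}(q\Vert p_\theta). \]
When \(q\) is fixed, minimizing CE is equivalent to minimizing \(\operatorname{KL}(q\Vert p_\theta)\). This explains why soft labels, distillation, and label smoothing all fit in the same cross_entropy framework: the target does not have to be one-hot, only a distribution.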
A scoring rule is proper if the expected score is optimized by predicting the true distribution. Cross entropy is proper: in expectation, it encourages calibrated probabilities rather than only correct argmax labels.
Label Smoothing
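A one-hot label is very sharp and can encourage overconfidence. Label smoothing changes the target distribution to
\[ q_k = (1-\epsilon)\mathbf{1}[k=y]+\frac{\epsilon}{K}. \]
The loss is
\[ \mathcal{L} = -\sum_k q_k\log p_k. \]
This lowers the true-class target probability from \(1\) to \(1-\epsilon+\epsilon/K\) and gives each other class \(\epsilon/K\). The effect is to relax the pressure toward large logit margins and often improve calibration, but it can hurt tasks that need extremely high confidence.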
A classifier is calibrated if predictions with confidence \(p\) are correct approximately \(p\) fraction of the time.
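The label-smoothed CE decomposes as
\[ H(q,p) = (1-\epsilon)H(y,p)+\epsilon H(u,p), \qquad u_k=\frac1K. \]
Because
\[ H(u,p)=H(u)+\operatorname{KL}(u\Vert p), \]
the second term acts as an extra penalty on pushing some class probabilities too close to \(0\). This can improve calibration, but it also shrinks the maximum logit margin.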
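The decomposition of the smoothed cross entropy is easy to confirm numerically. The following sketch uses a hypothetical 4-class model output and \(\epsilon=0.1\); all values are illustrative.

```python
import math

def cross_entropy(q, p):
    """H(q, p) = -sum_k q_k log p_k for two discrete distributions."""
    return -sum(qk * math.log(pk) for qk, pk in zip(q, p) if qk > 0.0)

K, label, eps = 4, 1, 0.1
p = [0.05, 0.85, 0.07, 0.03]                      # hypothetical model output
one_hot = [1.0 if k == label else 0.0 for k in range(K)]
uniform = [1.0 / K] * K
smoothed = [(1 - eps) * one_hot[k] + eps / K for k in range(K)]

lhs = cross_entropy(smoothed, p)
rhs = (1 - eps) * cross_entropy(one_hot, p) + eps * cross_entropy(uniform, p)

# KL(u || p) = H(u, p) - H(u): the term that penalizes near-zero probabilities.
kl_uniform = cross_entropy(uniform, p) - math.log(K)
print(smoothed, lhs, rhs, kl_uniform)
```

The smoothed target puts 0.925 on the true class and 0.025 on each other class, and the two sides of the decomposition agree up to floating-point rounding.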
For distillation, the teacher already supplies a soft distribution:
\[ q_k^{T} = \operatorname{softmax}(z^{\text{teacher}}/T)_k. \]
Applying label smoothing on top of it may wash out the dark knowledge in the teacher distribution. In practice, distinguish three cases:
| target | contains class similarity? | smoothing advice |
|---|---|---|
| one-hot label | no | smoothing can help calibration |
| teacher soft label | yes | avoid extra smoothing unless validated |
| noisy weak label | uncertain | smoothing may help but can hide noise |
Class Imbalance and Focal Loss
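When classes are extremely imbalanced, plain CE is dominated by the large number of easy negatives. Focal loss:
\[ \mathcal{L}_{\text{focal}} = -(1-p_y)^\gamma\log p_y. \]
When an example is already classified correctly and \(p_y\) is large, \((1-p_y)^\gamma\) is small and its loss is scaled down; hard examples receive larger relative weight.
Be careful in practice: focal loss is not a universal replacement for class weighting. When label noise is high, it amplifies hard examples, and the hard examples may be exactly the mislabeled ones.
A more direct way to handle class imbalance is class-weighted CE:
\[ \mathcal{L} = -w_y\log p_y. \]
Here \(w_y\) can be the inverse frequency, the effective number of samples, or a task-specified cost. It changes the class weights of the training distribution rather than simply "making the minority class more accurate".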
A cost-sensitive loss assigns different costs to different labels or errors, so empirical risk reflects task utility rather than raw sample frequency.
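The focal modulating factor
\[ (1-p_y)^\gamma \]
down-weights easy examples. With \(\gamma=0\) it reduces to CE; the larger \(\gamma\) is, the more the model focuses on examples that currently have low confidence. An \(\alpha_y\) term is also commonly added:
\[ \mathcal{L}_{\text{focal}} = -\alpha_y(1-p_y)^\gamma\log p_y. \]
In the implementation, do not use a detached \(p_y\) as the weight, or the modulating factor itself receives no gradient, unless you intend plain reweighting rather than the original focal objective.
log_probs = F.log_softmax(logits, dim=-1)
log_pt = log_probs.gather(dim=-1, index=labels[:, None]).squeeze(-1)
pt = log_pt.exp()
# per-example loss of shape [B]; reduce explicitly (mean or sum) afterwards
loss = -alpha[labels] * (1 - pt).pow(gamma) * log_pt
Class imbalance also affects evaluation: a falling training loss does not imply a rising macro-F1. For minority-class tasks, look at least at per-class precision/recall, the confusion matrix, and calibration together.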
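How strongly the focal modulating factor shifts weight toward hard examples can be read off two scalar cases. This sketch takes the true-class probability directly as input; the probabilities 0.95 and 0.10 and the default \(\gamma=2\) are arbitrary illustrative choices.

```python
import math

def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one example, given the probability of its true class."""
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

def ce_loss(p_true):
    return -math.log(p_true)

easy, hard = 0.95, 0.10   # hypothetical true-class probabilities

# gamma = 0 removes the modulating factor and recovers cross entropy.
assert focal_loss(easy, gamma=0.0) == ce_loss(easy)

# Ratio hard/easy: how strongly each loss concentrates on the hard example.
ce_ratio = ce_loss(hard) / ce_loss(easy)
focal_ratio = focal_loss(hard) / focal_loss(easy)
print(ce_ratio, focal_ratio)
```

The hard-to-easy ratio is far larger under focal loss than under CE, which is the intended effect and also the reason mislabeled hard examples get amplified.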
Contrastive Loss and InfoNCE
Given a query \(q\), a positive key \(k^+\), and negative keys \(\{k_j^-\}\), InfoNCE is
\[ \mathcal{L} = - \log \frac{\exp(\operatorname{sim}(q,k^+)/\tau)} {\exp(\operatorname{sim}(q,k^+)/\tau) + \sum_j\exp(\operatorname{sim}(q,k_j^-)/\tau)}. \]
This is just a cross entropy whose classes are "which key is the positive".
If every pair \((q_i,k_i)\) in the batch is a positive pair and the other keys act as in-batch negatives, the logits are
\[ S_{ij} = \frac{q_i^\top k_j}{\tau}. \]
The labels are
\[ y_i=i. \]
PyTorch:
q = torch.nn.functional.normalize(q, dim=-1)
k = torch.nn.functional.normalize(k, dim=-1)
logits = q @ k.T / temperature # [B, B]
labels = torch.arange(q.size(0), device=q.device)
loss = torch.nn.functional.cross_entropy(logits, labels)
In-batch contrastive learning treats other examples in the batch as negatives. This can be false when the batch contains semantically equivalent positives.
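Temperature \(\tau\) controls the logit scale:
\[ S_{ij}=\frac{q_i^\top k_j}{\tau}. \]
A small \(\tau\) amplifies similarity differences and sharpens the softmax; a large \(\tau\) flattens the distribution. The gradient can be written as
\[ \frac{\partial \mathcal{L}_i}{\partial S_{ij}} = p_{ij}-\mathbf{1}[j=i]. \]
With respect to the similarity \(s_{ij}=q_i^\top k_j\), it becomes
\[ \frac{\partial \mathcal{L}_i}{\partial s_{ij}} = \frac{1}{\tau} \left(p_{ij}-\mathbf{1}[j=i]\right). \]
So temperature changes not only the probabilities but also directly scales the gradient. Too small a temperature can concentrate large gradients on a few hard negatives and destabilize training; too large a temperature leaves negatives poorly separated.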
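The \(1/\tau\) factor in the gradient can be made concrete for a single query. The sketch below assumes four hypothetical cosine similarities with the positive at index 0; the function names are illustrative.

```python
import math

def info_nce(sims, positive, tau):
    """InfoNCE for one query: cross entropy over similarities / tau."""
    logits = [s / tau for s in sims]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[positive]

def grad_wrt_sims(sims, positive, tau):
    """Analytic gradient from the text: (p_j - 1[j == positive]) / tau."""
    logits = [s / tau for s in sims]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [(e / total - (1.0 if j == positive else 0.0)) / tau
            for j, e in enumerate(exps)]

sims, positive = [0.8, 0.6, 0.1, -0.3], 0   # hypothetical cosine similarities

for tau in (1.0, 0.1, 0.01):
    grads = grad_wrt_sims(sims, positive, tau)
    print(tau, info_nce(sims, positive, tau), [round(g, 4) for g in grads])
```

As \(\tau\) shrinks, the gradient mass moves onto the hardest negative (similarity 0.6) while the easy negatives receive almost none, which matches the stability concern described above.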
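In multi-GPU training, in-batch negatives usually have to be gathered across devices:
q = F.normalize(q, dim=-1)
k = F.normalize(k, dim=-1)
# all_gather_with_grad is assumed to be a user-defined differentiable all-gather helper.
k_all = all_gather_with_grad(k) # [world_size * B, D]
logits = q @ k_all.T / temperature
labels = rank * q.size(0) + torch.arange(q.size(0), device=q.device)
loss = F.cross_entropy(logits, labels)
There are two pitfalls here:
- labels must include the rank offset;
- if the key encoder should also receive cross-device gradients, the gather must not cut the gradient.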
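The rank-offset pitfall can be reproduced without any process group. The sketch below simulates the gather in plain Python, assuming 3 ranks with an equal local batch of 2 and keys concatenated in rank order; keys are represented by `(rank, index)` tuples purely for illustration.

```python
# Plain-Python simulation of gathering keys across ranks (no real process group).
# Hypothetical setup: 3 ranks, local batch of 2, keys identified by (rank, index).
world_size, local_batch = 3, 2
local_keys = [
    [(rank, i) for i in range(local_batch)] for rank in range(world_size)
]

# An all-gather concatenates the per-rank tensors in rank order.
k_all = [key for rank_keys in local_keys for key in rank_keys]

def labels_for_rank(rank, with_offset=True):
    """Index of each local query's positive key inside the gathered k_all."""
    offset = rank * local_batch if with_offset else 0
    return [offset + i for i in range(local_batch)]

for rank in range(world_size):
    good = [k_all[j] for j in labels_for_rank(rank)]
    bad = [k_all[j] for j in labels_for_rank(rank, with_offset=False)]
    print(rank, good == local_keys[rank], bad == local_keys[rank])
```

Without the offset, only rank 0 points at its own keys; every other rank silently trains against rank 0's keys as positives.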
Pairwise Ranking and Preference Loss
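In many tasks the supervision is not an absolute label but a statement that "\(a\) is better than \(b\)". If the model assigns a score \(s_\theta(x,y)\), the pairwise logistic loss is
\[ \mathcal{L} = -\log\sigma(s_\theta(x,y^+)-s_\theta(x,y^-)). \]
This is the NLL of the Bradley-Terry preference model, in which the win probability between two candidates is
\[ P(y^+\succ y^-\mid x) = \sigma(s_\theta(x,y^+)-s_\theta(x,y^-)). \]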
Pairwise logistic loss trains a scoring function so preferred items receive higher scores than dispreferred items, using a sigmoid model over score differences.
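When \(s^+-s^-\) is already large, the gradient is small; when the negative scores higher, the gradient is strong. This makes it a useful foundation for understanding reranking, reward models, and preference optimization.
A common implementation:
margin = chosen_scores - rejected_scores
loss = -F.logsigmoid(margin).mean()
If the scores come from sequence log-probs, first decide whether the sequence score is a sum or a mean:
\[ s_{\text{sum}}(y)=\sum_t \log p_\theta(y_t\mid x,y_{<t}), \qquad s_{\text{mean}}(y)=\frac1T\sum_t \log p_\theta(y_t\mid x,y_{<t}). \]
The sum favors short answers, because every token log-prob is non-positive and each extra token can only lower the total; the mean weakens the length effect but changes the preference semantics. DPO, reward modeling, and reranking all need this choice made explicitly.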
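The length bias of sum versus mean scoring shows up with two made-up answers. In this sketch the per-token log-probabilities are hypothetical: a short answer with a mediocre per-token fit and a long answer with a better one.

```python
# Hypothetical per-token log-probabilities for two candidate answers.
short_answer = [-0.9, -0.9, -0.9]   # 3 tokens, mediocre per-token fit
long_answer = [-0.4] * 10           # 10 tokens, better per-token fit

def sum_score(token_logps):
    """Sequence score as the total log-probability."""
    return sum(token_logps)

def mean_score(token_logps):
    """Sequence score as the length-normalized log-probability."""
    return sum(token_logps) / len(token_logps)

print(sum_score(short_answer), sum_score(long_answer))
print(mean_score(short_answer), mean_score(long_answer))
```

The two scoring rules rank the same pair in opposite orders, so a preference loss built on one is not interchangeable with a loss built on the other.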
Sequence Loss: Token vs. Sequence Normalization
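For language models, the common token-level loss is
\[ \mathcal{L} = \frac{1}{\sum_i T_i} \sum_i\sum_{t=1}^{T_i} -\log p_\theta(x_{i,t}\mid x_{i,<t}). \]
Alternatively, average within each sequence first and then over the batch:
\[ \mathcal{L} = \frac{1}{B} \sum_i \frac{1}{T_i} \sum_{t=1}^{T_i} -\log p_\theta(x_{i,t}\mid x_{i,<t}). \]
The two are different. Token-normalized loss lets long examples contribute more gradient; sequence-normalized loss gives every example equal weight. In SFT, DPO, summarization, and long-context training this affects what the model is biased toward.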
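The two normalizations can be compared directly on a toy batch. The sketch below assumes one long low-loss sequence and one short high-loss sequence, with made-up per-token NLL values.

```python
# Hypothetical batch: one long sequence with low loss, one short with high loss.
batch = [
    [1.0] * 8,   # sequence 0: eight tokens, per-token NLL 1.0
    [3.0] * 2,   # sequence 1: two tokens, per-token NLL 3.0
]

total_tokens = sum(len(seq) for seq in batch)
token_normalized = sum(sum(seq) for seq in batch) / total_tokens
sequence_normalized = sum(sum(seq) / len(seq) for seq in batch) / len(batch)

# Weight that a single token receives under each scheme, listed per sequence.
token_weight_token_norm = [1 / total_tokens for _ in batch]
token_weight_seq_norm = [1 / (len(batch) * len(seq)) for seq in batch]

print(token_normalized, sequence_normalized)
print(token_weight_token_norm, token_weight_seq_norm)
```

Under sequence normalization a token in the short sequence carries four times the weight of a token in the long one, while token normalization weights every token equally.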
Loss Masking
In LLM SFT, prompt tokens are usually excluded from the loss and only the assistant answer is trained:
input: <user> ... <assistant> answer tokens
labels: -100 ... -100 answer tokens
After masking, the objective is
\[ \mathcal{L} = \frac{1}{|\mathcal{M}|} \sum_{t\in\mathcal{M}} -\log p_\theta(x_t\mid x_{<t}), \]
where \(\mathcal{M}\) is the set of positions to train on.
Attention mask controls what tokens can be read. Loss mask controls which positions contribute gradients. Padding, prompts, and packed examples often need both.
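The prompt-masking layout above can be sketched as a small label-building helper. `build_sft_labels` is a hypothetical function written for illustration; it assumes the prompt occupies the first `prompt_len` positions and that `pad_id` never occurs inside the answer.

```python
IGNORE_INDEX = -100  # same sentinel value as ignore_index in the text

def build_sft_labels(token_ids, prompt_len, pad_id):
    """Copy token_ids into labels, masking prompt and padding positions.

    Hypothetical helper: assumes the prompt occupies the first prompt_len
    positions and that pad_id never appears inside the answer.
    """
    labels = []
    for pos, tok in enumerate(token_ids):
        if pos < prompt_len or tok == pad_id:
            labels.append(IGNORE_INDEX)
        else:
            labels.append(tok)
    return labels

token_ids = [11, 12, 13, 41, 42, 43, 0, 0]   # 3 prompt, 3 answer, 2 pad tokens
labels = build_sft_labels(token_ids, prompt_len=3, pad_id=0)
n_trained = sum(1 for label in labels if label != IGNORE_INDEX)
print(labels, n_trained)
```

The one-position shift between inputs and targets is a separate step and is not handled here; whether the model or the caller applies it depends on the training code.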
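When implementing a masked token loss, an explicit numerator/denominator is recommended:
loss_flat = F.cross_entropy(
    logits.reshape(-1, vocab),
    labels.reshape(-1),
    ignore_index=ignore_index,  # keep in sync with the mask below
    reduction="none",
).view_as(labels)
mask = labels.ne(ignore_index)
loss = (loss_flat * mask).sum() / mask.sum().clamp_min(1)
The ignore_index version is more concise, but a hand-written numerator/denominator is easier to audit under multi-GPU training, gradient accumulation, and packing.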
CTC and Latent Alignment
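Some sequence tasks have no token-level alignment. In speech recognition, for example, the input acoustic frames are much longer than the output characters, and it is unknown which frame each character corresponds to. Connectionist Temporal Classification (CTC) treats the alignment as a latent variable.
Given input \(x\) and target sequence \(y\), CTC maximizes the total probability of all paths that collapse to \(y\):
\[ p(y\mid x) = \sum_{\pi\in\mathcal{B}^{-1}(y)} \prod_{t=1}^{T}p_\theta(\pi_t\mid x). \]
Here \(\pi_t\) is a frame-level label that may be blank, and \(\mathcal{B}\) merges repeated tokens and then removes blanks. The loss is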
\[ \mathcal{L}_{\text{CTC}} = -\log p(y\mid x). \]
A latent alignment objective sums or maximizes over unknown alignments between inputs and targets instead of requiring one observed label per timestep.
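CTC is usually implemented with dynamic programming, because enumerating every \(\pi\) directly is infeasible. The PyTorch interface takes log-probs and lengths:
log_probs = F.log_softmax(frame_logits, dim=-1) # [T, B, C]
loss = F.ctc_loss(
    log_probs,
    targets,
    input_lengths,
    target_lengths,
    blank=blank_id,
    zero_infinity=True,
)
The CTC pitfalls are equally concrete: log_probs has shape [T, B, C], not [B, T, C]; the input length must be at least what the target needs in order to collapse correctly (the target length plus one blank for every pair of adjacent repeated tokens); and the blank id must not collide with a real token.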
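The collapse mapping \(\mathcal{B}\) and the input-length requirement can be checked by brute force on a tiny case. This sketch enumerates every frame-level path, which is exponential in \(T\) and only suitable as a sanity check; it assumes three classes (blank = 0) with uniform frame probabilities, and the function names are illustrative.

```python
import itertools
import math

BLANK = 0

def collapse(path):
    """CTC mapping B: merge repeated labels, then drop blanks."""
    merged = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return [s for s in merged if s != BLANK]

def ctc_prob_bruteforce(frame_probs, target):
    """Sum the probability of every frame-level path that collapses to target.

    Exponential in T; only for checking tiny cases, never for training.
    """
    n_classes = len(frame_probs[0])
    total = 0.0
    for path in itertools.product(range(n_classes), repeat=len(frame_probs)):
        if collapse(path) == list(target):
            total += math.prod(frame_probs[t][s] for t, s in enumerate(path))
    return total

uniform = [1 / 3, 1 / 3, 1 / 3]   # classes: blank=0, 'a'=1, 'b'=2
p_two_frames = ctc_prob_bruteforce([uniform] * 2, [1, 1])
p_three_frames = ctc_prob_bruteforce([uniform] * 3, [1, 1])
print(collapse([1, 1, 0, 1, 2, 2]), p_two_frames, p_three_frames)
```

With only two frames no path collapses to the repeated target, so its probability is zero and the loss is infinite; this is the situation that zero_infinity=True guards against. With three frames exactly one path (a, blank, a) is valid.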
Objective vs. Metric
The training objective and the evaluation metric do not necessarily agree:
| Task | Common training loss | Common metric |
|---|---|---|
| classification | CE | accuracy/F1/AUROC |
| language modeling | token NLL | perplexity/downstream eval |
| retrieval | InfoNCE | Recall@K/MRR |
| generation | CE/SFT/DPO | human preference/win rate |
| regression | MSE/MAE | RMSE/MAE/R2 |
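When the objective and the metric disagree, treat the loss as a surrogate objective. It is optimizable, differentiable, and stable, but it is not necessarily equal to the target you actually care about.
Common mismatches: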
| mismatch | consequence | mitigation |
|---|---|---|
| CE vs F1 | majority class dominates | class weights, threshold tuning |
| token NLL vs answer quality | high per-token likelihood may not imply helpfulness | SFT data quality, preference tuning |
| MSE vs rank correlation | good average error but bad ordering | pairwise/ranking loss |
| InfoNCE vs Recall@K | softmax over batch not same as corpus retrieval | hard negatives, large/global negatives |
| reward loss vs human preference | reward model overfits annotation artifacts | held-out preference eval |
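The advantage of a surrogate loss is that it is stable and differentiable; the drawback is that it may optimize toward behavior the metric does not want. Training logs should therefore record at least both the objective and the task metric; looking at the loss alone makes misjudgment easy.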
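A two-model toy comparison shows how a lower loss can coexist with a worse metric. The true-class probabilities below are invented for illustration, and accuracy is computed for the binary case where a prediction counts as correct when the true-class probability exceeds 0.5.

```python
import math

def mean_nll(probs_of_true_class):
    """Average negative log-likelihood of the true class."""
    return -sum(math.log(p) for p in probs_of_true_class) / len(probs_of_true_class)

def accuracy(probs_of_true_class, threshold=0.5):
    """Binary case: a prediction is correct when p(true class) > threshold."""
    return sum(p > threshold for p in probs_of_true_class) / len(probs_of_true_class)

# Hypothetical true-class probabilities on the same four validation examples.
model_a = [0.99, 0.99, 0.99, 0.45]   # very confident, one near miss
model_b = [0.55, 0.55, 0.55, 0.55]   # barely right on everything

print(mean_nll(model_a), accuracy(model_a))
print(mean_nll(model_b), accuracy(model_b))
```

Model A has the lower NLL but only 75% accuracy, while model B has the higher NLL and 100% accuracy, so ranking checkpoints by loss alone would pick the less accurate one here.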
Engineering Checklist
| Check | Why |
|---|---|
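| logits vs. probabilities | CE/BCEWithLogits expect logits |
| label dtype | CE labels are long, BCE targets are float |
| shape | [B,K] and [B,T,V] are easy to mix up |
| ignore index | whether pad/prompt tokens are masked |
| reduction | token mean or sequence mean |
| class weights | whether class imbalance needs handling |
| temperature | whether the contrastive logit scale is appropriate |
| distributed loss | whether the denominator is globally consistent across GPUs |
A loss function is the contract for model training. Write the contract wrong and the loss may still go down, but the model is not learning what you think it is.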
Implementation Checklist
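When implementing or debugging an objective, check in this order:
- whether CE/BCE receives logits or probabilities;
- whether the label dtype is correct: long for CE, float for BCE;
- whether the shapes of logits and labels exactly match the task semantics;
- whether the language model applies the correct label shift;
- whether padding, prompts, and packed document boundaries are covered by the loss mask;
- whether the loss denominator is samples, sequences, tokens, or valid tokens;
- whether gradient accumulation and DDP use a globally consistent denominator;
- whether class weights, focal loss, or pos_weight have changed calibration;
- whether contrastive labels include the rank offset and the gather keeps the gradients you need;
- whether the preference/ranking score is a token sum or a token mean;
- whether the CTC/latent-alignment input lengths, target lengths, and blank id are correct;
- whether the training objective and the validation metric are both logged.
Three smoke tests: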
import torch
import torch.nn.functional as F
# 1. cross entropy should reject probability-shaped thinking:
# logits can be any real values, labels are class ids.
logits = torch.randn(4, 7)
labels = torch.tensor([0, 3, 2, 6])
loss = F.cross_entropy(logits, labels)
assert loss.ndim == 0
# 2. LM label shift preserves one fewer timestep
# (`model` and `vocab` are placeholders for your LM and its vocabulary size)
idx = torch.randint(0, vocab, (2, 8))
inp, tgt = idx[:, :-1], idx[:, 1:]
logits = model(inp).logits
assert logits.shape[:2] == tgt.shape
# 3. masked denominator counts only valid tokens
labels = torch.tensor([[1, 2, -100], [3, -100, -100]])
mask = labels.ne(-100)
assert mask.sum().item() == 3
These tests are tiny, but they map directly onto the most common objective bugs: feeding probabilities to a logits loss, misaligned LM targets, and counting padding/prompt tokens in the denominator.