3.7 Advanced Convex Optimization Methods
凸优化不是因为深度学习 loss 是凸的才重要。它重要是因为它给 optimizer 提供语言:smoothness 决定步长,strong convexity 决定曲率下界,projection 处理约束,proximal operator 处理不可微正则,mirror descent 改变几何,Newton 方法利用二阶曲率。
Smoothness and Descent Lemma
A differentiable function \(f\) is \(L\)-smooth if \[ \|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\| \] for all \(x,y\).
若 \(f\) 是 \(L\)-smooth,则:
\[ f(y) \leq f(x)+\nabla f(x)^\top(y-x)+\frac{L}{2}\|y-x\|^2. \]
令 \(g(t)=f(x+t(y-x))\),则
\[ f(y)-f(x) = \int_0^1\nabla f(x+t(y-x))^\top(y-x)dt. \]
加减 \(\nabla f(x)\):
\[ = \nabla f(x)^\top(y-x) + \int_0^1 \left(\nabla f(x+t(y-x))-\nabla f(x)\right)^\top(y-x)dt. \]
由 Cauchy-Schwarz 和 smoothness:
\[ \left| \left(\nabla f(x+t(y-x))-\nabla f(x)\right)^\top(y-x) \right| \leq Lt\|y-x\|^2. \]
积分 \(\int_0^1Lt\,dt=L/2\),得到结论。
Gradient Descent Rate
Gradient descent:
\[ x_{k+1}=x_k-\eta\nabla f(x_k). \]
若 \(f\) convex 且 \(L\)-smooth,取 \(\eta=1/L\),则:
\[ f(x_k)-f(x^\star) \leq \frac{L\|x_0-x^\star\|^2}{2k}. \]
由 descent lemma:
\[ f(x^+)\leq f(x)-\frac{1}{2L}\|\nabla f(x)\|^2, \qquad x^+=x-\frac{1}{L}\nabla f(x). \]
展开距离:
\[ \|x^+-x^\star\|^2 = \left\|x-x^\star-\frac{1}{L}\nabla f(x)\right\|^2. \]
结合 convexity:
\[ f(x)-f(x^\star) \leq \nabla f(x)^\top(x-x^\star), \]
可得
\[ f(x^+)-f(x^\star) \leq \frac{L}{2} \left( \|x-x^\star\|^2-\|x^+-x^\star\|^2 \right). \]
对 \(k\) 次迭代求和,距离项 telescoping,得到 \(O(1/k)\) 收敛率。
Backtracking Line Search
理论里常写 \(\eta=1/L\),但实际问题很少知道精确 \(L\)。Backtracking line search 用局部下降条件自适应找步长。给定下降方向 \(d_k\),从初始步长 \(\eta_0\) 开始,不断缩小:
\[ \eta\leftarrow \beta\eta, \qquad 0<\beta<1, \]
直到 Armijo condition 成立:
\[ f(x_k+\eta d_k) \leq f(x_k)+c\eta\nabla f(x_k)^\top d_k, \qquad 0<c<1. \]
对 gradient descent,\(d_k=-\nabla f(x_k)\),右侧是:
\[ f(x_k)-c\eta\|\nabla f(x_k)\|^2. \]
所以 line search 要求真实函数值至少下降到一个“按梯度大小预测的保守下降量”。如果 \(f\) 是 \(L\)-smooth,descent lemma 给:
\[ f(x-\eta\nabla f(x)) \leq f(x)-\eta\left(1-\frac{L\eta}{2}\right)\|\nabla f(x)\|^2. \]
当 \(\eta\leq 2(1-c)/L\) 时,Armijo 条件一定成立。因此 backtracking 会在有限次缩小后停止。
The Armijo condition accepts a step when the actual objective decrease is at least a fixed fraction of the linearized decrease predicted by the descent direction.
def backtracking(f, grad, x, direction, eta, beta, c):
fx = f(x)
gtd = (grad * direction).sum()
while f(x + eta * direction) > fx + c * eta * gtd:
eta *= beta
return eta深度学习主训练很少对每个 mini-batch 做 line search,因为多次 forward 太贵且 stochastic loss 噪声大。但 convex solver、L-BFGS、small-data fine-tuning、adversarial inner maximization 都会用到这个思想:学习率不是玄学常数,而是通过下降条件被局部验证。
Strong Convexity
A differentiable function \(f\) is \(\mu\)-strongly convex if \[ f(y) \geq f(x)+\nabla f(x)^\top(y-x)+\frac{\mu}{2}\|y-x\|^2. \]
若 \(f\) 二阶可微,则 strong convexity 等价于:
\[ \nabla^2 f(x)\succeq \mu I. \]
对于 \(\mu\)-strongly convex 且 \(L\)-smooth 的函数,gradient descent 线性收敛:
\[ \|x_k-x^\star\|^2 \leq \left(1-\frac{\mu}{L}\right)^k\|x_0-x^\star\|^2. \]
条件数
\[ \kappa=\frac{L}{\mu} \]
决定优化难度。\(\kappa\) 大时,loss landscape 像狭长 valley,普通 GD 会在两侧震荡。
Condition Number and Preconditioning
对二次函数
\[ f(x)=\frac{1}{2}x^\top A x-b^\top x, \qquad A\succ0, \]
\(L=\lambda_{\max}(A)\),\(\mu=\lambda_{\min}(A)\),所以:
\[ \kappa(A)=\frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}. \]
梯度下降更新误差 \(e_k=x_k-x^\star\):
\[ e_{k+1} = (I-\eta A)e_k. \]
若沿 \(A\) 的特征向量分解 \(e_k=\sum_i c_{k,i}u_i\),则每个方向独立收缩:
\[ c_{k+1,i} = (1-\eta\lambda_i)c_{k,i}. \]
这就是 ill-conditioning 的根源:同一个 learning rate \(\eta\) 要同时照顾大曲率方向和小曲率方向。\(\eta\) 太大,大曲率方向震荡;\(\eta\) 太小,小曲率方向进展很慢。
最优常数步长是:
\[ \eta^\star=\frac{2}{L+\mu}, \]
收缩因子为:
\[ \rho^\star = \frac{\kappa-1}{\kappa+1}. \]
当 \(\kappa\) 很大时,
\[ \rho^\star\approx 1-\frac{2}{\kappa}, \]
所以需要 \(O(\kappa\log(1/\epsilon))\) 步。预条件的思想是找一个矩阵 \(P\succ0\),改成:
\[ x_{k+1}=x_k-\eta P^{-1}\nabla f(x_k). \]
若 \(P\approx A\),更新空间里的有效 Hessian 变成 \(P^{-1/2}AP^{-1/2}\),条件数变小。Adam、RMSProp、Shampoo、K-FAC、natural gradient 都可以从这个角度理解:它们在构造便宜的、近似的、可在线更新的 preconditioner。
A preconditioner changes the geometry of gradient steps so that directions with different curvature are scaled more evenly.
Accelerated Gradient
Nesterov accelerated gradient:
\[ y_k=x_k+\beta_k(x_k-x_{k-1}), \]
\[ x_{k+1}=y_k-\frac{1}{L}\nabla f(y_k). \]
对 smooth convex function,它可达到:
\[ f(x_k)-f(x^\star)=O\left(\frac{1}{k^2}\right), \]
优于普通 GD 的 \(O(1/k)\)。直觉是利用历史方向,但严格推导依赖 estimate sequence,而不是简单物理动量类比。
Potential View of Acceleration
一个简化的理解是:GD 只控制函数值下降,而 accelerated methods 同时维护“函数值误差 + 动量距离项”的势能。典型势能形状是:
\[ \Phi_k = A_k\left(f(x_k)-f(x^\star)\right) +\frac{1}{2}\|v_k-x^\star\|^2. \]
如果能设计 \(A_k\)、\(v_k\) 和更新规则,使得:
\[ \Phi_{k+1}\leq \Phi_k, \]
且 \(A_k\) 增长为 \(O(k^2/L)\),就得到:
\[ f(x_k)-f(x^\star) \leq \frac{\Phi_0}{A_k} = O\left(\frac{L\|x_0-x^\star\|^2}{k^2}\right). \]
这比“速度 + 摩擦”的比喻更精确:acceleration 的关键是选择一个会 telescope 的势能,而不是简单把上一步方向加进去。在非凸深度网络中,Nesterov/momentum 仍然常用,但它的理论保证不再自动成立;学习率、warmup 和 gradient clipping 往往是在替这个势能条件兜底。
Acceleration improves convex rates under assumptions, but in nonconvex deep learning large momentum may amplify oscillations when curvature changes quickly.
Subgradient Method
很多 convex objective 不可微,例如 hinge loss、\(\ell_1\) regularization、max-margin loss。此时不能用 ordinary gradient,但可以用 subgradient。
For a convex function \(f\), a vector \(g\in\partial f(x)\) is a subgradient at \(x\) if \[ f(y)\geq f(x)+g^\top(y-x) \] for all \(y\).
Subgradient descent:
\[ x_{k+1}=x_k-\eta_k g_k, \qquad g_k\in\partial f(x_k). \]
和 smooth GD 不同,subgradient step 不保证每一步下降。它的理论依赖 averaged iterate:
\[ \bar{x}_K = \frac{\sum_{k=0}^{K-1}\eta_k x_k} {\sum_{k=0}^{K-1}\eta_k}. \]
若 \(f\) convex,\(\|g_k\|\leq G\),且 \(\|x_0-x^\star\|\leq R\),取常数步长 \(\eta=R/(G\sqrt{K})\),则:
\[ f(\bar{x}_K)-f(x^\star) \leq \frac{RG}{\sqrt{K}}. \]
展开距离:
\[ \|x_{k+1}-x^\star\|^2 = \|x_k-x^\star\|^2 -2\eta_k g_k^\top(x_k-x^\star) +\eta_k^2\|g_k\|^2. \]
由 convexity 的 subgradient inequality:
\[ f(x_k)-f(x^\star) \leq g_k^\top(x_k-x^\star). \]
所以
\[ 2\eta_k(f(x_k)-f(x^\star)) \leq \|x_k-x^\star\|^2-\|x_{k+1}-x^\star\|^2 +\eta_k^2G^2. \]
对 \(k\) 求和,距离项 telescoping,再用 Jensen inequality 把加权平均 iterate 拉出来,就得到 \(O(1/\sqrt{K})\) rate。
这比 smooth convex GD 的 \(O(1/K)\) 慢很多。直觉是:不可微点附近没有稳定的局部二次上界,subgradient 只能告诉你一个支撑超平面方向,而不是精确曲率。
The classical subgradient rate assumes bounded subgradients and bounded distance to optimum. In deep networks, neither assumption is automatic.
Projected Gradient Descent
约束问题:
\[ \min_{x\in\mathcal{C}} f(x). \]
Projected GD:
\[ x_{k+1} = \Pi_{\mathcal{C}} \left(x_k-\eta\nabla f(x_k)\right), \]
其中
\[ \Pi_{\mathcal{C}}(z) = \arg\min_{x\in\mathcal{C}}\frac{1}{2}\|x-z\|^2. \]
projection 的一阶最优条件:
\[ (z-\Pi_{\mathcal{C}}(z))^\top(x-\Pi_{\mathcal{C}}(z))\leq0, \qquad \forall x\in\mathcal{C}. \]
gradient clipping 可以看成把梯度缩放到 \(\ell_2\) ball:
\[ g_{\text{clip}} = g\cdot\min\left(1,\frac{\tau}{\|g\|_2}\right). \]
Common Projection Operators
Projection 不是一句“投回去”就结束。不同 feasible set 的 projection 复杂度差别很大,直接决定 projected GD 是否可用。
| Set \(\mathcal{C}\) | Projection \(\Pi_{\mathcal{C}}(z)\) | Cost |
|---|---|---|
| box \([l,u]^d\) | \(\min(u,\max(l,z))\) coordinatewise | \(O(d)\) |
| \(\ell_2\) ball \(\{x:\|x\|_2\leq r\}\) | \(z\min(1,r/\|z\|_2)\) | \(O(d)\) |
| simplex \(\{x:x_i\geq0,\sum_i x_i=1\}\) | \(\max(z_i-\tau,0)\) with threshold \(\tau\) | \(O(d\log d)\) |
| affine set \(\{x:Ax=b\}\) | \(z-A^\top(AA^\top)^{-1}(Az-b)\) | linear solve |
For projected gradient descent, the projected gradient mapping is \[ G_\eta(x)=\frac{1}{\eta}\left(x-\Pi_{\mathcal{C}}(x-\eta\nabla f(x))\right). \] It generalizes the gradient as a stationarity measure under constraints.
若 \(G_\eta(x)=0\),则 \(x=\Pi_{\mathcal{C}}(x-\eta\nabla f(x))\)。Projection 的一阶条件给:
\[ \nabla f(x)^\top(y-x)\geq0, \qquad \forall y\in\mathcal{C}, \]
这正是 constrained convex optimum 的一阶条件。所以训练/求解时只看 raw gradient norm 不够;约束问题应看 projected gradient mapping 或 KKT residual。
Simplex projection 是分类概率、mixture weights、attention distribution 后处理里最常见的投影之一。解的形状来自 KKT:
\[ x_i^\star=\max(z_i-\tau,0), \qquad \sum_i x_i^\star=1. \]
阈值 \(\tau\) 可以通过排序找到:
import torch
def project_simplex(z):
# Projects each row of z onto {x >= 0, sum(x) = 1}.
zs = torch.sort(z, dim=-1, descending=True).values
cssv = zs.cumsum(dim=-1) - 1
idx = torch.arange(1, z.size(-1) + 1, device=z.device)
cond = zs - cssv / idx > 0
rho = cond.sum(dim=-1, keepdim=True).clamp_min(1)
tau = cssv.gather(-1, rho - 1) / rho
return (z - tau).clamp_min(0)Projected GD is attractive only when \(\Pi_{\mathcal{C}}\) is cheap or structured. If projection requires a large dense solve every step, a penalty, barrier, or Frank-Wolfe formulation may be more practical.
Frank-Wolfe / Conditional Gradient
Projected GD 的难点是 projection:
\[ \Pi_{\mathcal{C}}(z)=\arg\min_{x\in\mathcal{C}}\|x-z\|^2. \]
如果 \(\mathcal{C}\) 是 probability simplex、nuclear norm ball、matching polytope 等复杂集合,projection 可能很贵。Frank-Wolfe 用 linear minimization oracle 替代 projection:
\[ s_k = \arg\min_{s\in\mathcal{C}} \nabla f(x_k)^\top s. \]
更新:
\[ x_{k+1} = (1-\gamma_k)x_k+\gamma_k s_k. \]
因为 \(x_k,s_k\in\mathcal{C}\) 且 \(\mathcal{C}\) convex,\(x_{k+1}\) 自动 feasible。
The Frank-Wolfe gap is \[ g_{\text{FW}}(x) = \max_{s\in\mathcal{C}} \nabla f(x)^\top(x-s). \] It is a computable optimality certificate for constrained convex minimization.
若 \(g_{\text{FW}}(x)=0\),则
\[ \nabla f(x)^\top(s-x)\geq0, \qquad \forall s\in\mathcal{C}, \]
这正是 constrained convex optimum 的一阶条件。
convex constrained optimum 的一阶必要充分条件是:
\[ \nabla f(x^\star)^\top(s-x^\star)\geq0, \qquad \forall s\in\mathcal{C}. \]
若 \(g_{\text{FW}}(x)=0\),则
\[ \max_{s\in\mathcal{C}}\nabla f(x)^\top(x-s)=0, \]
所以对所有 \(s\),
\[ \nabla f(x)^\top(x-s)\leq0, \]
等价于 \(\nabla f(x)^\top(s-x)\geq0\)。由 convexity,\(x\) 是全局最优。
Frank-Wolfe 在 ML 中常见于 sparse/structured constraints,因为每一步会朝一个 extreme point 走,天然产生稀疏组合。
Proximal Gradient
考虑 smooth + nonsmooth:
\[ \min_x F(x)=f(x)+\lambda R(x). \]
The proximal operator of \(R\) is \[ \operatorname{prox}_{\eta\lambda R}(z) = \arg\min_x \left\{ \frac{1}{2\eta}\|x-z\|^2+\lambda R(x) \right\}. \]
Proximal gradient:
\[ x_{k+1} = \operatorname{prox}_{\eta\lambda R} \left(x_k-\eta\nabla f(x_k)\right). \]
L1 Soft Thresholding
若 \(R(x)=\|x\|_1\),prox 按坐标分解:
\[ \min_u \frac{1}{2\eta}(u-z)^2+\lambda|u|. \]
解为:
\[ \operatorname{prox}_{\eta\lambda\|\cdot\|_1}(z) = \operatorname{sign}(z)\max(|z|-\eta\lambda,0). \]
对 \(u>0\),导数为:
\[ \frac{1}{\eta}(u-z)+\lambda=0 \Rightarrow u=z-\eta\lambda. \]
要求 \(z>\eta\lambda\)。对 \(u<0\):
\[ \frac{1}{\eta}(u-z)-\lambda=0 \Rightarrow u=z+\eta\lambda, \]
要求 \(z<-\eta\lambda\)。若 \(|z|\leq\eta\lambda\),最优点在 \(u=0\),因为
\[ 0\in \frac{1}{\eta}(0-z)+\lambda[-1,1]. \]
这解释了 L1 regularization 为什么产生稀疏性:小坐标会被直接压成 0。
ISTA and FISTA in Code
对 composite objective
\[ F(x)=f(x)+\lambda\|x\|_1 \]
若 \(f\) 是 \(L\)-smooth,ISTA 使用 \(\eta\leq1/L\):
\[ x_{k+1} = S_{\eta\lambda}(x_k-\eta\nabla f(x_k)). \]
这个更新不是“先普通 GD 再随便裁剪”,而是在最小化局部二次上界:
\[ x_{k+1} = \arg\min_x \left[ f(x_k)+\nabla f(x_k)^\top(x-x_k) +\frac{1}{2\eta}\|x-x_k\|^2 +\lambda\|x\|_1 \right]. \]
因此它保持了 smooth loss 的下降信息,同时用 prox 精确处理 nonsmooth regularizer。
def soft_threshold(z, tau):
return z.sign() * (z.abs() - tau).clamp_min(0)
def ista_step(x, grad, lr, lam):
return soft_threshold(x - lr * grad, lr * lam)FISTA 加速的常见形式:
\[ y_k=x_k+\frac{t_{k-1}-1}{t_k}(x_k-x_{k-1}), \qquad t_{k+1}=\frac{1+\sqrt{1+4t_k^2}}{2}, \]
然后在 \(y_k\) 上做 prox-gradient step。理论上可从 \(O(1/k)\) 提升到 \(O(1/k^2)\),但实际实现要注意 restart:如果动量让 objective 上升,常重置 \(t=1\)、\(y=x\)。
def fista_step(x, x_prev, t, grad_at_y, lr, lam):
x_next = soft_threshold(x - lr * grad_at_y, lr * lam)
t_next = 0.5 * (1.0 + (1.0 + 4.0 * t * t) ** 0.5)
beta = (t - 1.0) / t_next
y_next = x_next + beta * (x_next - x)
return x_next, y_next, t_nextComposite optimization 的 stationarity 也不能只看 \(\|\nabla f(x)\|\)。应该看 proximal gradient mapping:
\[ G_\eta^{\text{prox}}(x) = \frac{1}{\eta} \left[ x-\operatorname{prox}_{\eta\lambda R}(x-\eta\nabla f(x)) \right]. \]
当 \(G_\eta^{\text{prox}}(x)=0\) 时,
\[ 0\in \nabla f(x)+\lambda\partial R(x), \]
这正是 composite objective 的一阶最优条件。对 L1 来说,这个条件会变成:非零坐标上 gradient 抵消 \(\lambda\operatorname{sign}(x_i)\);零坐标上 gradient 落在 \([-\lambda,\lambda]\) 内。
AdamW-style weight decay implements a smooth \(\ell_2\)-like shrinkage, not exact sparsity. If the goal is true zero coefficients, use a proximal L1 step or an explicit pruning rule.
Proximal Point and Moreau Envelope
Proximal gradient 可以看成更基本的 proximal point method 的近似。对任意 convex \(F\),proximal point update 是:
\[ x_{k+1} = \arg\min_x \left[ F(x)+\frac{1}{2\eta}\|x-x_k\|^2 \right]. \]
这一步不是沿梯度走,而是求一个被二次项稳定住的完整子问题。它的最优条件是:
\[ 0\in \partial F(x_{k+1}) +\frac{1}{\eta}(x_{k+1}-x_k), \]
也就是:
\[ \frac{1}{\eta}(x_k-x_{k+1}) \in \partial F(x_{k+1}). \]
所以 proximal point 使用的是新点 \(x_{k+1}\) 处的 subgradient,属于 implicit update;普通 subgradient descent 使用旧点 \(x_k\) 处的 subgradient,属于 explicit update。implicit update 更稳定,但子问题通常更贵。
The Moreau envelope of \(F\) is \[ F_\eta(x) = \min_z \left[ F(z)+\frac{1}{2\eta}\|z-x\|^2 \right]. \] It is a smooth approximation of a possibly nonsmooth convex function.
若 \(z^\star=\operatorname{prox}_{\eta F}(x)\),Moreau envelope 的梯度为:
\[ \nabla F_\eta(x) = \frac{1}{\eta}(x-z^\star) = \frac{1}{\eta} \left(x-\operatorname{prox}_{\eta F}(x)\right). \]
这就是 proximal gradient mapping 的来源。不可微目标经过 Moreau envelope 后,有了一个可微的“平滑外壳”;prox step 给出的残差就是这个外壳的梯度。
定义
\[ \phi(x,z)=F(z)+\frac{1}{2\eta}\|z-x\|^2. \]
若 \(z^\star(x)\) 是唯一 minimizer,Danskin theorem 给:
\[ \nabla F_\eta(x) = \nabla_x \phi(x,z^\star(x)) = \frac{1}{\eta}(x-z^\star(x)). \]
直觉上,虽然 \(z^\star\) 随 \(x\) 变化,但在最优点处对 \(z\) 的一阶变化已经为零,所以求 \(x\) 的导数时只需保留显式依赖项。
Coordinate Descent
Coordinate descent 每次只更新一个坐标或一个 block:
\[ x_{k+1} = x_k+\alpha_k e_{i_k}, \]
其中 \(e_{i_k}\) 是第 \(i_k\) 个坐标方向。它适合 feature 很高维、单坐标更新很便宜、目标有 separable regularizer 的问题,例如 Lasso:
\[ \min_w \frac{1}{2n}\|Xw-y\|_2^2+\lambda\|w\|_1. \]
固定其他坐标,只更新 \(w_j\)。令 residual excluding coordinate \(j\) 为:
\[ r_j = y-\sum_{\ell\neq j}X_\ell w_\ell. \]
一维子问题:
\[ \min_u \frac{1}{2n}\|X_j u-r_j\|^2+\lambda |u|. \]
展开得到:
\[ \min_u \frac{a_j}{2}u^2-b_j u+\lambda|u|, \qquad a_j=\frac{\|X_j\|^2}{n}, \quad b_j=\frac{X_j^\top r_j}{n}. \]
解是 soft-thresholding:
\[ w_j \leftarrow \frac{1}{a_j} S_\lambda(b_j), \qquad S_\lambda(b)=\operatorname{sign}(b)\max(|b|-\lambda,0). \]
一维目标的 subgradient 条件是:
\[ 0\in a_j u-b_j+\lambda\partial |u|. \]
若 \(u>0\),则 \(a_j u-b_j+\lambda=0\),得到 \(u=(b_j-\lambda)/a_j\),要求 \(b_j>\lambda\)。
若 \(u<0\),则 \(a_j u-b_j-\lambda=0\),得到 \(u=(b_j+\lambda)/a_j\),要求 \(b_j<-\lambda\)。
若 \(|b_j|\leq\lambda\),则 \(u=0\) 满足 \(0\in -b_j+\lambda[-1,1]\)。合并三种情况得到 soft-thresholding。
Coordinate descent 的选择策略包括 cyclic、randomized、Gauss-Southwell。Randomized coordinate descent 对 convex smooth functions 有期望收敛保证;Gauss-Southwell 选择最大梯度坐标,单步进展可能更大,但需要扫描梯度。
Mirror Descent
Gradient descent 默认使用欧氏几何:
\[ \frac{1}{2}\|x-y\|_2^2. \]
Mirror descent 用 Bregman divergence 替代欧氏距离。它回答的问题是:如果变量本身不是自然生活在欧氏空间里,比如 probability simplex、positive cone、density ratio、attention distribution,那么“沿负梯度走一步”应该以什么几何度量为近?
For strictly convex differentiable \(\psi\), \[ D_\psi(x,y) = \psi(x)-\psi(y)-\nabla\psi(y)^\top(x-y). \]
这里 \(\psi\) 叫 mirror map 或 distance-generating function。\(D_\psi\) 一般不是对称距离:
\[ D_\psi(x,y)\neq D_\psi(y,x), \]
也不一定满足 triangle inequality。它的作用不是成为 metric,而是衡量“用 \(y\) 处一阶线性化近似 \(\psi\) 时,到了 \(x\) 还差多少曲率”。
如果 \(\psi(x)=\frac12\|x\|_2^2\),则
\[ D_\psi(x,y) = \frac12\|x-y\|_2^2. \]
所以普通 GD/PGD 是 mirror descent 的欧氏特例。
For differentiable strictly convex \(\psi\), \[ D_\psi(u,v)+D_\psi(v,w)-D_\psi(u,w) = (\nabla\psi(w)-\nabla\psi(v))^\top(u-v). \]
按定义展开三项:
\[ D_\psi(u,v) = \psi(u)-\psi(v)-\nabla\psi(v)^\top(u-v), \]
\[ D_\psi(v,w) = \psi(v)-\psi(w)-\nabla\psi(w)^\top(v-w), \]
\[ D_\psi(u,w) = \psi(u)-\psi(w)-\nabla\psi(w)^\top(u-w). \]
前两项相加再减第三项,所有 \(\psi(\cdot)\) 消掉,只剩:
\[ -\nabla\psi(v)^\top(u-v) -\nabla\psi(w)^\top(v-w) +\nabla\psi(w)^\top(u-w). \]
因为 \(u-w=(u-v)+(v-w)\),整理得到:
\[ (\nabla\psi(w)-\nabla\psi(v))^\top(u-v). \]
这个 identity 是 mirror descent 收敛证明里的 telescope 工具。欧氏 GD 的距离展开
\[ \|x_{k+1}-x^\star\|^2 \]
在 mirror geometry 中会被替换成
\[ D_\psi(x^\star,x_k)-D_\psi(x^\star,x_{k+1}). \]
Mirror descent:
\[ x_{k+1} = \arg\min_{x\in\mathcal{X}} \left\{ \eta\nabla f(x_k)^\top x+D_\psi(x,x_k) \right\}. \]
直觉上,第一项是对 \(f\) 的一阶线性化,第二项是“不要离 \(x_k\) 太远”的几何惩罚。和 proximal gradient 的区别在于,这里的近端项不是欧氏平方,而是由 \(\psi\) 定义的 Bregman geometry。
Dual-Space Update
若没有约束且 minimizer 在 interior,一阶条件是:
\[ 0 = \eta g_k+\nabla\psi(x_{k+1})-\nabla\psi(x_k), \qquad g_k=\nabla f(x_k). \]
因此:
\[ \nabla\psi(x_{k+1}) = \nabla\psi(x_k)-\eta g_k. \]
这就是 mirror descent 的核心:不是在 primal variable \(x\) 上直接减梯度,而是在 dual coordinate
\[ \theta=\nabla\psi(x) \]
里做普通梯度步,然后通过 inverse mirror map 回到 primal space:
\[ x_{k+1} = \nabla\psi^\star \left( \nabla\psi(x_k)-\eta g_k \right). \]
其中 \(\psi^\star\) 是 Fenchel conjugate,并且在 Legendre-type 条件下
\[ \nabla\psi^\star=(\nabla\psi)^{-1}. \]
A mirror map is a strictly convex differentiable function \(\psi\) whose gradient maps primal variables into dual coordinates. Mirror descent performs an additive gradient step in dual coordinates and maps back to the primal domain.
欧氏情况中 \(\psi(x)=\frac12\|x\|^2\),有 \(\nabla\psi(x)=x\),所以 dual coordinate 和 primal coordinate 相同:
\[ x_{k+1}=x_k-\eta g_k. \]
这说明普通 GD 隐含了一个很强的假设:所有坐标的自然尺度就是 Euclidean scale。AdaGrad、natural gradient、trust-region policy optimization、KL-regularized RL 都是在不同程度上打破这个假设。
Negative Entropy and Exponentiated Gradient
若 \(\mathcal{X}\) 是 probability simplex:
\[ \Delta^d = \left\{ x\in\mathbb{R}^d: x_i\geq0,\ \sum_i x_i=1 \right\}, \]
欧氏投影会产生 dense thresholding:
\[ \Pi_{\Delta}(z)_i=\max(z_i-\tau,0). \]
但概率分布更自然的几何是 KL divergence。选 negative entropy:
\[ \psi(x)=\sum_i x_i\log x_i, \]
则在 simplex interior 中:
\[ \nabla\psi(x)_i=1+\log x_i, \]
并且 Bregman divergence 是:
\[ D_\psi(x,y) = \sum_i x_i\log\frac{x_i}{y_i}. \]
这正是 \(\operatorname{KL}(x\|y)\)。
Negative entropy generates KL divergence as its Bregman divergence, so a mirror step on the simplex preserves positivity by multiplicative rather than additive updates.
把 negative entropy 代入 mirror descent:
\[ x_{k+1} = \arg\min_{x\in\Delta^d} \left[ \eta g_k^\top x + \sum_i x_i\log\frac{x_i}{x_{k,i}} \right]. \]
写 Lagrangian:
\[ \mathcal{L}(x,\lambda) = \eta\sum_i g_i x_i + \sum_i x_i\log\frac{x_i}{x_{k,i}} + \lambda\left(\sum_i x_i-1\right). \]
对 \(x_i\) 求导:
\[ 0 = \eta g_i+\log\frac{x_i}{x_{k,i}}+1+\lambda. \]
于是:
\[ x_i = x_{k,i}\exp(-\eta g_i-1-\lambda). \]
归一化常数由 \(\sum_i x_i=1\) 决定:
\[ x_{k+1,i} = \frac{x_{k,i}\exp(-\eta g_{k,i})} {\sum_j x_{k,j}\exp(-\eta g_{k,j})}. \]
这就是 exponentiated gradient:
\[ x_{k+1,i} \propto x_{k,i}\exp(-\eta g_{k,i}). \]
它和 additive update 的差别很大。若 \(x_{k,i}=0\),则后续仍为 \(0\);若 \(x_{k,i}>0\),则更新后仍严格为正。因此它天然保持 simplex interior,不需要每步做 Euclidean projection。
数值实现需要用 log-space,避免 \(\exp\) 溢出:
import torch
def exponentiated_gradient_step(x, grad, lr):
# x is row-wise simplex: x >= 0, x.sum(-1) = 1.
if torch.any(x.sum(dim=-1) <= 0):
raise ValueError("each row must contain positive probability mass")
log_x = torch.where(x > 0, x.log(), torch.full_like(x, -torch.inf))
logits = log_x - lr * grad
return torch.softmax(logits, dim=-1)最小 smoke tests 应该检查非负性、归一化,以及零坐标策略是否符合预期:
def check_simplex_step(x, grad, lr):
y = exponentiated_gradient_step(x, grad, lr)
assert torch.all(y >= 0)
assert torch.allclose(y.sum(dim=-1), torch.ones_like(y.sum(dim=-1)))
zero_mask = x == 0
assert torch.all(y[zero_mask] == 0)
return yExponentiated gradient preserves support. If a coordinate is exactly zero, it stays zero. For learnable mixture weights, initialize with small positive mass if every expert/class/action should remain reachable.
Mirror Descent Rate
Mirror descent 的标准收敛证明和 subgradient method 很像,只是距离项换成 \(D_\psi\)。设 \(g_k\in\partial f(x_k)\),且 \(\psi\) 对某个 norm \(\|\cdot\|\) 是 \(1\)-strongly convex:
\[ D_\psi(x,y)\geq\frac12\|x-y\|^2. \]
若 dual norm 下 \(\|g_k\|_\star\leq G\),且存在 \(R^2\geq D_\psi(x^\star,x_0)\),取 \(\eta=R/(G\sqrt{K})\),则 averaged iterate 满足:
\[ f(\bar{x}_K)-f(x^\star) \leq \frac{RG}{\sqrt{K}}. \]
由 mirror step 的最优性,对任意 \(u\in\mathcal{X}\):
\[ \eta g_k^\top(x_{k+1}-u) +D_\psi(x_{k+1},x_k) \leq D_\psi(u,x_k)-D_\psi(u,x_{k+1}). \]
取 \(u=x^\star\),并把 \(g_k^\top(x_k-x^\star)\) 拆成:
\[ g_k^\top(x_k-x^\star) = g_k^\top(x_k-x_{k+1}) +g_k^\top(x_{k+1}-x^\star). \]
第一项用 Young inequality 和 strong convexity 控制:
\[ \eta g_k^\top(x_k-x_{k+1}) \leq \frac{\eta^2}{2}\|g_k\|_\star^2 +\frac12\|x_k-x_{k+1}\|^2 \leq \frac{\eta^2G^2}{2} +D_\psi(x_{k+1},x_k). \]
代回后 \(D_\psi(x_{k+1},x_k)\) 抵消,得到:
\[ \eta\left(f(x_k)-f(x^\star)\right) \leq D_\psi(x^\star,x_k)-D_\psi(x^\star,x_{k+1}) +\frac{\eta^2G^2}{2}. \]
对 \(k\) 求和,Bregman 项 telescope,再除以 \(\eta K\),并用 Jensen inequality 得到:
\[ f(\bar{x}_K)-f(x^\star) \leq \frac{D_\psi(x^\star,x_0)}{\eta K} +\frac{\eta G^2}{2}. \]
选择 \(\eta=R/(G\sqrt{K})\) 得到 \(O(1/\sqrt{K})\)。
这说明 mirror descent 的 rate 不神秘:它和 subgradient method 同阶,但常数里的“半径”从 Euclidean distance 换成了 Bregman divergence。对于 simplex,negative entropy 给出的半径是:
\[ D_\psi(x^\star,x_0) = \operatorname{KL}(x^\star\|x_0). \]
若 \(x_0\) 是 uniform distribution,则:
\[ \operatorname{KL}(x^\star\|x_0) \leq \log d. \]
这就是 online learning 里 multiplicative weights 常出现 \(\sqrt{K\log d}\) regret 的来源,而不是 \(\sqrt{Kd}\)。当维度很高但最优分布只在 simplex 上移动时,KL geometry 比 Euclidean geometry 更贴合问题结构。
Bregman Proximal Updates
Mirror descent 也可以和 regularizer 结合,形成 Bregman proximal step:
\[ x_{k+1} = \arg\min_{x\in\mathcal{X}} \left[ \eta g_k^\top x +\eta R(x) +D_\psi(x,x_k) \right]. \]
若 \(\psi\) 是 negative entropy,\(R(x)\) 又是一个 KL penalty:
\[ R(x)=\tau\operatorname{KL}(x\|q), \]
则更新仍有闭式结构:
\[ x_{k+1} \propto \exp \left( \frac{ \log x_k-\eta g_k+\eta\tau\log q }{1+\eta\tau} \right). \]
这个式子可以理解为三股力量的几何平均:
- 保留旧分布 \(x_k\);
- 按梯度 \(\exp(-\eta g)\) 移动;
- 被 reference distribution \(q\) 拉回。
它和 LLM post-training 中的 KL-regularized policy update 有相同骨架。若 policy 是离散分布 \(\pi(\cdot\mid s)\),reward maximization 加 KL penalty 可写成:
\[ \pi^+ = \arg\max_{\pi\in\Delta} \left[ \mathbb{E}_{a\sim\pi} r(a) -\tau\operatorname{KL}(\pi\|\pi_{\text{ref}}) \right]. \]
一阶条件给:
\[ \pi^+(a) \propto \pi_{\text{ref}}(a)\exp(r(a)/\tau). \]
这就是 reward-tilted reference distribution。它解释了为什么 KL penalty 不只是“防止模型乱跑”的工程项,而是把 policy update 放在 reference policy 定义的 information geometry 中。
如果变量是概率分布,检查更新后是否非负、归一化、support 是否符合设计;如果变量是普通权重向量,检查 Euclidean norm/update ratio 更自然。错用几何会让 optimizer 看似可运行,却给出错误的 stationarity signal。
Newton Method
二阶 Taylor approximation:
\[ f(x+\Delta) \approx f(x)+\nabla f(x)^\top\Delta +\frac{1}{2}\Delta^\top\nabla^2 f(x)\Delta. \]
最小化右侧得到:
\[ \nabla^2f(x)\Delta=-\nabla f(x), \]
\[ x_{k+1}=x_k-\left(\nabla^2f(x_k)\right)^{-1}\nabla f(x_k). \]
Newton 利用曲率修正不同方向的尺度。完整 Hessian 在深度学习里几乎不可用,但二阶思想出现在 natural gradient、K-FAC、Shampoo、L-BFGS 和 adaptive preconditioning 中。
Damped Newton and Quadratic Convergence
裸 Newton step 在局部很好,但离 optimum 远时可能走出 trust region。实际 convex solver 常用 damped Newton:
\[ x_{k+1}=x_k+\alpha_k\Delta x_k, \qquad \Delta x_k = -\nabla^2 f(x_k)^{-1}\nabla f(x_k), \]
其中 \(\alpha_k\) 由 backtracking line search 选择,使 Armijo 条件成立:
\[ f(x_k+\alpha\Delta x_k) \leq f(x_k)+c\alpha\nabla f(x_k)^\top\Delta x_k, \qquad 0<c<1. \]
Newton decrement
\[ \lambda(x)^2 = \nabla f(x)^\top \nabla^2 f(x)^{-1} \nabla f(x) = -\nabla f(x)^\top\Delta x \]
衡量二次模型预测还能下降多少。当 \(\lambda(x)\) 足够小且 Hessian Lipschitz 连续时,进入局部区域后可以取 \(\alpha=1\),误差近似满足:
\[ \|x_{k+1}-x^\star\| \leq C\|x_k-x^\star\|^2. \]
这叫 quadratic convergence:正确位数大致每步翻倍。它解释了为什么小中型凸问题中二阶方法非常强,也解释了为什么大模型训练很少直接用完整 Newton:每步解线性系统和存 Hessian 太贵,只有近似二阶或矩阵-free 方法才可能落地。
Conjugate Gradient for Quadratics
对二次凸问题:
\[ \min_x f(x) = \frac{1}{2}x^\top A x-b^\top x, \qquad A\succ0, \]
最优条件是线性系统:
\[ Ax=b. \]
Gradient descent 使用 steepest descent direction \(r_k\),其中 residual 定义为负梯度:
\[ r_k=b-Ax_k=-\nabla f(x_k). \]
Conjugate gradient 选择一组 \(A\)-conjugate directions:
\[ p_i^\top A p_j=0, \qquad i\neq j. \]
更新:
\[ \alpha_k = \frac{r_k^\top r_k}{p_k^\top A p_k}, \qquad x_{k+1}=x_k+\alpha_k p_k, \]
\[ r_{k+1}=r_k-\alpha_k A p_k, \qquad \beta_k=\frac{r_{k+1}^\top r_{k+1}}{r_k^\top r_k}, \]
\[ p_{k+1}=r_{k+1}+\beta_k p_k. \]
在 exact arithmetic 中,CG 最多 \(d\) 步解出 \(d\) 维 SPD system。它的深度学习意义不在于直接替代 Adam,而在于说明“好的方向”不是普通梯度方向,而是要避免重复修正已经处理过的曲率方向。Hessian-free optimization、L-BFGS 和一些二阶预条件方法都借用了这种思想。
The \(k\)-th Krylov subspace generated by \(A\) and \(r_0\) is \[ \mathcal{K}_k(A,r_0) = \operatorname{span}\{r_0,Ar_0,\ldots,A^{k-1}r_0\}. \] Conjugate gradient finds the best quadratic minimizer over \(x_0+\mathcal{K}_k(A,r_0)\).
Interior-Point and Log Barriers
对 inequality constrained convex problem:
\[ \min_x f_0(x) \quad \text{s.t. } f_i(x)\leq0,\ i=1,\ldots,m, \]
interior-point method 把约束放进 barrier:
\[ \phi(x) = -\sum_{i=1}^m\log(-f_i(x)). \]
求解一系列 unconstrained problems:
\[ \min_x t f_0(x)+\phi(x), \]
其中 \(t\) 逐渐增大。\(t\) 小时 barrier 很强,iterate 留在 feasible interior;\(t\) 大时目标更接近原问题。
中心路径:
\[ x^\star(t) = \arg\min_x t f_0(x)+\phi(x). \]
Barrier objective 的 Newton step:
\[ \Delta x = -\left(\nabla^2(tf_0+\phi)(x)\right)^{-1} \nabla(tf_0+\phi)(x). \]
Newton decrement:
\[ \lambda(x)^2 = \nabla F(x)^\top \left(\nabla^2F(x)\right)^{-1} \nabla F(x), \]
衡量当前位置离 Newton optimum 的距离。
Log-barrier methods need an initial point with \(f_i(x)<0\). If the feasible region has no easy interior point, a phase-I problem is needed.
Interior-point 在深度学习主训练里不常用,因为 Hessian 和 strict feasibility 都贵;但它是理解 constrained optimization、KKT residual、primal-dual solvers 的核心方法。
Duality
约束问题:
\[ \min_x f(x) \quad \text{s.t. } c_i(x)\leq0. \]
Lagrangian:
\[ \mathcal{L}(x,\lambda) = f(x)+\sum_i\lambda_i c_i(x), \qquad \lambda_i\geq0. \]
dual function:
\[ g(\lambda)=\inf_x\mathcal{L}(x,\lambda). \]
对任意 feasible \(x\) 和 \(\lambda\geq0\):
\[ \mathcal{L}(x,\lambda) = f(x)+\sum_i\lambda_i c_i(x) \leq f(x). \]
所以:
\[ g(\lambda)\leq p^\star. \]
For any dual feasible \(\lambda\geq0\), \[ g(\lambda)\leq p^\star. \]
在满足 Slater condition 的 convex problems 中,strong duality 成立。GAN、RL、constrained policy optimization 和 distributionally robust optimization 中都能看到 saddle-point/duality 的影子。
Fenchel Conjugate and Primal-Dual Gaps
Fenchel conjugate 把一个 convex function 写成所有线性下界的上包络:
\[ f^\star(y) = \sup_x \left\{ y^\top x-f(x) \right\}. \]
Fenchel-Young inequality 直接来自定义:
\[ f(x)+f^\star(y)\geq x^\top y. \]
等号成立当且仅当 \(y\in\partial f(x)\)。这是很多 primal-dual algorithm 的核心。
For closed convex \(f\), \[ f(x)+f^\star(y)=x^\top y \] if and only if \(y\in\partial f(x)\).
由 conjugate 定义:
\[ f^\star(y) = \sup_u(y^\top u-f(u)) \geq y^\top x-f(x). \]
移项得到 Fenchel-Young inequality。
若等号成立,\(x\) 达到上确界,所以对所有 \(u\):
\[ y^\top x-f(x) \geq y^\top u-f(u), \]
即
\[ f(u)\geq f(x)+y^\top(u-x), \]
所以 \(y\in\partial f(x)\)。反方向由 subgradient inequality 取 supremum 即得。
看 Lasso 的对偶可以看到 gap 如何变成可计算停止准则。原问题:
\[ \min_w \frac{1}{2}\|Xw-y\|^2+\lambda\|w\|_1. \]
引入 residual \(r=Xw-y\),对偶变量 \(\alpha\) 对应约束 \(r=Xw-y\)。推导可得到对偶:
\[ \max_\alpha -\frac{1}{2}\|\alpha\|^2-y^\top\alpha \quad \text{s.t. } \|X^\top\alpha\|_\infty\leq\lambda. \]
任意 primal feasible \(w\) 和 dual feasible \(\alpha\) 都给出 gap:
\[ \operatorname{gap}(w,\alpha) = \left[ \frac{1}{2}\|Xw-y\|^2+\lambda\|w\|_1 \right] - \left[ -\frac{1}{2}\|\alpha\|^2-y^\top\alpha \right] \geq0. \]
这比只看参数变化更强:dual gap 是目标最优性的证书。很多成熟 convex solver 停止时报告的不是“loss 好像不变了”,而是 primal residual、dual residual 和 primal-dual gap。
A primal-dual gap is the difference between a primal feasible objective value and a dual feasible objective value. For convex problems with strong duality, a zero gap certifies optimality.
ADMM and Operator Splitting
很多问题可以写成:
\[ \min_{x,z} f(x)+g(z) \quad \text{s.t. } Ax+Bz=c. \]
Augmented Lagrangian:
\[ \mathcal{L}_\rho(x,z,y) = f(x)+g(z)+y^\top(Ax+Bz-c) +\frac{\rho}{2}\|Ax+Bz-c\|^2. \]
ADMM 交替更新:
\[ x^{k+1} = \arg\min_x \mathcal{L}_\rho(x,z^k,y^k), \]
\[ z^{k+1} = \arg\min_z \mathcal{L}_\rho(x^{k+1},z,y^k), \]
\[ y^{k+1} = y^k+\rho(Ax^{k+1}+Bz^{k+1}-c). \]
ADMM 的价值是 splitting:\(f\) 和 \(g\) 可以分别处理。例如 Lasso 可以拆成 smooth least squares 和 nonsmooth \(\ell_1\) prox。它比单纯 penalty method 更稳定,因为 dual variable \(y\) 会记住约束违反方向。
For constraint \(Ax+Bz=c\), the primal residual is \[ r^{k+1}=Ax^{k+1}+Bz^{k+1}-c, \] and the dual residual tracks the change induced by the \(z\) update, commonly \[ s^{k+1}=\rho A^\top B(z^{k+1}-z^k). \]
收敛诊断通常同时看 \(\|r^k\|\) 和 \(\|s^k\|\):前者说明约束是否满足,后者说明 dual/consensus 是否稳定。
Worked Example: ADMM for Lasso
Lasso:
\[ \min_x \frac{1}{2}\|Ax-b\|_2^2 +\lambda\|x\|_1 \]
把 smooth 和 nonsmooth 拆开:
\[ \min_{x,z} \frac{1}{2}\|Ax-b\|_2^2 +\lambda\|z\|_1 \quad \text{s.t. }x-z=0. \]
使用 scaled dual variable \(u=y/\rho\),ADMM 更新是:
\[ x^{k+1} = \arg\min_x \frac{1}{2}\|Ax-b\|_2^2 +\frac{\rho}{2}\|x-z^k+u^k\|_2^2, \]
这个子问题有线性系统:
\[ (A^\top A+\rho I)x^{k+1} = A^\top b+\rho(z^k-u^k). \]
\(z\) 更新是 proximal step:
\[ z^{k+1} = \arg\min_z \lambda\|z\|_1 +\frac{\rho}{2} \|x^{k+1}-z+u^k\|_2^2 = S_{\lambda/\rho}(x^{k+1}+u^k), \]
其中 soft-thresholding 为:
\[ S_\tau(a) = \operatorname{sign}(a)\max(|a|-\tau,0). \]
dual update:
\[ u^{k+1} = u^k+x^{k+1}-z^{k+1}. \]
这个例子体现 ADMM 的真正价值:\(x\) 子问题处理数据拟合和线性代数,\(z\) 子问题处理稀疏先验,dual variable 负责让两者达成一致。深度学习中不常把主训练写成 ADMM,但 LoRA 稀疏化、模型剪枝、约束蒸馏、federated/consensus optimization 都会借用这种 splitting 思想。
Method Selection Map
| Problem structure | Natural method | Why |
|---|---|---|
| smooth convex, cheap gradient | GD / accelerated GD | clean rates and simple implementation |
| strongly convex, ill-conditioned | preconditioned GD / Newton / CG | curvature correction |
| nonsmooth but proximable | proximal gradient / ADMM | separate smooth and nonsmooth pieces |
| constrained with cheap projection | projected GD | simple feasibility maintenance |
| constrained with expensive projection | Frank-Wolfe | linear oracle instead of projection |
| high-dimensional sparse features | coordinate descent | cheap coordinate updates |
| inequality constrained and smooth | interior-point | strong KKT residual control |
深度学习训练很少完整满足这些 convex assumptions,但这些方法提供了一张地图:当我们设计 AdamW、gradient clipping、LoRA regularization、RL constraints、MoE load balancing 或 distributed optimizer 时,其实都在借用这张地图上的局部工具。