3.1 Convex Optimization


深度学习里的 optimization 并不等于 convex optimization:神经网络 loss 通常高度非凸,训练目标也不只是最小训练误差,而是寻找能泛化的参数区域。但凸优化仍然是理解 optimizer、regularization、duality、projection 和 convergence 的语言基础。

Optimization vs. Learning

NoteDefinition: Empirical Risk Minimization

Given a model \(f_\theta\), loss \(\ell\), and dataset \(\mathcal{D}=\{(x_i,y_i)\}_{i=1}^{n}\), empirical risk minimization solves \[ \min_{\theta}\hat{R}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\ell(f_\theta(x_i),y_i). \] The learning problem cares about the population risk \(R(\theta)=\mathbb{E}_{(x,y)\sim p_{\text{data}}}\ell(f_\theta(x),y)\), not only \(\hat{R}(\theta)\).

优化关注的是“给定 objective 后如何下降”;学习关注的是“这个 objective 是否让模型在未知数据上仍然正确”。这就是为什么深度学习里训练 loss 更低不必然代表模型更好。

一个有用的分解是:

model class + loss + data distribution + optimizer + regularization

Convex optimization 主要控制的是 loss landscape 和 feasible set 的几何。如果 model class 是线性的,loss 是 convex 的,regularizer 也是 convex 的,那么 empirical objective 往往是 convex 的;如果 model class 是深层神经网络,参数到预测的映射已经非线性嵌套,整体目标通常非凸。

Problem Convex in parameters? Why
linear regression + squared loss yes quadratic objective with PSD Hessian
logistic regression + CE yes log-sum-exp structure
linear SVM + hinge loss yes max of affine functions
one-hidden-layer ReLU net generally no product/composition of trainable layers
matrix factorization generally no bilinear parameters
Transformer training no deep composition, attention, normalization

所以我们学习凸优化不是为了假装 neural net 是凸的,而是为了获得一套局部分析语言:下降方向、曲率、约束、对偶、投影、正则化和稳定性。

Convex Sets and Convex Functions

NoteDefinition: Convex Set

A set \(\mathcal{X}\) is convex if for all \(x,y\in\mathcal{X}\) and \(\lambda\in[0,1]\), \[ \lambda x+(1-\lambda)y\in\mathcal{X}. \]

NoteDefinition: Convex Function

A function \(f:\mathcal{X}\rightarrow\mathbb{R}\) on a convex set \(\mathcal{X}\) is convex if \[ f(\lambda x+(1-\lambda)y) \leq \lambda f(x)+(1-\lambda)f(y) \] for all \(x,y\in\mathcal{X}\) and \(\lambda\in[0,1]\).

几何上,convex function 的函数图像永远在 chord 下方。这个定义最重要的后果是:局部最小值就是全局最小值。

NoteDefinition: Epigraph

The epigraph of a function \(f\) is \[ \operatorname{epi}(f) = \{(x,t): f(x)\leq t\}. \] A function is convex if and only if its epigraph is a convex set.

Epigraph 视角很适合理解很多“max”形式的 loss。比如

\[ f(x)=\max_i f_i(x) \]

如果每个 \(f_i\) 都是 convex,那么 \(f\) 也是 convex,因为 \(f(x)\leq t\) 等价于所有 \(f_i(x)\leq t\),epigraph 是多个 convex epigraph 的交集。

ImportantTheorem: Local Minimum Is Global for Convex Functions

If \(f\) is convex on \(\mathcal{X}\), any local minimizer \(x^\star\) is also a global minimizer.

假设存在 \(y\) 使得 \(f(y)<f(x^\star)\)。取足够小的 \(\lambda\in(0,1)\),点 \[ z=\lambda y+(1-\lambda)x^\star \] 落在 \(x^\star\) 的局部邻域内。由凸性 \[ f(z)\leq \lambda f(y)+(1-\lambda)f(x^\star)<f(x^\star), \] 这与 \(x^\star\) 是局部最小值矛盾。

How to Prove Convexity in Practice

实际做推导时,不会每次都从定义硬证。常用规则:

Rule Statement Example
nonnegative sum \(\sum_i \alpha_i f_i\) convex if \(\alpha_i\geq0\) regularized ERM
affine composition \(f(Ax+b)\) convex if \(f\) convex linear model loss
pointwise max \(\max_i f_i(x)\) convex if each \(f_i\) convex hinge loss
norm every norm is convex \(\ell_1,\ell_2\) regularization
log-sum-exp \(\log\sum_i e^{z_i}\) convex softmax CE

对任意 norm \(\|\cdot\|\),由三角不等式和齐次性:

\[ \|\lambda x+(1-\lambda)y\| \leq \lambda\|x\|+(1-\lambda)\|y\|. \]

所以 norm 是 convex function。这就是 \(\ell_1\)\(\ell_2\) regularization 属于 convex penalties 的基础。

WarningPitfall: Composition Rules Have Conditions

If \(f\) and \(g\) are convex, \(f(g(x))\) is not automatically convex. Composition needs monotonicity or affine structure. This is one reason neural networks quickly leave the convex world.

First-Order and Second-Order Characterizations

如果 \(f\) 可微,则 convexity 等价于一阶下界:

\[ f(y)\geq f(x)+\nabla f(x)^\top(y-x). \]

这句话的意义是:凸函数在任一点的 tangent plane 都是全局 lower bound。优化算法里的 gradient step、mirror descent、Bregman divergence 都能从这里生长出来。

如果 \(f\) 二阶可微,则

\[ f\text{ is convex} \quad\Longleftrightarrow\quad \nabla^2 f(x)\succeq 0 \quad\forall x. \]

WarningPitfall: Nonnegative Coordinate Curvature Is Not Enough

For multivariate functions, checking only \(\partial^2 f/\partial x_i^2\geq 0\) is insufficient. Convexity requires the full Hessian to be positive semidefinite, including cross-coordinate curvature.

Curvature Constants as Algorithm Inputs

凸优化里经常出现三个常数:smoothness \(L\)、strong convexity \(\mu\)、condition number \(\kappa=L/\mu\)。它们不是纯理论装饰,而是在告诉 optimizer “能走多大步、不同方向曲率差多少”。

NoteDefinition: Smoothness, Strong Convexity, Condition Number

A differentiable function is \(L\)-smooth if \[ f(y)\leq f(x)+\nabla f(x)^\top(y-x)+\frac{L}{2}\|y-x\|^2. \] It is \(\mu\)-strongly convex if \[ f(y)\geq f(x)+\nabla f(x)^\top(y-x)+\frac{\mu}{2}\|y-x\|^2. \] For twice differentiable objectives, these correspond to \[ \mu I\preceq \nabla^2 f(x)\preceq LI. \] The condition number is \(\kappa=L/\mu\) when \(\mu>0\).

对 quadratic

\[ f(w)=\frac{1}{2}w^\top A w-b^\top w, \qquad A\succeq0, \]

Hessian 恒等于 \(A\)。若 \(A\) 的特征值满足

\[ 0<\lambda_{\min}(A)\leq \lambda_{\max}(A), \]

\[ \mu=\lambda_{\min}(A), \qquad L=\lambda_{\max}(A), \qquad \kappa=\frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}. \]

这就是 ill-conditioning 的几何含义:同一个 loss surface 在某些方向很陡、某些方向很平。固定学习率必须照顾最陡方向,于是在平方向进展很慢。后面讲 momentum、preconditioning、Adam、Newton、CG,本质都在处理这个曲率不均衡。

ImportantCurvature Constants Are Variable-Dependent

The same prediction problem can have different \(L,\mu,\kappa\) after feature scaling, reparameterization, whitening, or normalization. Conditioning is not only a property of the dataset; it is also a property of the chosen coordinates.

First-Order Optimality

如果 \(f\) convex 且可微,点 \(x^\star\) 是无约束全局最优点,当且仅当

\[ \nabla f(x^\star)=0. \]

为什么这足够?由一阶下界:

\[ f(y) \geq f(x^\star) + \nabla f(x^\star)^\top(y-x^\star) = f(x^\star). \]

在非凸问题中,\(\nabla f(x^\star)=0\) 只说明 stationary,不保证全局最优;在凸问题中,它直接给全局证书。

有约束时,最优性变成:

\[ \nabla f(x^\star)^\top(y-x^\star)\geq0, \qquad \forall y\in\mathcal{X}. \]

直觉是:从 \(x^\star\) 朝任何可行方向走,一阶近似都不能下降。

Nonsmooth Convex Functions and Subgradients

很多 ML objective 是 convex 但不可微的,例如 \(\ell_1\) regularization 和 hinge loss。

NoteDefinition: Subgradient

For a convex function \(f\), a vector \(g\) is a subgradient at \(x\) if \[ f(y)\geq f(x)+g^\top(y-x) \quad \forall y. \] The set of all subgradients is the subdifferential \(\partial f(x)\).

\(f(x)=|x|\)

\[ \partial |x| = \begin{cases} \{1\}, & x>0,\\ [-1,1], & x=0,\\ \{-1\}, & x<0. \end{cases} \]

对 hinge loss

\[ \ell(w;x,y)=\max(0,1-yw^\top x), \qquad y\in\{-1,1\}, \]

它是 affine functions 的 pointwise max,所以 convex。不可微点出现在 margin 正好为 \(1\) 的样本上。Subgradient method 仍然可以优化它:

\[ w_{t+1} = w_t-\eta_t g_t, \qquad g_t\in\partial \ell(w_t). \]

WarningPitfall: Nonsmooth Does Not Mean Nonconvex

ReLU, hinge loss, absolute value, and max functions can be nonsmooth yet convex in the right variables. Smoothness and convexity are different properties.

Jensen’s Inequality

ImportantTheorem: Jensen’s Inequality

For a convex function \(f\) and random variable \(X\), \[ f(\mathbb{E}[X])\leq \mathbb{E}[f(X)]. \] For weights \(\alpha_i\geq0\) with \(\sum_i\alpha_i=1\), \[ f\left(\sum_i\alpha_i x_i\right) \leq \sum_i\alpha_i f(x_i). \]

Jensen inequality 是 ELBO、variational inference、log-sum-exp bound 的共同根。比如因为 \(-\log\) 是 convex,

\[ \mathbb{E}_{Y}[-\log p(X\mid Y)] \geq -\log \mathbb{E}_{Y}[p(X\mid Y)]. \]

这正是从 latent variable model 推出 lower/upper bound 时反复出现的形状。

Worked Example: Least Squares and Ridge

线性回归的经验目标写成:

\[ f(w)=\frac{1}{2n}\|Xw-y\|_2^2, \qquad X\in\mathbb{R}^{n\times d}. \]

展开可得:

\[ f(w) = \frac{1}{2n} (w^\top X^\top Xw-2y^\top Xw+y^\top y). \]

Gradient 和 Hessian 是:

\[ \nabla f(w)=\frac{1}{n}X^\top(Xw-y), \qquad \nabla^2 f(w)=\frac{1}{n}X^\top X. \]

因为

\[ v^\top X^\top Xv=\|Xv\|_2^2\geq0, \]

所以 Hessian PSD,least squares 是 convex。它什么时候 strongly convex?当 \(X^\top X\) 满秩时:

\[ \lambda_{\min}(X^\top X)>0. \]

如果 \(d>n\) 或 feature 共线,\(X^\top X\) 奇异,存在平坦方向,最优解可能不唯一。

Ridge regression 加上 \(\ell_2\) penalty:

\[ f_\lambda(w) = \frac{1}{2n}\|Xw-y\|_2^2+\frac{\lambda}{2}\|w\|_2^2. \]

此时

\[ \nabla^2 f_\lambda(w) = \frac{1}{n}X^\top X+\lambda I \succeq \lambda I. \]

只要 \(\lambda>0\),目标就是 \(\lambda\)-strongly convex。闭式解来自 stationarity:

\[ \left(\frac{1}{n}X^\top X+\lambda I\right)w = \frac{1}{n}X^\top y, \]

也就是

\[ w^\star = (X^\top X+n\lambda I)^{-1}X^\top y. \]

WarningPitfall: Intercept Regularization Changes the Problem

If a bias/intercept is included by augmenting \(X\) with a column of ones, applying ridge to every coordinate also regularizes the intercept. Many ML libraries exclude the intercept from weight decay; the Hessian and closed-form solution then use a diagonal penalty matrix instead of \(\lambda I\).

这个例子把三件事串起来:convexity 来自 Hessian PSD;strong convexity 来自曲率下界;weight decay 不只是“防过拟合”,还改善了数值条件。

Log-Sum-Exp and Cross Entropy

Softmax cross entropy 的 convexity 来自 log-sum-exp。定义

\[ \operatorname{LSE}(z) = \log\sum_{k=1}^{K}\exp(z_k). \]

对 one-hot label \(y\),logit 为 \(z\) 的 cross entropy 是

\[ \ell(z,y) = - z_y + \log\sum_{k=1}^{K}\exp(z_k) = \operatorname{LSE}(z)-z_y. \]

\(-z_y\) 是 affine,\(\operatorname{LSE}\) 是 convex,因此 \(\ell(z,y)\) 对 logits \(z\) 是 convex。

\[ p_i=\frac{e^{z_i}}{\sum_j e^{z_j}}. \]

\[ \nabla \operatorname{LSE}(z)=p, \]

Hessian 为

\[ \nabla^2\operatorname{LSE}(z) = \operatorname{diag}(p)-pp^\top. \]

对任意向量 \(v\)

\[ v^\top(\operatorname{diag}(p)-pp^\top)v = \sum_i p_i v_i^2-\left(\sum_i p_i v_i\right)^2 = \operatorname{Var}_{i\sim p}(v_i) \geq0. \]

所以 Hessian PSD,LSE 是 convex。

这里要小心变量。Softmax CE 对 logits 是 convex;如果 logits 是线性函数 \(z=Wx+b\),则对 \(W,b\) 也是 convex。可是一旦 logits 来自深层网络 \(z=f_\theta(x)\),对 \(\theta\) 通常不再 convex。

ImportantTheorem: Binary Logistic Regression Is Convex

For labels \(y_i\in\{-1,1\}\) and linear score \(s_i=w^\top x_i\), the logistic objective \[ \sum_i \log(1+\exp(-y_i w^\top x_i)) \] is convex in \(w\).

函数

\[ \phi(u)=\log(1+\exp(-u)) \]

是 convex,因为

\[ \phi''(u) = \frac{e^u}{(1+e^u)^2} \geq0. \]

每个 margin \(u_i=y_iw^\top x_i\)\(w\) 的 affine function。Convex function 与 affine map 复合仍为 convex,所以每一项 \(\phi(y_iw^\top x_i)\)\(w\) convex,非负求和仍 convex。

Logistic Hessian and Lipschitz Constant

Binary logistic regression 的平均目标常写作:

\[ f(w)= \frac{1}{n}\sum_{i=1}^{n} \log(1+\exp(-y_i x_i^\top w)), \qquad y_i\in\{-1,1\}. \]

\[ p_i=\sigma(-y_i x_i^\top w). \]

\[ \nabla f(w) = -\frac{1}{n}\sum_i y_i p_i x_i. \]

再求二阶:

\[ \nabla^2 f(w) = \frac{1}{n} X^\top S X, \qquad S=\operatorname{diag}(p_i(1-p_i)). \]

因为 \(0\leq p_i(1-p_i)\leq1/4\),所以 Hessian PSD,并且

\[ \nabla^2 f(w) \preceq \frac{1}{4n}X^\top X. \]

因此一个全局 smoothness bound 是:

\[ L\leq \frac{1}{4n}\lambda_{\max}(X^\top X). \]

若再加 ridge:

\[ f_\lambda(w)=f(w)+\frac{\lambda}{2}\|w\|^2, \]

则 Hessian 变为

\[ \nabla^2 f_\lambda(w) = \frac{1}{n}X^\top S X+\lambda I \succeq \lambda I, \]

所以 \(\mu\geq\lambda\)。这解释了为什么 L2 regularization 同时有统计意义和优化意义:它让 objective 从“可能只有半正定曲率”变成有全局曲率下界。

Multiclass Softmax Regression

多分类线性 softmax regression 令

\[ z_i = W^\top x_i+b, \qquad p_i=\operatorname{softmax}(z_i). \]

单样本 loss 是

\[ \ell_i(W,b)=-z_{i,y_i}+\log\sum_{k}\exp z_{i,k}. \]

对 logits 的 Hessian 是

\[ C_i=\operatorname{diag}(p_i)-p_ip_i^\top\succeq0. \]

\(W\) 的二阶结构可以写成 Kronecker 形式:

\[ \nabla^2_W \ell_i = (x_i x_i^\top)\otimes C_i. \]

两个矩阵都是 PSD,Kronecker product 仍 PSD,所以线性 softmax regression 对 \(W,b\) 是 convex。这个推导比“log-sum-exp 是 convex”更工程化:它告诉你 Hessian-vector product 可以不用显式构造大 Hessian,而是利用 \(x_i\)\(C_i\) 的结构。

对单样本,logit perturbation 为

\[ \delta z = \delta W^\top x. \]

二阶变化量是

\[ \delta z^\top C\delta z = (\delta W^\top x)^\top C(\delta W^\top x). \]

\(\delta W\) 向量化后,

\[ \delta z = (I_K\otimes x^\top)\operatorname{vec}(\delta W), \]

所以 Hessian 是

\[ (I_K\otimes x)C(I_K\otimes x^\top) \]

等价地可写成 \((xx^\top)\otimes C\),取决于 vec 的排列约定。无论采用哪种布局,二次型非负,因此 PSD。

WarningPitfall: Linear Head Convexity Does Not Extend Through Features

If feature vectors \(h_i\) are fixed, softmax regression over the head weights is convex. If \(h_i=f_\phi(x_i)\) is trainable, the same loss is generally nonconvex in \(\phi\).

Constrained Optimization and Lagrangian

考虑约束优化:

\[ \begin{aligned} \min_x\quad & f(x)\\ \text{s.t.}\quad & c_i(x)\leq0,\quad i=1,\ldots,m. \end{aligned} \]

Lagrangian 为

\[ \mathcal{L}(x,\alpha) = f(x)+\sum_{i=1}^{m}\alpha_i c_i(x), \qquad \alpha_i\geq0. \]

直觉上,\(\alpha_i\) 是约束施加的“力”。如果约束没有贴边,即 \(c_i(x)<0\),那么它不该产生力,通常 \(\alpha_i=0\);如果约束 active,则 \(\alpha_i\) 调整目标梯度的方向。

NoteDefinition: KKT Conditions

For suitable convex problems, optimal \(x^\star,\alpha^\star\) satisfy \[ \nabla_x\mathcal{L}(x^\star,\alpha^\star)=0, \quad c_i(x^\star)\leq0, \quad \alpha_i^\star\geq0, \quad \alpha_i^\star c_i(x^\star)=0. \] The last condition is complementary slackness.

这四个条件分别对应四件事:

KKT part Formula Meaning
stationarity \(\nabla f(x^\star)+\sum_i\alpha_i^\star\nabla c_i(x^\star)=0\) objective gradient is balanced by active constraints
primal feasibility \(c_i(x^\star)\leq0\) candidate satisfies original constraints
dual feasibility \(\alpha_i^\star\geq0\) inequality multipliers push in the right direction
complementary slackness \(\alpha_i^\star c_i(x^\star)=0\) inactive constraints exert no force

对 convex problem,在 Slater condition 等 regularity 条件下,KKT 不只是必要条件,还是充分条件。也就是说,找到一组满足 KKT 的 primal-dual variables,就等于拿到了全局最优证书。

ImportantTheorem: KKT Sufficiency for Convex Problems

For a convex minimization problem with differentiable convex objective, convex inequality constraints, affine equality constraints, and strong duality, any primal-dual point satisfying KKT conditions is globally optimal.

\(x^\star,\alpha^\star\) 满足 KKT。对任意 feasible \(x\),因为 \(\alpha_i^\star\geq0\)\(c_i(x)\leq0\)

\[ \mathcal{L}(x,\alpha^\star) = f(x)+\sum_i\alpha_i^\star c_i(x) \leq f(x). \]

另一方面,stationarity 说明 \(x^\star\) 最小化 \(\mathcal{L}(\cdot,\alpha^\star)\)。因此

\[ \mathcal{L}(x^\star,\alpha^\star) \leq \mathcal{L}(x,\alpha^\star) \leq f(x). \]

由 complementary slackness,

\[ \mathcal{L}(x^\star,\alpha^\star) = f(x^\star)+\sum_i\alpha_i^\star c_i(x^\star) = f(x^\star). \]

所以 \(f(x^\star)\leq f(x)\) 对所有 feasible \(x\) 成立,\(x^\star\) 全局最优。

Worked KKT Example: Box-Constrained Quadratic

考虑一维问题:

\[ \min_x\ \frac{1}{2}(x-a)^2 \quad \text{s.t.}\quad 0\leq x\leq1. \]

把约束写成 \(-x\leq0\)\(x-1\leq0\),Lagrangian 是:

\[ \mathcal{L}(x,\alpha,\beta) = \frac{1}{2}(x-a)^2-\alpha x+\beta(x-1), \qquad \alpha,\beta\geq0. \]

Stationarity:

\[ x-a-\alpha+\beta=0. \]

\(0<x<1\),两个约束都 inactive,complementary slackness 给 \(\alpha=\beta=0\),所以 \(x=a\)。这只在 \(a\in(0,1)\) 时可行。若 \(a<0\),最优点在 \(x=0\),上界 inactive \(\beta=0\),stationarity 给 \(\alpha=-a>0\);若 \(a>1\),最优点在 \(x=1\),下界 inactive \(\alpha=0\),stationarity 给 \(\beta=a-1>0\)。最终:

\[ x^\star=\min(1,\max(0,a)). \]

这就是 projection/clamp 的 KKT 解释:被夹住的坐标不是“梯度消失”,而是约束 multiplier 抵消了继续往外走的梯度。

SVM Multipliers as Support-Vector Certificates

线性 hard-margin SVM 可以写成:

\[ \min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad 1-y_i(w^\top x_i+b)\leq0. \]

Lagrangian:

\[ \mathcal{L}(w,b,\alpha) = \frac{1}{2}\|w\|^2 +\sum_i\alpha_i\left(1-y_i(w^\top x_i+b)\right). \]

Stationarity 给:

\[ w=\sum_i\alpha_i y_i x_i, \qquad \sum_i\alpha_i y_i=0. \]

Complementary slackness:

\[ \alpha_i\left(1-y_i(w^\top x_i+b)\right)=0. \]

所以只有 margin 正好为 \(1\) 的样本可以有 \(\alpha_i>0\);这些样本就是 support vectors。离决策边界很远的样本约束 slack 为负,\(\alpha_i=0\),不会出现在最终的 \(w\) 展开式里。

这个例子很重要:dual variables 在这里不是抽象符号,而是在告诉你哪些训练样本真正决定了 classifier。后面看 attention pruning、active constraints、safety margin 或 MoE load-balance penalty 时,都可以用类似的“谁在施加约束力”视角。

Weak Duality, Strong Duality, and Certificates

Lagrangian dual function 定义为

\[ g(\alpha) = \inf_x \mathcal{L}(x,\alpha), \qquad \alpha\geq0. \]

对任意 feasible \(x\) 和任意 \(\alpha\geq0\)

\[ \mathcal{L}(x,\alpha) = f(x)+\sum_i\alpha_i c_i(x) \leq f(x), \]

因为 \(c_i(x)\leq0\)。所以

\[ g(\alpha)\leq f(x) \]

对所有 feasible \(x\) 成立。特别地,

\[ \max_{\alpha\geq0}g(\alpha) \leq p^\star, \]

其中 \(p^\star\) 是 primal optimum。这叫 weak duality。

ImportantTheorem: Weak Duality

The dual objective always gives a lower bound on the primal optimum for minimization problems with inequality constraints.

Strong duality 指 dual optimum 等于 primal optimum。凸问题在 Slater condition 等正则条件下常有 strong duality。它的意义是:dual variables 不只是推导技巧,而是最优性的 certificate。

在 SVM 里,dual variables 对应 support vectors;在 constrained optimization 里,KKT multiplier 表示约束的 shadow price;在 deep learning 里,这套语言帮助理解 penalty methods、augmented Lagrangian、constrained decoding 和 safety constraints。

Penalty, Projection, and Deep Learning

Regularization 可以从 constrained optimization 角度理解。例如约束 \(\|\theta\|_2^2\leq r^2\) 的 Lagrangian 会引出 weight decay:

\[ \min_\theta \hat{R}(\theta)+\frac{\lambda}{2}\|\theta\|_2^2. \]

Projection 则直接把参数或梯度投回可行域:

\[ \operatorname{Proj}_{\mathcal{X}}(x) = \arg\min_{z\in\mathcal{X}}\|x-z\|_2. \]

gradient clipping 本质上就是把梯度投影到半径为 \(\tau\)\(\ell_2\) ball:

\[ g \leftarrow g\cdot \min\left(1,\frac{\tau}{\|g\|_2}\right). \]

所以凸优化不是“只适用于凸模型”的旧知识;它提供的是一套语言,让我们能理解深度学习中约束、惩罚、投影和稳定训练的设计。

Implementation Check: Convex Objective vs. Convex Model

在代码里,一个 convex objective 往往非常普通:

logits = x @ w + b
loss = torch.nn.functional.cross_entropy(logits, y)
loss = loss + weight_decay * w.square().sum() / 2

如果 logitsx @ w + b,这个 objective 对 w,b 是 convex。若改成:

h = torch.relu(x @ w1 + b1)
logits = h @ w2 + b2
loss = torch.nn.functional.cross_entropy(logits, y)

同一个 CE loss 仍然对 logits convex,但对 (w1,b1,w2,b2) 不再 convex。很多“loss 是 convex 的”说法漏掉了“对哪个变量”。

WarningPitfall: Always Name the Variable

Convexity is a property of a function with respect to a variable. Cross entropy is convex in logits, logistic regression is convex in linear weights, but neural network training is generally nonconvex in network parameters.

因此后面分析 optimizer 时,我们经常借用 convex optimization 的局部工具,而不是声称深度网络满足全局凸性。

Implementation Check: Curvature Without Dense Hessians

小模型可以显式构造 Hessian;真实模型通常只做 Hessian-vector product。对 binary logistic regression,HVP 可以直接从公式实现:

def logistic_loss(w, x, y, l2):
    # y in {-1, 1}
    margin = y * (x @ w)
    return torch.nn.functional.softplus(-margin).mean() + 0.5 * l2 * w.dot(w)


def logistic_hvp(w, v, x, y, l2):
    logits = x @ w
    p = torch.sigmoid(-y * logits)
    s = p * (1 - p)
    xv = x @ v
    return x.T @ (s * xv) / x.shape[0] + l2 * v

可以用随机向量检查 PSD:

hv = logistic_hvp(w, v, x, y, l2)
quad = v.dot(hv)
if quad < -1e-6:
    raise ValueError("Hessian-vector product is not PSD")

还要检查数值稳定性。Logistic loss 不应写成 torch.log(1 + torch.exp(-margin)),因为大负 margin 会 overflow;应使用 softplus(-margin)logaddexp

ImportantConvex Solver Smoke Tests

For convex ML objectives, test the variable being optimized, gradient formula, Hessian-vector PSD property, regularized/unregularized intercept behavior, reduction denominator, and numerically stable loss expression.

这些检查让“凸优化基础”变成可以落地的工程习惯:你不只是知道某个目标是 convex,还能用梯度、HVP、曲率界和稳定实现去验证它。