2.1 Neural Network Theory Foundations
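A neural network is not an isolated stack of layers. At least three lines of theory sit behind it:
- probability: describe data generation with distributions, likelihoods, and latent variables;
- optimization: fit an objective with a differentiable, parameterized model;
- representation: turn raw inputs into usable features through multiple nonlinear layers.
This section ties the previously scattered topics of Gaussians, EM, Bayesian networks, AutoEncoder/VAE, and universal approximation to one shared question: how do we use neural networks to represent complex distributions or complex functions, and how does training make those representations useful?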
Multivariate Gaussian
A random vector \(x\in\mathbb{R}^d\) follows a Gaussian distribution \(\mathcal{N}(\mu,\Sigma)\) if its density is \[ p(x) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left( -\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu) \right). \]
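\(\mu\) sets the center, and \(\Sigma\) sets the scale and correlations. The quantity inside the exponent,
\[ D_M(x,\mu)^2 = (x-\mu)^\top\Sigma^{-1}(x-\mu) \]
is the squared Mahalanobis distance. Unlike ordinary Euclidean distance, it first rescales space by \(\Sigma^{-1}\) according to variance and correlation. Along a direction with large variance, a given offset is "less unusual"; along a direction with small variance, the same offset is amplified.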
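As a small illustration of the rescaling, the sketch below evaluates the Mahalanobis distance for a diagonal covariance in plain Python. The helper name `mahalanobis_diag` and the numbers are assumptions chosen for illustration; a full covariance would need a linear solve instead of a per-axis division.

```python
import math

def mahalanobis_diag(x, mu, var):
    """Mahalanobis distance for a diagonal covariance diag(var)."""
    return math.sqrt(sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mu, var)))

mu = [0.0, 0.0]
var = [9.0, 0.25]  # std 3 along axis 0, std 0.5 along axis 1

# Two points with the same Euclidean offset (1.5) from the mean.
d_wide = mahalanobis_diag([1.5, 0.0], mu, var)    # along the high-variance axis
d_narrow = mahalanobis_diag([0.0, 1.5], mu, var)  # along the low-variance axis

print(d_wide, d_narrow)  # 0.5 3.0
```

The same Euclidean offset counts as half a standard deviation along the wide axis and three standard deviations along the narrow one.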
Common special cases:
| Covariance | Density geometry | Use |
|---|---|---|
| \(\sigma^2I\) | isotropic sphere | diffusion noise, simple prior |
| diagonal \(\operatorname{diag}(\sigma_i^2)\) | axis-aligned ellipse | VAE encoder posterior |
| full \(\Sigma\) | rotated ellipse | correlated Gaussian model |
Diffusion models and VAEs usually use isotropic or diagonal Gaussians, which bring two engineering advantages: sampling is easy, and the KL divergence and likelihood often have closed forms.
Gaussian Reparameterization
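If
\[ z\sim\mathcal{N}(\mu,\operatorname{diag}(\sigma^2)), \]
then \(z\) can be written as
\[ z=\mu+\sigma\odot\epsilon, \qquad \epsilon\sim\mathcal{N}(0,I). \]
This is the reparameterization trick. The randomness comes from \(\epsilon\), while \(\mu\) and \(\sigma\) stay on a differentiable path, so gradients can be backpropagated through them.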

    import torch

    def sample_diag_gaussian(mu, logvar):
        std = torch.exp(0.5 * logvar)   # sigma = exp(logvar / 2), always positive
        eps = torch.randn_like(std)     # parameter-free noise, eps ~ N(0, I)
        return mu + std * eps           # differentiable in mu and logvar

The reparameterization trick expresses a random sample as a differentiable transformation of parameters and parameter-free noise, allowing gradients to flow through stochastic latent variables.
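To see the gradient path numerically without an autodiff library, the sketch below estimates the gradients of \(\mathbb{E}[z^2]\) by differentiating through \(z=\mu+\sigma\epsilon\) by hand. The test function \(z^2\), the helper name `pathwise_grads`, and the sample count are assumptions made for illustration.

```python
import random

def pathwise_grads(mu, sigma, n=200_000, seed=0):
    """Monte Carlo gradients of E[z^2] w.r.t. mu and sigma, with z = mu + sigma * eps."""
    rng = random.Random(seed)
    g_mu = g_sigma = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps
        g_mu += 2.0 * z            # d(z^2)/dmu    = 2z * dz/dmu,    with dz/dmu = 1
        g_sigma += 2.0 * z * eps   # d(z^2)/dsigma = 2z * dz/dsigma, with dz/dsigma = eps
    return g_mu / n, g_sigma / n

g_mu, g_sigma = pathwise_grads(mu=1.0, sigma=0.5)
# Analytic reference: E[z^2] = mu^2 + sigma^2, so the gradients are 2*mu = 2.0 and 2*sigma = 1.0.
```

The estimates should land close to the analytic values, which is what lets a framework backpropagate through a sampled latent.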
Gaussian Negative Log-Likelihood
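Many losses are the negative log-likelihood of some probabilistic assumption. If the model predicts a mean \(\mu_\theta(x)\) and the target \(y\in\mathbb{R}^d\) is assumed to satisfy
\[ y\mid x \sim \mathcal{N}(\mu_\theta(x),\sigma^2I), \]
then the negative log-likelihood is
\[ -\log p_\theta(y\mid x) = \frac{d}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2} \|y-\mu_\theta(x)\|_2^2. \]
If \(\sigma^2\) is fixed, the first term and the coefficient \(1/(2\sigma^2)\) are constants, so minimizing the Gaussian NLL is equivalent to minimizing the squared error, that is, MSE up to a constant factor. MSE therefore does not come from nowhere; it corresponds to an isotropic Gaussian observation model.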
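A quick numerical check of this equivalence, as a sketch in plain Python: with \(\sigma^2\) fixed, the NLL gap between two predictions equals their squared-error gap scaled by \(1/(2\sigma^2)\). The helper name `gaussian_nll` and all numbers are illustrative assumptions.

```python
import math

def gaussian_nll(y, mu, sigma2):
    """NLL of y under N(mu, sigma2 * I), for plain Python lists."""
    d = len(y)
    sq = sum((yi - mi) ** 2 for yi, mi in zip(y, mu))
    return 0.5 * d * math.log(2 * math.pi * sigma2) + sq / (2 * sigma2)

y = [1.0, -2.0, 0.5]
pred_a = [0.9, -1.5, 0.0]   # closer prediction
pred_b = [0.0, 0.0, 0.0]    # worse prediction
sigma2 = 0.7                # fixed, arbitrary

sq_a = sum((yi - mi) ** 2 for yi, mi in zip(y, pred_a))
sq_b = sum((yi - mi) ** 2 for yi, mi in zip(y, pred_b))

nll_gap = gaussian_nll(y, pred_b, sigma2) - gaussian_nll(y, pred_a, sigma2)
sq_gap = (sq_b - sq_a) / (2 * sigma2)   # the log-normalizer cancels in the difference
```

Because the two gaps agree, ranking predictions by NLL and by squared error gives the same order for any fixed \(\sigma^2\).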
If the model also predicts a diagonal variance,
\[ y_i\mid x \sim \mathcal{N}(\mu_i(x),\sigma_i(x)^2), \]
then:
\[ -\log p_\theta(y\mid x) = \frac{1}{2} \sum_i \left[ \log(2\pi) +\log\sigma_i^2 + \frac{(y_i-\mu_i)^2}{\sigma_i^2} \right]. \]
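In practice, models usually predict logvar rather than the variance itself:

    import math
    import torch

    def diag_gaussian_nll(y, mu, logvar):
        inv_var = torch.exp(-logvar)  # 1 / sigma^2, positive for any logvar
        nll = 0.5 * (math.log(2 * math.pi) + logvar + (y - mu).pow(2) * inv_var)
        return nll.sum(dim=-1)  # sum over feature dimensions, one NLL per example

The formula has an important implication: the model can lower the quadratic penalty on large-error examples by increasing \(\sigma_i^2\), but it then pays the \(\log\sigma_i^2\) penalty. The variance head therefore learns an uncertainty trade-off, not a free knob that shrinks the loss arbitrarily.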
Predicting variance directly can produce negative or near-zero values. Predict logvar, clamp it if needed, and monitor both residual error and predicted uncertainty.
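A short worked step makes the trade-off concrete. The symbol \(r_i=y_i-\mu_i\) for the residual is introduced here only for illustration. Holding the residual fixed and minimizing the per-dimension NLL over \(\sigma_i^2\) gives
\[ \frac{\partial}{\partial\sigma_i^2} \left[ \frac{1}{2}\log\sigma_i^2 + \frac{r_i^2}{2\sigma_i^2} \right] = \frac{1}{2\sigma_i^2} - \frac{r_i^2}{2\sigma_i^4} = 0 \quad\Longrightarrow\quad \sigma_i^2 = r_i^2. \]
So for a fixed residual the best predicted variance equals the squared residual: inflating \(\sigma_i^2\) beyond \(r_i^2\) costs more in the log term than it saves in the quadratic term.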
Closed-Form Gaussian KL
VAEs and diffusion models often need the KL divergence between two diagonal Gaussians. If
\[ q(z)=\mathcal{N}(\mu_q,\operatorname{diag}(\sigma_q^2)), \qquad p(z)=\mathcal{N}(\mu_p,\operatorname{diag}(\sigma_p^2)), \]
then
\[ \operatorname{KL}(q\|p) = \frac{1}{2}\sum_i \left[ \log\frac{\sigma_{p,i}^2}{\sigma_{q,i}^2} + \frac{\sigma_{q,i}^2+(\mu_{q,i}-\mu_{p,i})^2}{\sigma_{p,i}^2} -1 \right]. \]
When \(p=\mathcal{N}(0,I)\), this reduces to the common form used later for the VAE.
To derive it, start from the definition:
\[ \operatorname{KL}(q\|p) = \mathbb{E}_q[\log q(z)-\log p(z)]. \]
The log density of a diagonal Gaussian is a sum over dimensions. For each dimension, use
\[ \mathbb{E}_q[(z_i-\mu_{q,i})^2]=\sigma_{q,i}^2, \]
and
\[ \mathbb{E}_q[(z_i-\mu_{p,i})^2] = \sigma_{q,i}^2+(\mu_{q,i}-\mu_{p,i})^2, \]
Substituting these into the two log densities and simplifying dimension by dimension gives the formula.
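The closed form is easy to sanity-check. The sketch below implements it for plain Python lists of means and variances and checks three properties: the KL of a distribution with itself is zero, the \(\mathcal{N}(0,I)\) case matches the simplified formula, and the KL is not symmetric. The function name `kl_diag_gaussians` and the numbers are illustrative assumptions.

```python
import math

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL(q || p) for diagonal Gaussians given as lists of means and variances."""
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        total += math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
    return 0.5 * total

mu, var = [0.3, -1.2], [0.5, 2.0]
zeros, ones = [0.0, 0.0], [1.0, 1.0]

kl_self = kl_diag_gaussians(mu, var, mu, var)        # q == p, should be 0
kl_std = kl_diag_gaussians(mu, var, zeros, ones)     # p = N(0, I)
kl_std_formula = 0.5 * sum(m * m + v - math.log(v) - 1.0 for m, v in zip(mu, var))
kl_reverse = kl_diag_gaussians(zeros, ones, mu, var) # arguments swapped
```

The asymmetry matters in practice: the VAE objective uses \(\operatorname{KL}(q\|p)\), and swapping the arguments gives a different penalty.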
EM as Variational Optimization
Many models contain a latent variable \(z\):
\[ p_\theta(x) = \sum_z p_\theta(x,z) \quad \text{or} \quad p_\theta(x) = \int p_\theta(x,z)\,dz. \]
Maximizing \(\log p_\theta(x)\) directly is often hard because a sum or integral sits inside the log. The core idea of EM is to introduce an auxiliary distribution \(q(z)\):
\[ \log p_\theta(x) = \log\int q(z)\frac{p_\theta(x,z)}{q(z)}dz. \]
By Jensen's inequality,
\[ \log p_\theta(x) \geq \mathbb{E}_{q(z)} \left[ \log p_\theta(x,z)-\log q(z) \right] = \mathcal{L}(q,\theta). \]
This lower bound is the ELBO (evidence lower bound).
For any distribution \(q(z)\), \[ \log p_\theta(x) = \mathcal{L}(q,\theta) + \operatorname{KL}\left(q(z)\|p_\theta(z\mid x)\right). \]
To see this, expand the KL term:
\[ \operatorname{KL}(q(z)\|p_\theta(z\mid x)) = \mathbb{E}_q[\log q(z)-\log p_\theta(z\mid x)]. \]
Substitute
\[ \log p_\theta(z\mid x) = \log p_\theta(x,z)-\log p_\theta(x), \]
to get
\[ \operatorname{KL}(q(z)\|p_\theta(z\mid x)) = \mathbb{E}_q[\log q(z)-\log p_\theta(x,z)] +\log p_\theta(x). \]
Rearranging gives
\[ \log p_\theta(x) = \mathbb{E}_q[\log p_\theta(x,z)-\log q(z)] + \operatorname{KL}(q\|p_\theta(z\mid x)). \]
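The identity can be verified exactly on a tiny discrete model. The sketch below uses a hypothetical binary latent and one fixed observation; all probabilities are made-up illustrative values.

```python
import math

# Hypothetical toy model: binary latent z, one fixed observation x.
p_z = [0.3, 0.7]            # prior p(z)
p_x_given_z = [0.9, 0.2]    # likelihood p(x | z) of the observed x
joint = [pz * px for pz, px in zip(p_z, p_x_given_z)]   # p(x, z)
p_x = sum(joint)                                         # marginal p(x)
posterior = [j / p_x for j in joint]                     # exact p(z | x)

def elbo(q):
    """E_q[log p(x, z) - log q(z)] for a discrete q."""
    return sum(qz * (math.log(j) - math.log(qz)) for qz, j in zip(q, joint))

def kl(q, p):
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p))

q = [0.5, 0.5]                      # an arbitrary variational distribution
gap = math.log(p_x) - elbo(q)       # should equal KL(q || p(z | x))
```

Setting `q = posterior` closes the gap, which is exactly what makes the bound tight.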
EM alternates between two steps:
| Step | Operation | Meaning |
|---|---|---|
| E-step | \(q^{t+1}(z)=p_{\theta^t}(z\mid x)\) | make bound tight |
| M-step | \(\theta^{t+1}=\arg\max_\theta\mathcal{L}(q^{t+1},\theta)\) | optimize parameters under expected latent assignments |
A VAE can be viewed as amortized variational EM: instead of solving for \(q(z)\) separately for each example, an encoder network \(q_\phi(z\mid x)\) predicts the variational posterior directly.
Worked Example: Gaussian Mixture EM
A Gaussian mixture model (GMM) assumes each example \(x_n\) comes from a hidden class \(z_n\in\{1,\ldots,K\}\):
\[ p(x_n,z_n=k) = \pi_k\mathcal{N}(x_n;\mu_k,\Sigma_k), \qquad \sum_k\pi_k=1. \]
The marginal likelihood of an observation is
\[ p(x_n) = \sum_{k=1}^K \pi_k\mathcal{N}(x_n;\mu_k,\Sigma_k). \]
The log-likelihood contains \(\log\sum_k\), which is awkward to optimize directly. The E-step of EM computes the responsibilities:
\[ \gamma_{nk} = p(z_n=k\mid x_n) = \frac{ \pi_k\mathcal{N}(x_n;\mu_k,\Sigma_k) }{ \sum_{j=1}^K \pi_j\mathcal{N}(x_n;\mu_j,\Sigma_j) }. \]
The M-step updates the parameters with these soft assignments. Let
\[ N_k=\sum_{n=1}^N\gamma_{nk}, \]
Then:
\[ \pi_k^{\text{new}}=\frac{N_k}{N}, \qquad \mu_k^{\text{new}} = \frac{1}{N_k} \sum_n\gamma_{nk}x_n, \]
\[ \Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_n \gamma_{nk} (x_n-\mu_k^{\text{new}}) (x_n-\mu_k^{\text{new}})^\top. \]
In mixture models, a responsibility \(\gamma_{nk}\) is the posterior probability that component \(k\) generated example \(x_n\) under the current parameters.
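The E-step and M-step updates above can be sketched for a one-dimensional mixture in plain Python. The data, the initialization, and the helper names `log_normal` and `em_step` are assumptions chosen for illustration; this is a minimal sketch without the variance floors or restarts a production implementation would likely need.

```python
import math
import random

def log_normal(x, mu, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def em_step(xs, pi, mu, var):
    """One EM iteration for a 1-D GMM; also returns the log-likelihood at the input parameters."""
    K, N = len(pi), len(xs)
    resp, loglik = [], 0.0
    for x in xs:  # E-step, computed in log space
        logits = [math.log(pi[k]) + log_normal(x, mu[k], var[k]) for k in range(K)]
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        resp.append([math.exp(l - log_norm) for l in logits])
        loglik += log_norm
    Nk = [sum(r[k] for r in resp) for k in range(K)]  # M-step: soft counts
    new_pi = [Nk[k] / N for k in range(K)]
    new_mu = [sum(r[k] * x for r, x in zip(resp, xs)) / Nk[k] for k in range(K)]
    new_var = [sum(r[k] * (x - new_mu[k]) ** 2 for r, x in zip(resp, xs)) / Nk[k]
               for k in range(K)]
    return new_pi, new_mu, new_var, loglik

rng = random.Random(0)
xs = [rng.gauss(-2.0, 0.5) for _ in range(200)] + [rng.gauss(3.0, 1.0) for _ in range(200)]

pi, mu, var = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]  # arbitrary initialization
logliks = []
for _ in range(30):
    pi, mu, var, ll = em_step(xs, pi, mu, var)
    logliks.append(ll)
```

On this well-separated synthetic data the component means should move toward the two generating means, and the recorded log-likelihood should not decrease from one iteration to the next.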
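The connection to neural network training is that EM shows how latent assignments can be treated as soft targets. A VAE replaces exact responsibilities with an encoder network; an MoE router can also be viewed as a neural version of expert assignment, although its training objective and constraints differ.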
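In a numerical implementation, responsibilities should be computed in log space:

    import torch

    def gmm_responsibility(log_pi, log_prob):
        # log_pi: [K], log_prob: [N, K]
        logits = log_prob + log_pi[None, :]   # unnormalized log posterior, [N, K]
        return torch.softmax(logits, dim=-1)  # normalize over components k

log_prob itself is usually computed with a stable Gaussian log-density. Do not compute tiny densities first and then divide; with high-dimensional Gaussians they underflow to 0.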
Responsibilities are not ground-truth labels. They are posterior targets under the current model, so bad initialization can produce bad soft assignments.
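The underflow problem is easy to reproduce in plain Python. In the sketch below, the two log joint values are made-up numbers of the magnitude that high-dimensional Gaussians can produce, and both helper names are illustrative.

```python
import math

def responsibilities_naive(log_joint):
    dens = [math.exp(l) for l in log_joint]   # underflows to 0.0 for very negative logs
    total = sum(dens)
    return [d / total for d in dens]          # 0.0 / 0.0 when everything underflowed

def responsibilities_stable(log_joint):
    m = max(log_joint)                        # log-sum-exp shift
    shifted = [math.exp(l - m) for l in log_joint]
    total = sum(shifted)
    return [s / total for s in shifted]

log_joint = [-1000.0, -1001.0]   # log(pi_k) + log N(x; mu_k, Sigma_k)

stable = responsibilities_stable(log_joint)
try:
    naive = responsibilities_naive(log_joint)
except ZeroDivisionError:
    naive = None                 # both densities underflowed to 0.0
```

The stable version only depends on the difference between the log values, so it returns roughly 0.73 and 0.27 here while the naive version fails.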
Amortization Gap
In classical EM, the E-step can optimize a separate \(q_n(z)\) for each example. A VAE instead uses one shared encoder \(q_\phi(z\mid x)\). This is called amortization: a trained network produces posterior approximations quickly.
Amortized inference uses a learned function to map observations to approximate posterior parameters, replacing per-example variational optimization.
The resulting inference error splits into two gaps:
| Gap | Meaning |
|---|---|
| approximation gap | variational family cannot express true posterior |
| amortization gap | encoder network does not find the best member of that family for each sample |
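So the inference error of a VAE comes not only from an overly simple Gaussian posterior, but possibly also from the encoder failing to predict the optimal \(\mu_\phi(x),\sigma_\phi(x)\). Later work on normalizing-flow posteriors, iterative inference, and diffusion posterior approximation all address these gaps.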
Bayesian Networks
A Bayesian network is a directed acyclic graph whose nodes are random variables and whose joint distribution factorizes as \[ p(x_1,\ldots,x_n) = \prod_{i=1}^n p(x_i\mid \operatorname{pa}(x_i)). \]
The graph structure encodes conditional independence. For example,

    z -> x

means:
\[ p(x,z)=p(z)p(x\mid z). \]
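VAEs, diffusion models, and autoregressive LMs can all be viewed as probabilistic graphical models whose conditional distributions are parameterized by neural networks: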
| Model | Factorization | Neural network role |
|---|---|---|
| VAE | \(p(z)p_\theta(x\mid z)\) | decoder gives likelihood parameters |
| autoregressive LM | \(\prod_t p_\theta(x_t\mid x_{<t})\) | Transformer gives next-token logits |
| diffusion | \(p(x_T)\prod_t p_\theta(x_{t-1}\mid x_t)\) | denoiser predicts reverse transition |
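Deep learning gives probabilistic graphical models powerful nonlinear conditional distributions; probabilistic graphical models give deep learning a language for latent variables, uncertainty, and factorization.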
Conditional Independence Patterns
The graph of a Bayesian network is not decoration; it defines the conditional independence structure. The three most common patterns are:
| Pattern | Graph | Independence intuition |
|---|---|---|
| chain | \(a\to b\to c\) | given \(b\), \(a\) and \(c\) are independent |
| fork | \(a\leftarrow b\to c\) | given common cause \(b\), children are independent |
| collider | \(a\to b\leftarrow c\) | conditioning on \(b\) can make \(a\) and \(c\) dependent |
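The collider row is the least intuitive of the three, so here is a small exact check by enumeration. The setup is a hypothetical example: two independent fair-coin causes and a deterministic OR as the common effect.

```python
from itertools import product

# Hypothetical collider a -> b <- c with independent fair-coin causes and b = a OR c.
p_a = {0: 0.5, 1: 0.5}
p_c = {0: 0.5, 1: 0.5}

def p_b_given(b, a, c):
    return 1.0 if b == (a | c) else 0.0

# Joint from the Bayesian-network factorization p(a) p(c) p(b | a, c).
joint = {(a, b, c): p_a[a] * p_c[c] * p_b_given(b, a, c)
         for a, b, c in product((0, 1), repeat=3)}

def prob(**fixed):
    """Sum the joint over all entries consistent with the fixed values."""
    return sum(p for (a, b, c), p in joint.items()
               if all({"a": a, "b": b, "c": c}[k] == v for k, v in fixed.items()))

p_a1 = prob(a=1)                                          # 0.5
p_a1_given_c1 = prob(a=1, c=1) / prob(c=1)                # 0.5, a and c independent
p_a1_given_b1 = prob(a=1, b=1) / prob(b=1)                # 2/3
p_a1_given_b1_c1 = prob(a=1, b=1, c=1) / prob(b=1, c=1)   # 0.5, c explains b away
```

Without conditioning, learning \(c\) says nothing about \(a\). Once \(b=1\) is observed, learning \(c=1\) lowers the probability of \(a=1\) from 2/3 back to 1/2, so \(a\) and \(c\) are dependent given \(b\).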
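For example, the latent variable model

    z -> x1
    z -> x2

has the factorization
\[ p(z,x_1,x_2) = p(z)p(x_1\mid z)p(x_2\mid z). \]
It implies
\[ x_1\perp x_2\mid z. \]
Marginally, however, \(x_1\) and \(x_2\) are usually not independent, because they share the same latent cause. A VAE decoder often assumes
\[ p_\theta(x\mid z) = \prod_i p_\theta(x_i\mid z), \]
that is, the pixels or dimensions are conditionally independent given \(z\). This assumption keeps the likelihood tractable, but it also helps explain why simple VAE decoders tend to produce blurry samples: complex local correlations have to be pushed into \(z\) or the decoder mean, while the observation model itself is weak.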
Writing \(p_\theta(x\mid z)=\prod_i p_\theta(x_i\mid z)\) makes likelihood training tractable, but it assumes conditional independence that may be false for images, audio, or text.
AutoEncoder
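An AutoEncoder is a deterministic representation learning model:
\[ z=f_\phi(x), \qquad \hat{x}=g_\theta(z). \]
The training objective is reconstruction:
\[ \min_{\phi,\theta} \mathbb{E}_{x\sim p_{\text{data}}} \left[ \ell(x,g_\theta(f_\phi(x))) \right]. \]
With \(\ell=\|x-\hat{x}\|_2^2\), the AE learns a (typically low-dimensional) representation that preserves the information needed for reconstruction. It can perform nonlinear dimensionality reduction, but it does not automatically make the latent space something you can sample from. The reason is that the AE only requires latents near the encoded training examples to decode well; it does not require the whole latent space to have a sensible density.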
A deterministic autoencoder can reconstruct training-like inputs, but random latent samples may decode to invalid data unless the latent space is regularized.
Variational AutoEncoder
A VAE turns the encoder into a posterior approximation:
\[ q_\phi(z\mid x) = \mathcal{N}(\mu_\phi(x),\operatorname{diag}(\sigma_\phi(x)^2)). \]
The decoder defines the likelihood:
\[ p_\theta(x\mid z). \]
The goal is to maximize the ELBO:
\[ \mathcal{L}_{\text{VAE}}(x) = \mathbb{E}_{q_\phi(z\mid x)} [\log p_\theta(x\mid z)] - \operatorname{KL}(q_\phi(z\mid x)\|p(z)). \]
Training usually minimizes the negative ELBO:
\[ \mathcal{J}(x) = \underbrace{ -\mathbb{E}_{q_\phi(z\mid x)} [\log p_\theta(x\mid z)] }_{\text{reconstruction / negative log likelihood}} + \underbrace{ \operatorname{KL}(q_\phi(z\mid x)\|p(z)) }_{\text{latent regularization}}. \]
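If the prior is \(p(z)=\mathcal{N}(0,I)\) and the posterior is a diagonal Gaussian, the KL has a closed form:
\[ \operatorname{KL} \left( \mathcal{N}(\mu,\operatorname{diag}(\sigma^2)) \| \mathcal{N}(0,I) \right) = \frac{1}{2} \sum_i \left( \mu_i^2+\sigma_i^2-\log\sigma_i^2-1 \right). \]

    import torch

    def vae_kl(mu, logvar):
        # mu, logvar: [B, Dz]; returns one KL value per example, shape [B]
        return 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1)

VAE Loss in Code
The key implementation detail of the VAE loss is the reduction. The reconstruction term and the KL term are usually summed per example first and then averaged over the batch: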

    import torch.nn.functional as F

    def vae_loss_binary(x, logits, mu, logvar, beta):
        # x/logits: [B, ...], mu/logvar: [B, Dz]
        # Bernoulli NLL per element, summed over all non-batch dimensions -> [B]
        recon = F.binary_cross_entropy_with_logits(
            logits,
            x,
            reduction="none",
        ).flatten(1).sum(dim=1)
        kl = vae_kl(mu, logvar)  # [B], uses vae_kl defined earlier
        loss = recon + beta * kl
        return loss.mean(), {"recon": recon.mean(), "kl": kl.mean()}

Here binary_cross_entropy_with_logits corresponds to a Bernoulli likelihood:
\[ p_\theta(x\mid z) = \prod_i \operatorname{Bernoulli} \left(x_i;\sigma(a_i(z))\right). \]
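If images are scaled to \([0,1]\), using a Bernoulli likelihood is a modeling choice; if the data are continuous real values, a Gaussian likelihood/MSE is more natural. The two losses live on different numerical scales, so total loss values should not be compared directly across them.
Summing reconstruction loss over pixels/features and averaging over pixels define different likelihood scales. The KL/reconstruction balance changes if reduction changes.
A common mistake is to use .mean() for the reconstruction term and .sum() for the KL term, which unintentionally rescales the latent regularization by batch size, image size, or latent dimension. It is safer to write the reduction explicitly:

    per-feature loss -> per-example sum -> batch mean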
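A little arithmetic shows how large the effect can be. The numbers below are hypothetical per-example values chosen only for illustration; the point is that averaging the reconstruction term over \(D\) pixels while summing the KL is the same objective as the sum convention with \(\beta\) multiplied by \(D\).

```python
# Hypothetical per-example numbers, chosen only for illustration.
num_pixels = 28 * 28   # D features per image
recon_sum = 90.0       # reconstruction NLL summed over pixels
kl_sum = 12.0          # KL summed over latent dimensions
beta = 1.0

loss_sum = recon_sum + beta * kl_sum                  # per-example sum convention
loss_mean = recon_sum / num_pixels + beta * kl_sum    # recon averaged over pixels, KL summed

# Multiplying loss_mean by D exposes the objective it is really optimizing.
effective_beta = beta * num_pixels
loss_mean_rescaled = recon_sum + effective_beta * kl_sum
```

With 784 pixels, the mixed reduction behaves like \(\beta=784\), which is usually far stronger regularization than intended.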
Key differences between VAE and AE:
| Aspect | AE | VAE |
|---|---|---|
| encoder output | point \(z\) | distribution \(q_\phi(z\mid x)\) |
| latent regularization | optional | KL to prior |
| sampling | not guaranteed meaningful | sample \(z\sim p(z)\) |
| objective | reconstruction | likelihood lower bound |
Posterior Collapse
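Posterior collapse means
\[ q_\phi(z\mid x)\approx p(z), \]
that is, the latent variable carries almost no information about \(x\). The KL is then close to 0, the decoder ignores \(z\), and the model degenerates into an unconditional model carried by a strong (for example, autoregressive) decoder.
From the ELBO's point of view, a strong decoder can handle the reconstruction term on its own, while pushing \(q_\phi(z\mid x)\) toward \(p(z)\) reduces the KL penalty. Together they form a shortcut.
Common mitigations: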
| Method | Mechanism | Risk |
|---|---|---|
| KL annealing | \(\beta\) from 0 to 1 | schedule sensitive |
| free bits | allow minimum KL per latent group | threshold tuning |
| weaker decoder | force use of \(z\) | lower reconstruction quality |
| richer prior | match complex latent structure | harder training |
| hierarchical latents | distribute information across levels | model complexity |
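The objective with \(\beta\):
\[ \mathcal{J}_\beta(x) = -\mathbb{E}_{q_\phi(z\mid x)}[\log p_\theta(x\mid z)] + \beta \operatorname{KL}(q_\phi(z\mid x)\|p(z)). \]
Early in training \(\beta\approx0\), which encourages the model to first learn to reconstruct through \(z\); later \(\beta\to1\), which pulls the latent space back toward the prior so that it can be sampled.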
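One simple way to realize such a schedule is a linear warm-up, sketched below. The function name `linear_beta`, the `warmup_steps` parameter, and the choice of a linear ramp are assumptions for illustration; cyclical or sigmoid ramps are also used in practice.

```python
def linear_beta(step, warmup_steps, beta_max=1.0):
    """Linearly increase beta from 0 to beta_max over warmup_steps, then hold."""
    if warmup_steps <= 0:
        return beta_max
    return beta_max * min(1.0, step / warmup_steps)

schedule = [linear_beta(s, warmup_steps=1000) for s in (0, 250, 500, 1000, 5000)]
# schedule == [0.0, 0.25, 0.5, 1.0, 1.0]
```

The returned value would be passed as the `beta` weight on the KL term at each training step.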
Mutual Information View of Collapse
Let the data distribution be \(p_{\text{data}}(x)\). The encoder then defines a joint distribution:
\[ q_\phi(x,z) = p_{\text{data}}(x)q_\phi(z\mid x). \]
The aggregated posterior is
\[ q_\phi(z) = \int p_{\text{data}}(x)q_\phi(z\mid x)dx. \]
The expected KL term of the VAE decomposes as
\[ \mathbb{E}_{p_{\text{data}}(x)} \operatorname{KL}(q_\phi(z\mid x)\|p(z)) = I_q(x;z) + \operatorname{KL}(q_\phi(z)\|p(z)). \]
The expected VAE posterior KL equals the mutual information between data and latent variables under \(q_\phi(x,z)\) plus a KL between the aggregated posterior and the prior.
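To derive this, write \(p(x)\) for \(p_{\text{data}}(x)\), drop the subscript \(\phi\), and expand the left-hand side:
\[ \mathbb{E}_{p(x)} \mathbb{E}_{q(z\mid x)} \left[ \log\frac{q(z\mid x)}{p(z)} \right]. \]
Multiply and divide by \(q(z)\):
\[ \log\frac{q(z\mid x)}{p(z)} = \log\frac{q(z\mid x)}{q(z)} + \log\frac{q(z)}{p(z)}. \]
The joint expectation of the first term is the mutual information:
\[ I_q(x;z) = \mathbb{E}_{q(x,z)} \left[ \log\frac{q(z\mid x)}{q(z)} \right]. \]
After integrating out \(x\), the second term becomes
\[ \mathbb{E}_{q(z)} \left[ \log\frac{q(z)}{p(z)} \right] = \operatorname{KL}(q(z)\|p(z)). \]
Adding the two terms gives the result.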
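The decomposition can be checked exactly on a small discrete example. Everything below is a hypothetical toy setup with two data points and two latent states; the probabilities are illustrative.

```python
import math

# Hypothetical toy setup: 2 data points, 2 latent states.
p_data = [0.4, 0.6]                       # p_data(x)
q_z_given_x = [[0.9, 0.1], [0.2, 0.8]]    # q(z | x), one row per x
prior = [0.5, 0.5]                        # p(z)

def kl(q, p):
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

# Aggregated posterior q(z) = sum_x p_data(x) q(z | x).
q_z = [sum(p_data[x] * q_z_given_x[x][z] for x in range(2)) for z in range(2)]

expected_kl = sum(p_data[x] * kl(q_z_given_x[x], prior) for x in range(2))
mutual_info = sum(p_data[x] * kl(q_z_given_x[x], q_z) for x in range(2))  # I_q(x; z)
marginal_kl = kl(q_z, prior)                                              # KL(q(z) || p(z))
```

Here the aggregated posterior is almost exactly the prior, so nearly all of the expected KL is mutual information, which is the healthy regime for a generative model.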
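This shows that posterior collapse is more than "the KL is small". Both terms are non-negative, so a small expected KL means two things hold at once:
- \(I_q(x;z)\) is small: the latent carries almost no information about the input;
- \(q(z)\) is close to \(p(z)\): the aggregated posterior matches the prior.
A generative model wants the second to hold but does not want the first to vanish entirely. Free bits, KL annealing, restricting skip connections, and weakening the decoder are all, in essence, ways of leaving room for \(I_q(x;z)\).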
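One way to monitor this is to look at the KL per latent dimension (or per latent group):

    def kl_per_dim(mu, logvar):
        # elementwise KL to N(0, I); shape [B, Dz]
        return 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1)

    # mu, logvar: [B, Dz] encoder outputs for one batch
    kl_dim = kl_per_dim(mu, logvar).mean(dim=0)  # batch-average KL per dimension, [Dz]
    active_units = (kl_dim > 0.01).sum()

When active_units is small, most latent dimensions are carrying no information. This is not a strict mutual information estimate, but it is a useful training diagnostic.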
Universal Approximation
Under suitable conditions, a feed-forward neural network with a non-polynomial activation and sufficiently many hidden units can approximate any continuous function on a compact subset of \(\mathbb{R}^d\) arbitrarily well.
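More formally, for a compact set \(K\subset\mathbb{R}^d\), a continuous function \(f:K\to\mathbb{R}\), and any \(\epsilon>0\), there exists a sufficiently wide network \(g_\theta\) such that
\[ \sup_{x\in K}|f(x)-g_\theta(x)|<\epsilon. \]
The theorem says neural networks are highly expressive, but it does not by itself answer three engineering questions:
- how many hidden units are needed;
- whether SGD can find the corresponding parameters;
- whether the network generalizes from finite data.
So the UAT is a theorem about expressive power, not about training success or generalization.
What makes deep learning interesting is that "some network can represent it" is not enough: the architecture also needs a good inductive bias, must be findable by the optimizer, and must generalize to unseen data.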
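For a scalar input, the existence claim can be made concrete by setting the weights of a one-hidden-layer ReLU network by hand, with no training. The sketch below builds the piecewise-linear interpolant of a target on \([0,1]\); the target function, the uniform knots, and the helper names are illustrative assumptions, and this construction is only one simple way to exhibit an approximating network.

```python
import math

def relu(v):
    return max(0.0, v)

def build_relu_interpolant(f, n_hidden):
    """Hand-set weights of a 1-hidden-layer ReLU net that interpolates f on a uniform grid of [0, 1]."""
    knots = [i / n_hidden for i in range(n_hidden + 1)]
    slopes = [(f(knots[i + 1]) - f(knots[i])) * n_hidden for i in range(n_hidden)]
    # Output weight of unit i is the change in slope at knot i.
    coeffs = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n_hidden)]
    bias = f(0.0)

    def g(x):
        # zip stops after n_hidden terms, so the last knot (x = 1) is not a hidden unit
        return bias + sum(c * relu(x - t) for c, t in zip(coeffs, knots))

    return g

def sup_error(f, g, n_eval=2001):
    return max(abs(f(i / (n_eval - 1)) - g(i / (n_eval - 1))) for i in range(n_eval))

target = lambda x: math.sin(2 * math.pi * x)
errors = {n: sup_error(target, build_relu_interpolant(target, n)) for n in (4, 16, 64)}
```

The measured sup error shrinks as the width grows, which is the expressiveness statement in miniature. Note that the weights were written down directly; nothing here shows that SGD would find them.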
Depth, Composition, and Inductive Bias
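Universal approximation is often misread as "one sufficiently wide MLP is all you need". The correct reading is that a wide network can represent the function, which says nothing about how efficiently it does so. Many real functions have compositional structure:
\[ f(x) = f_3(f_2(f_1(x))). \]
A deep network encodes this structure directly in its architecture:
\[ h_1=f_1(x), \qquad h_2=f_2(h_1), \qquad y=f_3(h_2). \]
The local connectivity and translation weight sharing of CNNs assume images have local texture; the sequential structure of RNNs and position-aware Transformers assumes token order matters; the permutation equivariance of GNNs assumes node numbering should not change the function. These are all inductive biases.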
An inductive bias is a modeling preference that makes some functions easier to represent, learn, or generalize than others before seeing the specific training data.
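So the theoretical statement "an MLP can approximate any continuous function" does not mean that, in engineering practice, an MLP should be used for every problem. Choosing an architecture is a trade-off among expressive power, sample efficiency, optimization stability, and computational efficiency.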
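A classic toy case of compositional efficiency is the tent map, sketched below as an illustration. Composing the same two-unit ReLU layer \(d\) times yields a sawtooth with \(2^d\) linear pieces, while a one-hidden-layer ReLU network on a scalar input with \(m\) units has at most \(m+1\) pieces. The function names are hypothetical.

```python
def relu(v):
    return max(0.0, v)

def tent(x):
    """Tent map on [0, 1] written with two ReLU units: 2*relu(x) - 4*relu(x - 0.5)."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_tent(x, depth):
    for _ in range(depth):   # compose the same 2-unit layer `depth` times
        x = tent(x)
    return x

def count_linear_pieces(depth):
    """Count slope sign changes on the dyadic grid where the kinks of the composed map sit."""
    n = 2 ** depth
    values = [deep_tent(i / n, depth) for i in range(n + 1)]
    rising = [values[i + 1] > values[i] for i in range(n)]
    return 1 + sum(rising[i] != rising[i + 1] for i in range(n - 1))

pieces = {d: count_linear_pieces(d) for d in (1, 2, 3, 4)}
# pieces == {1: 2, 2: 4, 3: 8, 4: 16}
```

So \(2d\) ReLU units arranged in depth produce \(2^d\) pieces, whereas a single hidden layer would need on the order of \(2^d\) units for the same function.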
Approximation, Estimation, and Optimization Errors
The error of a trained model can be roughly split into three parts:
| Error | Source | Reduced by |
|---|---|---|
| approximation error | model class cannot express target well | better architecture / larger model |
| estimation error | finite data causes overfit or uncertainty | more data / regularization |
| optimization error | training does not find good parameters | better optimizer / schedule / initialization |
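The UAT only addresses the first kind, and only under its own conditions: sufficiently large width, a suitable activation, and a continuous function on a compact set. Deep learning practice has to handle all three at once. This is also why "a bigger model" sometimes solves the problem and sometimes only overfits or becomes harder to optimize.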
Implementation Checklist
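When reading or writing a theoretical model, check at least the following:
- whether latent random variables and observed variables are clearly distinguished;
- whether the joint distribution can be written as a factorization;
- whether the latent posterior is exact, variational, or amortized;
- whether the loss is a likelihood, an ELBO, a reconstruction error, or a surrogate;
- whether the Gaussian covariance assumption is isotropic, diagonal, or full;
- whether the reparameterization keeps a differentiable path;
- whether the direction and sign of the KL term are correct;
- whether an expressiveness theorem is being misused as an optimization or generalization guarantee;
- whether the reduction of the reconstruction loss matches the intended likelihood scale;
- whether the latent KL is reduced explicitly over example, dimension, and batch;
- whether posterior collapse is monitored with KL, active units, or a mutual-information proxy;
- whether the factorization assumption implies unreasonable conditional independence;
- when the model fails, whether the cause is approximation, estimation, or optimization error.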