0.2 Boltzmann Machine
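The Boltzmann Machine pushes the deterministic dynamics of the Hopfield network toward probabilistic modeling. The core of a Hopfield network is "descend in energy until the state settles into an attractor"; the core of a Boltzmann Machine is "the energy defines a probability distribution, and learning means making the low-energy regions cover the data".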
From Energy to Probability
An energy-based model assigns each state \(\mathbf{x}\) an energy \(E_\theta(\mathbf{x})\) and defines a probability distribution \[ p_\theta(\mathbf{x}) =\frac{\exp(-E_\theta(\mathbf{x}))}{Z_\theta}, \qquad Z_\theta=\sum_{\mathbf{x}}\exp(-E_\theta(\mathbf{x})). \] The partition function \(Z_\theta\) normalizes the distribution.
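Here, the smaller \(E_\theta(\mathbf{x})\) is, the larger \(p_\theta(\mathbf{x})\) is. The idea closely mirrors the Boltzmann distribution in physics: the system prefers low-energy states, but temperature and random perturbations let it occasionally visit high-energy ones.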
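The definition can be tried out on a state space small enough to enumerate. The following is a minimal sketch, assuming four states with made-up energies; `boltzmann_probs` is a hypothetical helper name, not part of the original notes.

```python
import torch

def boltzmann_probs(energies):
    # p(x) = exp(-E(x)) / Z over an enumerated state space.
    # softmax(-E) evaluates this with the usual max-subtraction for stability.
    return torch.softmax(-energies, dim=0)

# Hypothetical energies for four states (illustrative values only).
energies = torch.tensor([0.0, 1.0, 2.0, 5.0])
probs = boltzmann_probs(energies)

# log Z computed explicitly, to show it matches the softmax normalization.
log_z = torch.logsumexp(-energies, dim=0)
probs_from_log_z = torch.exp(-energies - log_z)
```

The lowest-energy state receives the largest probability, and the highest-energy state still keeps a small nonzero probability.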
A Boltzmann Machine typically contains visible units \(\mathbf{v}\) and hidden units \(\mathbf{h}\):
\[ \mathbf{v}\in\{0,1\}^{D}, \qquad \mathbf{h}\in\{0,1\}^{K}. \]
The visible units correspond to the data, and the hidden units to unobserved explanatory factors. The energy function can be written as
\[ E(\mathbf{v},\mathbf{h}) = -\mathbf{a}^{\top}\mathbf{v} -\mathbf{b}^{\top}\mathbf{h} -\mathbf{v}^{\top}W\mathbf{h} -\frac{1}{2}\mathbf{v}^{\top}U\mathbf{v} -\frac{1}{2}\mathbf{h}^{\top}V\mathbf{h}. \]
A Restricted Boltzmann Machine (RBM) removes the visible-visible and hidden-hidden connections, i.e. sets \(U=0, V=0\), which yields a bipartite graph structure:
\[ E(\mathbf{v},\mathbf{h}) = -\mathbf{a}^{\top}\mathbf{v} -\mathbf{b}^{\top}\mathbf{h} -\mathbf{v}^{\top}W\mathbf{h}. \]
Parameter Shapes and Sign Convention
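This section uses the binary \(\{0,1\}\) representation rather than the \(\{-1,+1\}\) spin representation from the Hopfield page. With this choice, the RBM conditional probabilities reduce directly to sigmoids. Tensor shapes used throughout: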
v: [B, D]
h: [B, K]
a: [D] visible bias
b: [K] hidden bias
W: [D, K]
For a batch, the energy can be written as:
def rbm_energy(v, h, a, b, w):
    # v: [B, D], h: [B, K], a: [D], b: [K], w: [D, K]
    vbias = v @ a  # [B], visible bias term a^T v
    hbias = h @ b  # [B], hidden bias term b^T h
    interaction = ((v @ w) * h).sum(dim=-1)  # [B], v^T W h per sample
    return -(vbias + hbias + interaction)
Some texts define \(E=+\mathbf{v}^\top W\mathbf{h}-\mathbf{a}^\top\mathbf{v}-\mathbf{b}^\top\mathbf{h}\) or use \(\{-1,+1\}\) units. Always rederive the conditional probability after changing conventions; otherwise the update direction can silently flip.
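As a worked instance of that warning (an illustration, not part of the original derivation): suppose the units were \(\{-1,+1\}\)-valued while the RBM energy kept the bilinear form used here. Writing \(z_j=b_j+\sum_i W_{ij}v_i\), the only \(h_j\)-dependent term is \(-h_j z_j\), so normalizing over the two values of \(h_j\) gives

\[ p(h_j=+1\mid\mathbf{v}) = \frac{e^{z_j}}{e^{z_j}+e^{-z_j}} = \sigma(2z_j), \qquad z_j=b_j+\sum_i W_{ij}v_i. \]

Reusing the \(\{0,1\}\) expression \(\sigma(z_j)\) in that setting would silently halve every logit.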
Conditional Independence in RBM
RBMs matter because they trade some expressive power for very clean conditional distributions.
For an RBM with binary units, \[ p(h_j=1\mid \mathbf{v})=\sigma\left(b_j+\sum_i W_{ij}v_i\right), \] and \[ p(v_i=1\mid \mathbf{h})=\sigma\left(a_i+\sum_j W_{ij}h_j\right). \] Thus all hidden units are conditionally independent given \(\mathbf{v}\), and all visible units are conditionally independent given \(\mathbf{h}\).
Proof
With \(\mathbf{v}\) fixed, the RBM energy depends on \(\mathbf{h}\) only through
\[ E(\mathbf{v},\mathbf{h}) =-\sum_j h_j \left(b_j+\sum_i W_{ij}v_i\right) +\text{const}(\mathbf{v}). \]
Therefore
\[ p(\mathbf{h}\mid \mathbf{v}) \propto \prod_j \exp\left( h_j \left(b_j+\sum_i W_{ij}v_i\right) \right), \]
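which factorizes over \(j\). Normalizing each binary factor gives the sigmoid: with \(z_j=b_j+\sum_i W_{ij}v_i\), \(p(h_j=1\mid\mathbf{v})=e^{z_j}/(1+e^{z_j})=\sigma(z_j)\). The argument for \(p(\mathbf{v}\mid\mathbf{h})\) is symmetric.
This is also why RBMs were so popular in the early days of deep learning: unlike a general Boltzmann Machine, where sampling is prohibitively hard, an RBM allows alternating block sampling of \(\mathbf{h}\sim p(\mathbf{h}\mid\mathbf{v})\) and \(\mathbf{v}\sim p(\mathbf{v}\mid\mathbf{h})\).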
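The factorization can be checked numerically on a tiny RBM. The sketch below assumes hypothetical sizes (D=4, K=3) and random parameters; `hidden_posterior_by_enumeration` is an illustrative helper, not from the original notes. It enumerates all \(2^K\) hidden states, normalizes \(\exp(-E)\) directly, and compares the result with the sigmoid formula and with the product of independent Bernoulli factors.

```python
import itertools
import torch

def hidden_posterior_by_enumeration(v, a, b, w):
    # Enumerate all 2^K hidden states and normalize exp(-E(v, h)) over h.
    k = b.numel()
    h_all = torch.tensor(list(itertools.product([0.0, 1.0], repeat=k)))  # [2^K, K]
    # RBM energy with v fixed; the a^T v term is constant in h.
    energy = -(v @ a) - h_all @ b - h_all @ (v @ w)  # [2^K]
    p_h = torch.softmax(-energy, dim=0)  # p(h | v) for every hidden state
    return h_all, p_h

torch.manual_seed(0)
D, K = 4, 3  # hypothetical tiny sizes
a, b, w = torch.randn(D), torch.randn(K), torch.randn(D, K)
v = torch.tensor([1.0, 0.0, 1.0, 1.0])

h_all, p_h = hidden_posterior_by_enumeration(v, a, b, w)

# Marginal p(h_j = 1 | v): total mass of hidden states with h_j = 1.
marginal = p_h @ h_all
closed_form = torch.sigmoid(b + v @ w)

# Joint probability predicted by conditional independence.
factorized = (closed_form ** h_all * (1 - closed_form) ** (1 - h_all)).prod(dim=1)
```

If the derivation is implemented correctly, `marginal` matches `closed_form` and `p_h` matches `factorized` up to floating-point error.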
Stable Conditional Sampling
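In an implementation, do not hand-write 1 / (1 + exp(-x)): exp can overflow for inputs of large magnitude. PyTorch's torch.sigmoid and torch.nn.functional.softplus handle these cases more stably.
import torch
def sample_bernoulli(prob):
    # Draw independent {0, 1} samples with the given success probabilities.
    return torch.bernoulli(prob)
def sample_h_given_v(v, b, w):
    # v: [B, D], b: [K], w: [D, K] -> probabilities and samples of shape [B, K]
    logits = b + v @ w
    prob = torch.sigmoid(logits)
    return prob, sample_bernoulli(prob)
def sample_v_given_h(h, a, w):
    # h: [B, K], a: [D], w: [D, K] -> probabilities and samples of shape [B, D]
    logits = a + h @ w.T
    prob = torch.sigmoid(logits)
    return prob, sample_bernoulli(prob)
In practice, probabilities are sometimes used for the training statistics while binary samples are used for sampling: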
| Quantity | Used for |
|---|---|
| \(p(h_j=1\mid v)\) | positive-phase expectation |
| sampled \(h_j\) | Gibbs chain transition |
| \(p(v_i=1\mid h)\) | reconstruction probability |
| sampled \(v_i\) | negative sample state |
For gradients, using probabilities can reduce variance in the positive phase. For Markov-chain transitions, the state must usually be sampled, otherwise the chain becomes deterministic mean-field dynamics.
Maximum Likelihood Learning
For a single data point \(\mathbf{v}\), the marginal probability is
\[ p_\theta(\mathbf{v}) = \sum_{\mathbf{h}} \frac{\exp(-E_\theta(\mathbf{v},\mathbf{h}))}{Z_\theta}. \]
The learning objective is to maximize the log-likelihood:
\[ \mathcal{L}(\theta) =\sum_{n=1}^{N}\log p_\theta(\mathbf{v}^{(n)}). \]
Differentiating with respect to the weight \(W_{ij}\) gives a very important form:
\[ \frac{\partial \log p_\theta(\mathbf{v})}{\partial W_{ij}} = \mathbb{E}_{p_\theta(\mathbf{h}\mid\mathbf{v})}[v_i h_j] - \mathbb{E}_{p_\theta(\mathbf{v},\mathbf{h})}[v_i h_j]. \]
The positive phase raises probability around observed data by increasing correlations under \(p_\theta(\mathbf{h}\mid\mathbf{v})\). The negative phase lowers probability assigned to model-generated samples by subtracting correlations under \(p_\theta(\mathbf{v},\mathbf{h})\).
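The intuition is clear: the first term says "visible-hidden relationships that frequently co-occur in the data should be strengthened"; the second says "configurations the model itself dreams up with too much confidence should be pulled back". This is the basic tension of energy-based learning.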
Proof
For a single data point \(\mathbf{v}\),
\[ \log p_\theta(\mathbf{v}) = \log\sum_{\mathbf{h}}\exp(-E_\theta(\mathbf{v},\mathbf{h})) - \log Z_\theta. \]
Differentiating with respect to the parameter \(\theta\):
\[ \frac{\partial}{\partial\theta} \log\sum_{\mathbf{h}}\exp(-E_\theta(\mathbf{v},\mathbf{h})) = - \mathbb{E}_{p_\theta(\mathbf{h}\mid\mathbf{v})} \left[ \frac{\partial E_\theta(\mathbf{v},\mathbf{h})}{\partial\theta} \right]. \]
Likewise,
\[ \frac{\partial}{\partial\theta}\log Z_\theta = - \mathbb{E}_{p_\theta(\mathbf{v},\mathbf{h})} \left[ \frac{\partial E_\theta(\mathbf{v},\mathbf{h})}{\partial\theta} \right]. \]
Therefore
\[ \frac{\partial \log p_\theta(\mathbf{v})}{\partial\theta} = - \mathbb{E}_{p_\theta(\mathbf{h}\mid\mathbf{v})} \left[ \frac{\partial E_\theta}{\partial\theta} \right] + \mathbb{E}_{p_\theta(\mathbf{v},\mathbf{h})} \left[ \frac{\partial E_\theta}{\partial\theta} \right]. \]
For an RBM,
\[ \frac{\partial E}{\partial W_{ij}}=-v_ih_j. \]
Substituting gives
\[ \frac{\partial \log p_\theta(\mathbf{v})}{\partial W_{ij}} = \mathbb{E}_{p_\theta(\mathbf{h}\mid\mathbf{v})}[v_ih_j] - \mathbb{E}_{p_\theta(\mathbf{v},\mathbf{h})}[v_ih_j]. \]
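On a model small enough to enumerate, this identity can be compared against automatic differentiation of the exact log-likelihood. The following is a sketch under assumed sizes (D=3, K=2) with random parameters; `all_states` and `neg_energy` are hypothetical helpers introduced only for this check.

```python
import itertools
import torch

def all_states(n):
    # All 2^n binary vectors as rows of a float64 matrix.
    return torch.tensor(list(itertools.product([0.0, 1.0], repeat=n)), dtype=torch.float64)

def neg_energy(v, h, a, b, w):
    # -E(v, h) for every pair of rows of v and h: returns [Nv, Nh].
    return (v @ a)[:, None] + (h @ b)[None, :] + v @ w @ h.T

torch.manual_seed(0)
D, K = 3, 2  # hypothetical tiny sizes
a = torch.randn(D, dtype=torch.float64)
b = torch.randn(K, dtype=torch.float64)
w = torch.randn(D, K, dtype=torch.float64, requires_grad=True)
v_all, h_all = all_states(D), all_states(K)
v_data = torch.tensor([[1.0, 0.0, 1.0]], dtype=torch.float64)

# Exact log p(v) = log sum_h exp(-E(v, h)) - log Z, differentiated by autograd.
log_unnormalized = torch.logsumexp(neg_energy(v_data, h_all, a, b, w), dim=1).squeeze()
log_z = torch.logsumexp(neg_energy(v_all, h_all, a, b, w).reshape(-1), dim=0)
(autograd_grad,) = torch.autograd.grad(log_unnormalized - log_z, w)

with torch.no_grad():
    # Positive phase: E_{p(h|v)}[v h^T] = v sigma(b + W^T v)^T.
    positive = v_data.T @ torch.sigmoid(b + v_data @ w)
    # Negative phase: E_{p(v,h)}[v h^T] under the exact joint distribution.
    joint = torch.softmax(neg_energy(v_all, h_all, a, b, w).reshape(-1), dim=0)
    joint = joint.reshape(len(v_all), len(h_all))
    negative = v_all.T @ joint @ h_all
formula_grad = positive - negative
```

`autograd_grad` and `formula_grad` should agree to floating-point precision; a sign or transpose mistake in the learning rule shows up immediately in this comparison.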
Bias Gradients and Mini-Batch Estimates
Similarly, for the visible and hidden biases:
\[ \frac{\partial \log p_\theta(\mathbf{v})}{\partial a_i} = \mathbb{E}_{p_\theta(\mathbf{h}\mid\mathbf{v})}[v_i] - \mathbb{E}_{p_\theta(\mathbf{v},\mathbf{h})}[v_i], \]
\[ \frac{\partial \log p_\theta(\mathbf{v})}{\partial b_j} = \mathbb{E}_{p_\theta(\mathbf{h}\mid\mathbf{v})}[h_j] - \mathbb{E}_{p_\theta(\mathbf{v},\mathbf{h})}[h_j]. \]
For a mini-batch \(\{\mathbf{v}^{(n)}\}_{n=1}^B\), stacked as the rows of \(V^+\in\{0,1\}^{B\times D}\), the positive phase can compute the hidden probabilities exactly:
\[ P_H^+ = \sigma(\mathbf{1}b^\top + V^+ W), \]
so that
\[ \widehat{\nabla_W^+} = \frac1B (V^+)^\top P_H^+. \]
The negative phase uses a Gibbs chain to obtain \(V^-,H^-\) or the probabilities \(P_H^-\):
\[ \widehat{\nabla_W^-} = \frac1B (V^-)^\top P_H^-. \]
The parameter update direction is
\[ \Delta W \propto \widehat{\nabla_W^+}-\widehat{\nabla_W^-}. \]
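def rbm_gradients(v_pos, v_neg, a, b, w):
    # v_pos: data batch [B, D]; v_neg: negative samples [B, D].
    # Note: a does not enter these estimates; only b and w are needed.
    ph_pos = torch.sigmoid(b + v_pos @ w)  # [B, K]
    ph_neg = torch.sigmoid(b + v_neg @ w)  # [B, K]
    batch = v_pos.shape[0]
    # Positive minus negative statistics: an ascent direction on the log-likelihood.
    grad_w = (v_pos.T @ ph_pos - v_neg.T @ ph_neg) / batch
    grad_a = (v_pos - v_neg).mean(dim=0)
    grad_b = (ph_pos - ph_neg).mean(dim=0)
    return grad_a, grad_b, grad_w
Note that this code is a hand-written learning rule, not a standard autograd loss. The difficulty of RBM training is precisely that the negative phase needs samples from the model distribution, rather than a supervised loss in an ordinary forward graph.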
Partition Function and Intractability
The difficulty lies in \(Z_\theta\):
\[ Z_\theta = \sum_{\mathbf{v},\mathbf{h}} \exp(-E_\theta(\mathbf{v},\mathbf{h})). \]
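If the visible and hidden units together comprise \(D+K\) binary variables, there are \(2^{D+K}\) joint states. Exact maximum likelihood therefore quickly becomes infeasible. Many difficulties of modern generative models already appear here in embryonic form: the model distribution can be written down, but normalization, sampling, or the likelihood gradient is hard.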
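To make the growth concrete, the count can be evaluated for two sizes. The sizes below are assumptions chosen for illustration (a toy model, and a hypothetical RBM with 784 visible and 500 hidden units); `num_joint_states` is an illustrative helper.

```python
def num_joint_states(num_visible, num_hidden):
    # Each additional binary unit doubles the number of joint configurations.
    return 2 ** (num_visible + num_hidden)

# Toy model: 2^(4+3) = 128 states, trivially enumerable.
small = num_joint_states(4, 3)

# Hypothetical sizes: 784 visible units (e.g. 28x28 binary pixels), 500 hidden units.
large = num_joint_states(784, 500)
num_digits = len(str(large))  # decimal digits of 2^1284

print(small, num_digits)
```

The toy model has 128 joint states, while the second count is a number with several hundred decimal digits, far beyond anything that can be summed term by term.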
Exact Partition Function for Tiny RBM
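Small models can be enumerated exactly to validate an implementation:
def all_binary_states(n, device):
    # Row i holds the n-bit binary expansion of i, least-significant bit first.
    values = torch.arange(2 ** n, device=device)
    bits = ((values[:, None] >> torch.arange(n, device=device)) & 1).float()
    return bits
def exact_log_partition(a, b, w):
    v_all = all_binary_states(a.numel(), a.device)  # [2^D, D]
    h_all = all_binary_states(b.numel(), b.device)  # [2^K, K]
    energies = []
    for h in h_all:
        # Pair this hidden state with every visible state.
        h_batch = h.expand(v_all.shape[0], -1)
        energies.append(rbm_energy(v_all, h_batch, a, b, w))
    e = torch.cat(energies)  # [2^(D+K)]
    return torch.logsumexp(-e, dim=0)
This function is only usable for very small \(D+K\). Its value is in testing: checking whether the free energy, the Gibbs sampler, and the CD update direction agree with exact-likelihood results at small scale.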
Exact enumeration is a unit test, not a training method. The state count doubles with every additional binary unit.
Contrastive Divergence
Contrastive Divergence (CD-\(k\)), proposed by Hinton, is the classic approximation:
- Start from a real data point \(\mathbf{v}^{(0)}\).
- Sample \(\mathbf{h}^{(0)}\sim p(\mathbf{h}\mid \mathbf{v}^{(0)})\).
- Run \(k\) steps of alternating Gibbs sampling to obtain \(\mathbf{v}^{(k)},\mathbf{h}^{(k)}\).
- Approximate the gradient with
\[ \Delta W \propto \mathbf{v}^{(0)}{\mathbf{h}^{(0)}}^\top - \mathbf{v}^{(k)}{\mathbf{h}^{(k)}}^\top. \]
CD-\(k\) is biased because the negative sample is not drawn from the true model distribution unless the Markov chain has mixed. It works as a practical learning rule, but its update only approximates the maximum-likelihood gradient.
CD-k Training Step
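A hand-written CD-\(k\) step:
def cd_k(v0, a, b, w, k):
    # Run k steps of block Gibbs sampling, starting from the data batch v0.
    v = v0
    for _ in range(k):
        _, h = sample_h_given_v(v, b, w)
        _, v = sample_v_given_h(h, a, w)
    return v
@torch.no_grad()
def rbm_cd_update(v0, a, b, w, lr, k):
    vk = cd_k(v0, a, b, w, k)
    grad_a, grad_b, grad_w = rbm_gradients(v0, vk, a, b, w)
    # In-place ascent step on the approximate log-likelihood gradient.
    a.add_(lr * grad_a)
    b.add_(lr * grad_b)
    w.add_(lr * grad_w)
    return vk
torch.no_grad() is used here because the RBM learning rule is executed explicitly rather than by optimizing a differentiable scalar loss through autograd. Even if the sampling procedure were placed inside autograd, gradients could not propagate directly through the Bernoulli samples.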
Persistent Contrastive Divergence
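CD-\(k\) restarts from the data at every update, so the negative samples tend to stay too close to the data. Persistent CD maintains a set of fantasy particles:
@torch.no_grad()
def pcd_update(v0, particles, a, b, w, lr, k):
    # particles: [B, D] persistent chain states carried across updates.
    v_neg = particles
    for _ in range(k):
        _, h_neg = sample_h_given_v(v_neg, b, w)
        _, v_neg = sample_v_given_h(h_neg, a, w)
    grad_a, grad_b, grad_w = rbm_gradients(v0, v_neg, a, b, w)
    a.add_(lr * grad_a)
    b.add_(lr * grad_b)
    w.add_(lr * grad_w)
    # Store the final chain states so the next update continues from them.
    particles.copy_(v_neg)
Persistent chains are closer to the model distribution, but they are also more prone to mixing problems: if the learning rate is too large, the model distribution keeps moving and the chains cannot keep up; if the energy landscape has deep basins, the chains get stuck.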
Gibbs Sampling as Alternating Denoising
Block Gibbs sampling in an RBM is:
\[ \mathbf{h}^{(t)}\sim p(\mathbf{h}\mid\mathbf{v}^{(t)}), \]
\[ \mathbf{v}^{(t+1)}\sim p(\mathbf{v}\mid\mathbf{h}^{(t)}). \]
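This is similar in spirit to many later generative models: start from a corrupted or model-generated state and repeatedly correct it using conditional distributions. The difference is that the RBM's conditionals come from an energy function, whereas the conditionals of a diffusion/denoising model are usually predicted directly by a deep network.
In Contrastive Divergence, \(k\) controls how far the negative sample moves from the data: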
| Method | Negative sample |
|---|---|
| CD-1 | one-step reconstruction |
| CD-k | short Markov chain from data |
| Persistent CD | chain persists across updates |
| Exact ML | true model samples after mixing |
CD-1 closely resembles an autoencoder's reconstruction pressure; Persistent CD is closer to the true negative phase, but in engineering terms it requires maintaining Markov chains.
Mixing Diagnostics
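Whether the Gibbs chain mixes is the most easily overlooked issue in RBM training. Common diagnostics:
| Diagnostic | What to watch |
|---|---|
| reconstruction error | whether CD-1 has only learned local reconstruction |
| fantasy samples | whether the persistent particles are diverse |
| hidden activation rate | whether hidden units are all on or all off |
| free-energy gap | whether data free energy is lower than that of negative samples |
| autocorrelation | whether the chain stays stuck in one region for a long time |
The free-energy gap can be written as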
\[ \Delta F = \mathbb{E}_{v\sim\text{data}}F(v) - \mathbb{E}_{v\sim\text{neg}}F(v). \]
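A well-trained model should usually give the data lower free energy, i.e. \(\Delta F<0\). But if the gap keeps widening while the fantasy samples degenerate, the negative chains may not be mixing and the model is "talking to itself".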
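The autocorrelation diagnostic from the table can be sketched as follows. This is one possible choice, not a prescribed recipe: it assumes the chain is summarized by a scalar per step (here the fraction of active visible units), and `autocorrelation`, `gibbs_trace`, the sizes, and the weight scale are all hypothetical.

```python
import torch

def autocorrelation(series, max_lag):
    # series: [T] scalar summary of a chain, one value per Gibbs step.
    x = series - series.mean()
    var = (x * x).mean().clamp_min(1e-12)  # guard against a constant series
    acf = [(x[: len(x) - lag] * x[lag:]).mean() / var for lag in range(max_lag + 1)]
    return torch.stack(acf)  # acf[0] is 1 for any non-constant series

def gibbs_trace(v, a, b, w, steps):
    # Record the fraction of active visible units after every block Gibbs step.
    trace = []
    for _ in range(steps):
        h = torch.bernoulli(torch.sigmoid(b + v @ w))
        v = torch.bernoulli(torch.sigmoid(a + h @ w.T))
        trace.append(v.mean())
    return torch.stack(trace)

torch.manual_seed(0)
D, K = 16, 8  # hypothetical sizes
a, b = torch.zeros(D), torch.zeros(K)
w = 0.1 * torch.randn(D, K)  # weak couplings, so the chain should move freely
v0 = torch.bernoulli(torch.full((D,), 0.5))

trace = gibbs_trace(v0, a, b, w, steps=500)
acf = autocorrelation(trace, max_lag=20)
```

Autocorrelation that stays high over many lags suggests the chain is lingering in one region; values that drop toward zero within a few lags are consistent with, though not proof of, reasonable mixing.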
Free Energy
The hidden units of an RBM can be summed out analytically, which gives the free energy of a visible state:
\[ F(\mathbf{v}) = -\mathbf{a}^{\top}\mathbf{v} -\sum_j \log \left( 1+\exp(b_j+W_{:j}^{\top}\mathbf{v}) \right). \]
so that
\[ p(\mathbf{v}) = \frac{\exp(-F(\mathbf{v}))}{Z}. \]
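Free energy is a good entry point for understanding RBMs: the hidden units act like a set of soft feature detectors; if some hidden features explain \(\mathbf{v}\) well, the corresponding \(\log(1+\exp(\cdot))\) terms lower \(F(\mathbf{v})\) and raise the probability of the data.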
Proof
The RBM marginal probability is
\[ p(\mathbf{v}) \propto \sum_{\mathbf{h}}\exp(-E(\mathbf{v},\mathbf{h})). \]
Substituting the energy:
\[ -E(\mathbf{v},\mathbf{h}) = \mathbf{a}^\top\mathbf{v} + \sum_j h_j(b_j+W_{:j}^\top\mathbf{v}). \]
Sum over the hidden units; because they are conditionally independent given \(\mathbf{v}\), the sum factorizes over \(j\):
\[ \sum_{\mathbf{h}} \exp(-E) = \exp(\mathbf{a}^\top\mathbf{v}) \prod_j \sum_{h_j\in\{0,1\}} \exp(h_j(b_j+W_{:j}^\top\mathbf{v})). \]
The sum over each binary unit gives
\[ \sum_{h_j\in\{0,1\}} \exp(h_j z_j) = 1+\exp(z_j). \]
Hence
\[ \sum_{\mathbf{h}}\exp(-E) = \exp\left( \mathbf{a}^\top\mathbf{v} + \sum_j\log(1+\exp(b_j+W_{:j}^\top\mathbf{v})) \right). \]
Defining \(p(\mathbf{v})\propto \exp(-F(\mathbf{v}))\), i.e. \(F(\mathbf{v})=-\log\sum_{\mathbf{h}}\exp(-E(\mathbf{v},\mathbf{h}))\), yields the free-energy formula.
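The closed form can be compared with a brute-force sum over hidden states on a tiny RBM. The sketch below assumes hypothetical sizes (D=4, K=3) and random float64 parameters; `free_energy_closed_form` and `free_energy_by_enumeration` are illustrative names chosen for this check.

```python
import itertools
import torch
import torch.nn.functional as F

def free_energy_closed_form(v, a, b, w):
    # F(v) = -a^T v - sum_j softplus(b_j + W_{:j}^T v); v: [N, D] -> [N]
    return -(v @ a) - F.softplus(b + v @ w).sum(dim=-1)

def free_energy_by_enumeration(v, a, b, w):
    # -log sum_h exp(-E(v, h)), enumerating all 2^K hidden states.
    h_all = torch.tensor(list(itertools.product([0.0, 1.0], repeat=b.numel())), dtype=v.dtype)
    neg_e = (v @ a)[:, None] + (h_all @ b)[None, :] + v @ w @ h_all.T  # [N, 2^K]
    return -torch.logsumexp(neg_e, dim=1)

torch.manual_seed(0)
D, K = 4, 3  # hypothetical tiny sizes
a = torch.randn(D, dtype=torch.float64)
b = torch.randn(K, dtype=torch.float64)
w = torch.randn(D, K, dtype=torch.float64)
v_all = torch.tensor(list(itertools.product([0.0, 1.0], repeat=D)), dtype=torch.float64)

closed = free_energy_closed_form(v_all, a, b, w)
brute = free_energy_by_enumeration(v_all, a, b, w)

# With h summed out, log Z needs only 2^D terms instead of 2^(D+K).
log_z = torch.logsumexp(-closed, dim=0)
p_v = torch.exp(-closed - log_z)
```

Agreement between `closed` and `brute`, together with `p_v` summing to one, is a useful unit test before training anything.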
Stable Free-Energy Implementation
The \(\log(1+\exp x)\) term in the free energy should be computed with softplus:
\[ \operatorname{softplus}(x)=\log(1+e^x). \]
import torch.nn.functional as F
def free_energy(v, a, b, w):
    # v: [B, D] -> free energy per sample, shape [B]
    hidden_logits = b + v @ w
    return -(v @ a) - F.softplus(hidden_logits).sum(dim=-1)
It can serve as a two-sample (data vs. negative) diagnostic:
def free_energy_gap(v_data, v_neg, a, b, w):
    # Mean free energy on data minus mean free energy on negative samples.
    return free_energy(v_data, a, b, w).mean() - free_energy(v_neg, a, b, w).mean()
The free-energy gap is the difference between average free energy on data samples and average free energy on negative/model samples.
Pseudo-Likelihood
The exact likelihood requires \(Z\), but pseudo-likelihood offers cheap monitoring. Pick a visible bit \(i\) at random and compare the free energy of the original sample with that of the sample with bit \(i\) flipped:
\[ \log p(v_i\mid v_{\setminus i}) = \log\sigma(F(v^{\text{flip }i})-F(v)). \]
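Intuition: if flipping a real bit raises the free energy, the model considers the original bit more plausible.
def pseudo_likelihood(v, a, b, w, bit_idx):
    # Flip one visible bit in every sample and compare free energies.
    v_flip = v.clone()
    v_flip[:, bit_idx] = 1.0 - v_flip[:, bit_idx]
    # log sigma(F(v_flip) - F(v)) = log p(v_i | v_rest), one value per sample.
    return F.logsigmoid(free_energy(v_flip, a, b, w) - free_energy(v, a, b, w))
Pseudo-likelihood is not the exact likelihood, but it does not need the partition function, which makes it suitable for tracking trends during training.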
Temperature and Sampling
The Boltzmann distribution often carries a temperature:
\[ p_T(\mathbf{x}) = \frac{\exp(-E(\mathbf{x})/T)}{Z_T}. \]
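When \(T\) is large the distribution is flatter and sampling is more random; when \(T\) is small the distribution concentrates on low-energy states. Simulated annealing lowers \(T\) gradually: explore first, then converge.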
Temperature rescales energy differences. Lower temperature sharpens the distribution around low-energy states; higher temperature smooths the distribution and encourages exploration.
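This is formally similar to the temperature used in LLM decoding today: both rescale logits or energies and thereby change the entropy of the sampling distribution. The difference is that the Boltzmann Machine temperature acts on the energy of the global state, whereas the LLM temperature acts on the next-token logits.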
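The effect of temperature on entropy can be seen on a small enumerated state space. This is a minimal sketch with made-up energies; `tempered_probs` and `entropy` are hypothetical helper names.

```python
import torch

def tempered_probs(energies, temperature):
    # p_T(x) = exp(-E(x) / T) / Z_T over an enumerated state space.
    return torch.softmax(-energies / temperature, dim=0)

def entropy(p):
    # Shannon entropy in nats; the clamp avoids log(0).
    return -(p * p.clamp_min(1e-30).log()).sum()

energies = torch.tensor([0.0, 1.0, 2.0, 5.0])  # hypothetical energies
temperatures = (0.25, 1.0, 4.0)
entropies = [entropy(tempered_probs(energies, t)) for t in temperatures]
peak_probs = [tempered_probs(energies, t).max() for t in temperatures]

for t, h, p in zip(temperatures, entropies, peak_probs):
    print(f"T={t}: entropy={h.item():.3f}, max prob={p.item():.3f}")
```

As \(T\) increases the entropy rises and the probability of the lowest-energy state falls, which is the same mechanism by which a decoding temperature flattens a next-token distribution.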
Annealed Importance Sampling Intuition
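To estimate \(Z_\theta\), a classic idea is to move smoothly from an easily normalized base distribution \(p_0\) to the target distribution \(p_1\). Define the intermediate distributions: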
\[ p_{\beta}(\mathbf{x}) \propto \exp\left( -(1-\beta)E_0(\mathbf{x})-\beta E_1(\mathbf{x}) \right), \qquad \beta\in[0,1]. \]
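Annealed Importance Sampling (AIS) uses a schedule \(\beta_0=0<\beta_1<\cdots<\beta_M=1\), applies an MCMC transition at each intermediate distribution, and accumulates an importance weight. Intuitively, jumping directly from a simple distribution to a complex RBM distribution is too hard; annealing slowly reduces the variance of the weights.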
Annealed Importance Sampling estimates a partition-function ratio by moving samples through a sequence of intermediate distributions between an easy base model and the target model.
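AIS has many implementation details, and these notes do not treat it as a core training algorithm. Still, it is worth knowing that when a paper reports an RBM likelihood, the value is often not computed exactly but estimated with AIS.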
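For intuition only, here is a stripped-down AIS sketch for a tiny binary RBM. It makes several simplifying assumptions that a serious implementation would revisit: the base model is uniform (\(E_0\equiv 0\), so \(\log Z_0=(D+K)\log 2\)), the schedule is linear in \(\beta\), and each intermediate distribution gets a single block Gibbs transition. The function names, chain count, and schedule length are hypothetical.

```python
import itertools
import math
import torch
import torch.nn.functional as F

def tempered_free_energy(v, a, b, w, beta):
    # Free energy of the intermediate model whose energy is beta * E(v, h).
    return -beta * (v @ a) - F.softplus(beta * (b + v @ w)).sum(dim=-1)

@torch.no_grad()
def ais_log_partition(a, b, w, num_chains=1000, num_betas=200):
    d, k = w.shape
    betas = torch.linspace(0.0, 1.0, num_betas, dtype=w.dtype)
    # beta = 0 is the uniform base model, which can be sampled exactly.
    v = torch.bernoulli(torch.full((num_chains, d), 0.5, dtype=w.dtype))
    log_w = torch.zeros(num_chains, dtype=w.dtype)
    for beta_prev, beta in zip(betas[:-1], betas[1:]):
        # Weight update: log p*_beta(v) - log p*_beta_prev(v) at the current state.
        log_w += tempered_free_energy(v, a, b, w, beta_prev) - tempered_free_energy(v, a, b, w, beta)
        # One block Gibbs transition that leaves p_beta invariant.
        h = torch.bernoulli(torch.sigmoid(beta * (b + v @ w)))
        v = torch.bernoulli(torch.sigmoid(beta * (a + h @ w.T)))
    log_z0 = (d + k) * math.log(2.0)
    # log Z ~= log Z_0 + log mean(importance weights)
    return log_z0 + torch.logsumexp(log_w, dim=0) - math.log(num_chains)

def exact_log_partition_small(a, b, w):
    # Enumerate visible states only; hidden units are summed out analytically.
    v_all = torch.tensor(list(itertools.product([0.0, 1.0], repeat=w.shape[0])), dtype=w.dtype)
    return torch.logsumexp(-tempered_free_energy(v_all, a, b, w, 1.0), dim=0)

torch.manual_seed(0)
D, K = 5, 4  # hypothetical tiny sizes so the exact answer is available
a = 0.5 * torch.randn(D, dtype=torch.float64)
b = 0.5 * torch.randn(K, dtype=torch.float64)
w = 0.5 * torch.randn(D, K, dtype=torch.float64)

log_z_ais = ais_log_partition(a, b, w)
log_z_exact = exact_log_partition_small(a, b, w)
```

On a model this small and weakly coupled the estimate should land close to the exact value; on realistic RBMs the quality depends heavily on the schedule, the base distribution, and how well the transitions mix.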
RBM as a Shallow Latent Variable Model
An RBM can be viewed as a shallow latent variable model:
\[ p(\mathbf{v}) = \sum_{\mathbf{h}}p(\mathbf{v},\mathbf{h}). \]
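The hidden units learn latent factors that explain correlations among the visible units. Early Deep Belief Networks trained RBMs layer by layer, treating the hidden activations of one layer as the visible data of the next. This approach is no longer mainstream, but it introduced an important training paradigm:
- pretrain a representation with an unsupervised objective;
- initialize a deep model with a local learning rule;
- then adjust it with supervised fine-tuning.
This line of work was later inherited by large-scale self-supervised pretraining, except that the model changed from RBMs to Transformers and the objective from contrastive divergence to next-token or denoising objectives.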
Why This Matters for Deep Learning
Boltzmann Machines are no longer a mainstream route for training large models, but they left behind several foundational ideas:
| Idea | Later echo |
|---|---|
| Energy function | EBMs, score matching, diffusion score networks |
| Hidden variables | VAE latent variables, representation learning |
| Positive/negative phase | Contrastive learning, noise-contrastive estimation |
| Gibbs sampling | MCMC, Langevin dynamics, denoising chains |
| Free energy | Variational bounds and normalized/unnormalized modeling |
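If the Hopfield network tells us that "memories can be energy minima", the Boltzmann Machine tells us that "a probability distribution can be defined by an energy landscape". Later models such as VAEs, GANs, diffusion, and flow matching are, in some sense, still answering the same question: how can a model learn the data distribution without being trapped by an intractable normalization constant?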
Implementation Checklist
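When implementing an RBM / Boltzmann Machine, check:
- whether visible/hidden states are {0,1} or {-1,+1}, and whether the energy formula matches;
- whether W has shape [D,K] and the batch matrix multiplies line up;
- whether conditional probabilities use a stable sigmoid;
- whether the free energy uses softplus;
- whether the positive phase uses the probabilities \(p(h\mid v)\) rather than adding sampling noise for no benefit;
- whether the negative phase comes from Gibbs/CD/PCD rather than directly reusing the data;
- whether \(k\) in CD-\(k\) is large enough, and whether fantasy samples are inspected;
- whether PCD particles actually persist and are not accidentally reinitialized;
- whether the learning rate lets the persistent chains keep up with the changing model;
- whether hidden activations collapse to all 0 or all 1;
- whether free-energy gap, pseudo-likelihood, and reconstruction error are monitored together;
- whether a tiny model has been sanity-checked against the exact partition function;
- if a likelihood is reported, whether it is stated to be exact, an AIS estimate, or a proxy metric.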