0.2 Boltzmann Machine


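The Boltzmann Machine pushes the deterministic dynamical system of the Hopfield network toward probabilistic modeling. The core of a Hopfield network is "descend in energy until the state settles in an attractor"; the core of a Boltzmann Machine is "energy defines a probability distribution, and learning means making the low-energy regions cover the data."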

From Energy to Probability

Definition: Energy-Based Model

An energy-based model assigns each state \(\mathbf{x}\) an energy \(E_\theta(\mathbf{x})\) and defines a probability distribution \[ p_\theta(\mathbf{x}) =\frac{\exp(-E_\theta(\mathbf{x}))}{Z_\theta}, \qquad Z_\theta=\sum_{\mathbf{x}}\exp(-E_\theta(\mathbf{x})). \] The partition function \(Z_\theta\) normalizes the distribution.

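Here, the smaller \(E_\theta(\mathbf{x})\) is, the larger \(p_\theta(\mathbf{x})\) is. The idea closely mirrors the Boltzmann distribution in physics: a system tends to occupy low-energy states, but temperature and random perturbations let it occasionally visit high-energy ones.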
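To make the definition concrete, here is a minimal sketch that enumerates a two-bit state space and normalizes by hand. The energy function and the value `theta=1.5` are hypothetical choices made only for illustration; they do not come from the notes.

```python
import math
from itertools import product

def energy(x, theta=1.5):
    # Hypothetical toy energy: lower (more favorable) when the two bits agree.
    x1, x2 = x
    return -theta * (1.0 if x1 == x2 else 0.0)

states = list(product([0, 1], repeat=2))
unnormalized = [math.exp(-energy(x)) for x in states]
Z = sum(unnormalized)                      # partition function by enumeration
probs = {x: u / Z for x, u in zip(states, unnormalized)}

for x in states:
    print(x, "E =", energy(x), "p =", round(probs[x], 3))
```

The agreeing states (0,0) and (1,1) have lower energy and therefore higher probability, and the probability ratio between any two states depends only on their energy difference.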

A Boltzmann Machine usually contains visible units \(\mathbf{v}\) and hidden units \(\mathbf{h}\):

\[ \mathbf{v}\in\{0,1\}^{D}, \qquad \mathbf{h}\in\{0,1\}^{K}. \]

Visible units correspond to the data, and hidden units correspond to unobserved explanatory factors. The energy function can be written as

\[ E(\mathbf{v},\mathbf{h}) = -\mathbf{a}^{\top}\mathbf{v} -\mathbf{b}^{\top}\mathbf{h} -\mathbf{v}^{\top}W\mathbf{h} -\frac{1}{2}\mathbf{v}^{\top}U\mathbf{v} -\frac{1}{2}\mathbf{h}^{\top}V\mathbf{h}. \]

A Restricted Boltzmann Machine (RBM) removes the visible-visible and hidden-hidden connections, i.e. sets \(U=0, V=0\), which yields a bipartite graph structure:

\[ E(\mathbf{v},\mathbf{h}) = -\mathbf{a}^{\top}\mathbf{v} -\mathbf{b}^{\top}\mathbf{h} -\mathbf{v}^{\top}W\mathbf{h}. \]

Parameter Shapes and Sign Convention

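This section uses the binary \(\{0,1\}\) representation rather than the \(\{-1,+1\}\) spin representation from the Hopfield page. With this choice the RBM conditional probabilities reduce directly to sigmoids. The tensor shapes used in the code are: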

v: [B, D]
h: [B, K]
a: [D]       visible bias
b: [K]       hidden bias
W: [D, K]

For a batch, the energy can be written as:

def rbm_energy(v, h, a, b, w):
    # v: [B, D], h: [B, K], w: [D, K]
    vbias = v @ a
    hbias = h @ b
    interaction = (v @ w * h).sum(dim=-1)
    return -(vbias + hbias + interaction)
Pitfall: Energy Sign Convention

Some texts define \(E=+\mathbf{v}^\top W\mathbf{h}-\mathbf{a}^\top\mathbf{v}-\mathbf{b}^\top\mathbf{h}\) or use \(\{-1,+1\}\) units. Always rederive the conditional probabilities after changing conventions; otherwise the update direction can silently flip.
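As a worked instance of this pitfall, the following sketch shows how the hidden conditional changes under two alternative conventions. It assumes the RBM energy form used in this section and writes \(z_j=b_j+\sum_i W_{ij}v_i\) for the local field; this shorthand is introduced here only for illustration.

With spin units \(h_j\in\{-1,+1\}\) and the same energy form:

\[ p(h_j=+1\mid\mathbf{v}) = \frac{e^{z_j}}{e^{z_j}+e^{-z_j}} = \sigma(2z_j). \]

With \(\{0,1\}\) units but the flipped interaction sign \(E=+\mathbf{v}^\top W\mathbf{h}-\mathbf{a}^\top\mathbf{v}-\mathbf{b}^\top\mathbf{h}\):

\[ p(h_j=1\mid\mathbf{v}) = \sigma\left(b_j-\sum_i W_{ij}v_i\right). \]

In the first case the logit is doubled; in the second the weight term changes sign. Code that hard-codes \(\sigma(b_j+\sum_i W_{ij}v_i)\) would be wrong under either convention.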

Conditional Independence in RBM

The RBM matters because it gives up some expressive power in exchange for very clean conditional distributions.

Theorem: RBM Factorized Conditionals

For an RBM with binary units, \[ p(h_j=1\mid \mathbf{v})=\sigma\left(b_j+\sum_i W_{ij}v_i\right), \] and \[ p(v_i=1\mid \mathbf{h})=\sigma\left(a_i+\sum_j W_{ij}h_j\right). \] Thus all hidden units are conditionally independent given \(\mathbf{v}\), and all visible units are conditionally independent given \(\mathbf{h}\).

Proof

With \(\mathbf{v}\) fixed, the RBM energy depends on \(\mathbf{h}\) only through

\[ E(\mathbf{v},\mathbf{h}) =-\sum_j h_j \left(b_j+\sum_i W_{ij}v_i\right) +\text{const}(\mathbf{v}). \]

Therefore

\[ p(\mathbf{h}\mid \mathbf{v}) \propto \prod_j \exp\left( h_j \left(b_j+\sum_i W_{ij}v_i\right) \right), \]

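which factorizes over \(j\). Normalizing each binary factor gives the sigmoid: with \(z_j=b_j+\sum_i W_{ij}v_i\), \(p(h_j=1\mid\mathbf{v})=e^{z_j}/(1+e^{z_j})=\sigma(z_j)\). The argument for \(p(v_i=1\mid\mathbf{h})\) is symmetric.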

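These factorized conditionals are also why RBMs were so popular in the early days of deep learning: unlike a general Boltzmann Machine, where sampling is prohibitively hard, an RBM can alternate between sampling \(\mathbf{h}\sim p(\mathbf{h}\mid\mathbf{v})\) and \(\mathbf{v}\sim p(\mathbf{v}\mid\mathbf{h})\).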
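The factorized-conditional theorem can be checked by brute force on a tiny model. The sketch below uses plain Python and a hypothetical RBM with 3 visible and 2 hidden units (all parameter values are arbitrary); it compares the enumerated conditional with the sigmoid formula and with the product of per-unit conditionals.

```python
import math
from itertools import product

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical tiny RBM: D = 3 visible units, K = 2 hidden units.
D, K = 3, 2
a = [0.1, -0.2, 0.3]
b = [0.5, -0.4]
W = [[0.8, -0.5],
     [-0.3, 0.6],
     [0.2, 0.9]]                      # W[i][j], shape [D, K]

def energy(v, h):
    e = -sum(a[i] * v[i] for i in range(D))
    e -= sum(b[j] * h[j] for j in range(K))
    e -= sum(v[i] * W[i][j] * h[j] for i in range(D) for j in range(K))
    return e

v = (1, 0, 1)
hs = list(product([0, 1], repeat=K))
weights = [math.exp(-energy(v, h)) for h in hs]
norm = sum(weights)                   # sums over h only; Z is not needed

# p(h_j = 1 | v) by enumeration versus the closed-form sigmoid.
brute_force = [sum(w for h, w in zip(hs, weights) if h[j] == 1) / norm
               for j in range(K)]
closed_form = [sigmoid(b[j] + sum(W[i][j] * v[i] for i in range(D)))
               for j in range(K)]

# The joint conditional p(h | v) should equal the product of its marginals.
factorization_error = 0.0
for h, w in zip(hs, weights):
    prod = 1.0
    for j in range(K):
        prod *= closed_form[j] if h[j] == 1 else 1.0 - closed_form[j]
    factorization_error = max(factorization_error, abs(w / norm - prod))

print(brute_force, closed_form, factorization_error)
```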

Stable Conditional Sampling

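In an implementation, do not hand-write 1 / (1 + exp(-x)); large positive or negative inputs carry numerical risk. PyTorch's torch.sigmoid and torch.nn.functional.softplus handle these cases more stably.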

import torch


def sample_bernoulli(prob):
    return torch.bernoulli(prob)


def sample_h_given_v(v, b, w):
    logits = b + v @ w
    prob = torch.sigmoid(logits)
    return prob, sample_bernoulli(prob)


def sample_v_given_h(h, a, w):
    logits = a + h @ w.T
    prob = torch.sigmoid(logits)
    return prob, sample_bernoulli(prob)
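To illustrate the numerical risk mentioned before the sampling code, here is a plain-Python sketch of one common way to write a stable sigmoid and softplus. It is not PyTorch's actual implementation; the function names are hypothetical. In plain Python `math.exp` raises `OverflowError` on overflow, whereas floating-point tensor libraries return `inf` instead, which makes a naive `log(1 + exp(x))` evaluate to `inf` for large `x`.

```python
import math

def naive_sigmoid(x):
    # exp(-x) overflows when x is very negative.
    return 1.0 / (1.0 + math.exp(-x))

def stable_sigmoid(x):
    # Only ever exponentiate a non-positive number.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def stable_softplus(x):
    # log(1 + e^x) = max(x, 0) + log(1 + e^{-|x|})
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

try:
    naive_sigmoid(-1000.0)
    naive_failed = False
except OverflowError:
    naive_failed = True

print("naive sigmoid overflowed:", naive_failed)
print(stable_sigmoid(-1000.0), stable_sigmoid(1000.0))
print(stable_softplus(-1000.0), stable_softplus(1000.0))
```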

Gradient statistics often use probabilities, while chain transitions use binary samples:

| Quantity | Used for |
| --- | --- |
| \(p(h_j=1\mid v)\) | positive-phase expectation |
| sampled \(h_j\) | Gibbs chain transition |
| \(p(v_i=1\mid h)\) | reconstruction probability |
| sampled \(v_i\) | negative sample state |

Pitfall: Probability and Sample Are Different Objects

For gradients, using probabilities can reduce variance in the positive phase. For Markov-chain transitions, the state must usually be sampled, otherwise the chain becomes deterministic mean-field dynamics.

Maximum Likelihood Learning

For a data point \(\mathbf{v}\), the marginal probability is

\[ p_\theta(\mathbf{v}) = \sum_{\mathbf{h}} \frac{\exp(-E_\theta(\mathbf{v},\mathbf{h}))}{Z_\theta}. \]

The learning objective is to maximize the log-likelihood:

\[ \mathcal{L}(\theta) =\sum_{n=1}^{N}\log p_\theta(\mathbf{v}^{(n)}). \]

Differentiating with respect to a weight \(W_{ij}\) gives a very important form:

\[ \frac{\partial \log p_\theta(\mathbf{v})}{\partial W_{ij}} = \mathbb{E}_{p_\theta(\mathbf{h}\mid\mathbf{v})}[v_i h_j] - \mathbb{E}_{p_\theta(\mathbf{v},\mathbf{h})}[v_i h_j]. \]

Definition: Positive and Negative Phases

The positive phase raises probability around observed data by increasing correlations under \(p_\theta(\mathbf{h}\mid\mathbf{v})\). The negative phase lowers probability assigned to model-generated samples by subtracting correlations under \(p_\theta(\mathbf{v},\mathbf{h})\).

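The intuition is clear: the first term says that visible-hidden relationships that frequently co-occur in the data should be strengthened; the second says that samples the model "fantasizes" on its own must be pulled back if the model is overconfident about them. This is the basic tension of energy-based learning.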

Proof

For a single data point \(\mathbf{v}\),

\[ \log p_\theta(\mathbf{v}) = \log\sum_{\mathbf{h}}\exp(-E_\theta(\mathbf{v},\mathbf{h})) - \log Z_\theta. \]

Differentiating with respect to a parameter \(\theta\):

\[ \frac{\partial}{\partial\theta} \log\sum_{\mathbf{h}}\exp(-E_\theta(\mathbf{v},\mathbf{h})) = - \mathbb{E}_{p_\theta(\mathbf{h}\mid\mathbf{v})} \left[ \frac{\partial E_\theta(\mathbf{v},\mathbf{h})}{\partial\theta} \right]. \]

Likewise,

\[ \frac{\partial}{\partial\theta}\log Z_\theta = - \mathbb{E}_{p_\theta(\mathbf{v},\mathbf{h})} \left[ \frac{\partial E_\theta(\mathbf{v},\mathbf{h})}{\partial\theta} \right]. \]

Therefore

\[ \frac{\partial \log p_\theta(\mathbf{v})}{\partial\theta} = - \mathbb{E}_{p_\theta(\mathbf{h}\mid\mathbf{v})} \left[ \frac{\partial E_\theta}{\partial\theta} \right] + \mathbb{E}_{p_\theta(\mathbf{v},\mathbf{h})} \left[ \frac{\partial E_\theta}{\partial\theta} \right]. \]

For an RBM,

\[ \frac{\partial E}{\partial W_{ij}}=-v_ih_j. \]

Substituting gives

\[ \frac{\partial \log p_\theta(\mathbf{v})}{\partial W_{ij}} = \mathbb{E}_{p_\theta(\mathbf{h}\mid\mathbf{v})}[v_ih_j] - \mathbb{E}_{p_\theta(\mathbf{v},\mathbf{h})}[v_ih_j]. \]
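The positive-minus-negative form derived above can be sanity-checked numerically. The sketch below, in plain Python with a hypothetical 2-by-2 RBM (arbitrary parameter values), computes both expectations by exact enumeration and compares their difference with a central finite difference of \(\log p_\theta(\mathbf{v})\).

```python
import math
from itertools import product

# Hypothetical tiny RBM (D = 2 visible, K = 2 hidden); values are arbitrary.
D, K = 2, 2
a = [0.2, -0.1]
b = [0.3, -0.5]
W = [[0.7, -0.4],
     [0.1, 0.6]]

VS = list(product([0, 1], repeat=D))
HS = list(product([0, 1], repeat=K))

def energy(v, h, W):
    e = -sum(a[i] * v[i] for i in range(D))
    e -= sum(b[j] * h[j] for j in range(K))
    e -= sum(v[i] * W[i][j] * h[j] for i in range(D) for j in range(K))
    return e

def log_p(v, W):
    numerator = sum(math.exp(-energy(v, h, W)) for h in HS)
    Z = sum(math.exp(-energy(u, h, W)) for u in VS for h in HS)
    return math.log(numerator) - math.log(Z)

def phase_gradient(v, W, i, j):
    # Positive phase: E_{p(h|v)}[v_i h_j], enumerating h only.
    cond = [math.exp(-energy(v, h, W)) for h in HS]
    positive = sum(w * v[i] * h[j] for h, w in zip(HS, cond)) / sum(cond)
    # Negative phase: E_{p(v,h)}[v_i h_j], enumerating every joint state.
    joint = [(u, h, math.exp(-energy(u, h, W))) for u in VS for h in HS]
    Z = sum(w for _, _, w in joint)
    negative = sum(w * u[i] * h[j] for u, h, w in joint) / Z
    return positive - negative

def finite_difference(v, W, i, j, eps=1e-5):
    W_plus = [row[:] for row in W]
    W_minus = [row[:] for row in W]
    W_plus[i][j] += eps
    W_minus[i][j] -= eps
    return (log_p(v, W_plus) - log_p(v, W_minus)) / (2 * eps)

v = (1, 1)
max_err = max(abs(phase_gradient(v, W, i, j) - finite_difference(v, W, i, j))
              for i in range(D) for j in range(K))
print("max |analytic - numeric| =", max_err)
```

Enumeration of the negative phase is only possible because the model is tiny; the same check is a useful unit test before trusting any sampled approximation.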

Bias Gradients and Mini-Batch Estimates

Similarly, for the visible and hidden biases:

\[ \frac{\partial \log p_\theta(\mathbf{v})}{\partial a_i} = \mathbb{E}_{p_\theta(\mathbf{h}\mid\mathbf{v})}[v_i] - \mathbb{E}_{p_\theta(\mathbf{v},\mathbf{h})}[v_i], \]

\[ \frac{\partial \log p_\theta(\mathbf{v})}{\partial b_j} = \mathbb{E}_{p_\theta(\mathbf{h}\mid\mathbf{v})}[h_j] - \mathbb{E}_{p_\theta(\mathbf{v},\mathbf{h})}[h_j]. \]

For a mini-batch \(\{v^{(n)}\}_{n=1}^B\), stacked as the rows of a \(B\times D\) matrix \(V^+\), the positive phase can compute the hidden probabilities exactly:

\[ P_H^+ = \sigma(\mathbf{1}b^\top + V^+ W), \]

so that

\[ \widehat{\nabla_W^+} = \frac1B (V^+)^\top P_H^+. \]

The negative phase uses a Gibbs chain to obtain \(V^-, H^-\) or the probabilities \(P_H^-\):

\[ \widehat{\nabla_W^-} = \frac1B (V^-)^\top P_H^-. \]

The parameter update direction is

\[ \Delta W \propto \widehat{\nabla_W^+}-\widehat{\nabla_W^-}. \]

def rbm_gradients(v_pos, v_neg, a, b, w):
    ph_pos = torch.sigmoid(b + v_pos @ w)
    ph_neg = torch.sigmoid(b + v_neg @ w)
    batch = v_pos.shape[0]
    grad_w = (v_pos.T @ ph_pos - v_neg.T @ ph_neg) / batch
    grad_a = (v_pos - v_neg).mean(dim=0)
    grad_b = (ph_pos - ph_neg).mean(dim=0)
    return grad_a, grad_b, grad_w

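Note that this code is a hand-written learning rule, not a standard autograd loss. The difficulty of the RBM is precisely that the negative phase needs samples from the model distribution rather than a supervised loss in an ordinary forward graph.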

Partition Function and Intractability

The difficulty lies in \(Z_\theta\):

\[ Z_\theta = \sum_{\mathbf{v},\mathbf{h}} \exp(-E_\theta(\mathbf{v},\mathbf{h})). \]

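If the visible and hidden layers together have \(D+K\) binary variables, the number of joint states is \(2^{D+K}\). This quickly makes exact maximum likelihood infeasible. (For an RBM, summing one layer out analytically reduces the sum to \(2^{\min(D,K)}\) terms, which is still exponential.) Many difficulties of modern generative models already appear here in embryonic form: the model distribution can be written down, but normalization, sampling, or the likelihood gradient is hard.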
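A quick bit of arithmetic shows how fast the state count grows. The layer sizes below are hypothetical (784 visible units would correspond to a 28-by-28 binary image); they are chosen only to give a sense of scale.

```python
D, K = 784, 500                 # hypothetical layer sizes, for scale only
num_states = 2 ** (D + K)       # Python integers have arbitrary precision
digits = len(str(num_states))
print(f"2^{D + K} has {digits} decimal digits")

# By contrast, a unit-test-sized model is easy to enumerate.
tiny_states = 2 ** (6 + 4)
print("D=6, K=4 gives", tiny_states, "joint states")
```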

Exact Partition Function for Tiny RBM

A small model can be enumerated exactly, which is useful for validating an implementation:

def all_binary_states(n, device):
    values = torch.arange(2 ** n, device=device)
    bits = ((values[:, None] >> torch.arange(n, device=device)) & 1).float()
    return bits


def exact_log_partition(a, b, w):
    v_all = all_binary_states(a.numel(), a.device)
    h_all = all_binary_states(b.numel(), b.device)
    energies = []
    for h in h_all:
        h_batch = h.expand(v_all.shape[0], -1)
        energies.append(rbm_energy(v_all, h_batch, a, b, w))
    e = torch.cat(energies)
    return torch.logsumexp(-e, dim=0)

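This function is only usable for very small \(D+K\). Its value is in testing: checking whether the free energy, the Gibbs sampler, and the CD update direction agree with exact-likelihood results at small scale.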

Pitfall: Tiny Exact Checks Do Not Scale

Exact enumeration is a unit test, not a training method. The state count doubles with every additional binary unit.

Contrastive Divergence

Contrastive Divergence (CD-\(k\)), proposed by Hinton, is the classic approximation:

  1. Start from a real data point \(\mathbf{v}^{(0)}\).
  2. Sample \(\mathbf{h}^{(0)}\sim p(\mathbf{h}\mid \mathbf{v}^{(0)})\).
  3. Run \(k\) steps of alternating Gibbs sampling to obtain \(\mathbf{v}^{(k)},\mathbf{h}^{(k)}\).
  4. Approximate the gradient with

\[ \Delta W \propto \mathbf{v}^{(0)}{\mathbf{h}^{(0)}}^\top - \mathbf{v}^{(k)}{\mathbf{h}^{(k)}}^\top. \]

Pitfall: CD Is Not Exact Maximum Likelihood

CD-\(k\) is biased because the negative sample is not drawn from the true model distribution unless the Markov chain has mixed. It works as a practical learning rule, but the objective it optimizes is only an approximation to maximum likelihood.

CD-k Training Step

A hand-written CD-\(k\) step:

def cd_k(v0, a, b, w, k):
    v = v0
    for _ in range(k):
        _, h = sample_h_given_v(v, b, w)
        _, v = sample_v_given_h(h, a, w)
    return v


@torch.no_grad()
def rbm_cd_update(v0, a, b, w, lr, k):
    vk = cd_k(v0, a, b, w, k)
    grad_a, grad_b, grad_w = rbm_gradients(v0, vk, a, b, w)
    a.add_(lr * grad_a)
    b.add_(lr * grad_b)
    w.add_(lr * grad_w)
    return vk

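torch.no_grad() is used here because we execute the RBM learning rule explicitly instead of optimizing a differentiable scalar loss through autograd. Even if the sampling procedure were placed inside autograd, gradients could not be backpropagated directly through the Bernoulli samples.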

Persistent Contrastive Divergence

CD-\(k\) restarts from the data at every update, so its negative samples tend to stay too close to the data. Persistent CD instead maintains a set of fantasy particles:

@torch.no_grad()
def pcd_update(v0, particles, a, b, w, lr, k):
    v_neg = particles
    for _ in range(k):
        _, h_neg = sample_h_given_v(v_neg, b, w)
        _, v_neg = sample_v_given_h(h_neg, a, w)
    grad_a, grad_b, grad_w = rbm_gradients(v0, v_neg, a, b, w)
    a.add_(lr * grad_a)
    b.add_(lr * grad_b)
    w.add_(lr * grad_w)
    particles.copy_(v_neg)

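Persistent chains are closer to the model distribution, but they are also more prone to mixing problems: if the learning rate is too large, the model distribution keeps moving and the chains cannot keep up; if the energy landscape has deep basins, the chains get stuck.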

Gibbs Sampling as Alternating Denoising

Block Gibbs sampling in an RBM is:

\[ \mathbf{h}^{(t)}\sim p(\mathbf{h}\mid\mathbf{v}^{(t)}), \]

\[ \mathbf{v}^{(t+1)}\sim p(\mathbf{v}\mid\mathbf{h}^{(t)}). \]

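This shares a spirit with many later generative models: start from a corrupted or model-generated state and repeatedly correct it using conditional distributions. The difference is that the RBM's conditionals come from the energy function, whereas in diffusion/denoising models they are usually predicted directly by a deep network.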
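As a small illustration of the alternating updates, the following plain-Python sketch runs a block Gibbs chain on a hypothetical 2-by-2 RBM and compares the empirical distribution of visited visible states with the exact marginal obtained by enumeration. The parameter values, chain length, burn-in, and seed are all arbitrary assumptions.

```python
import math
import random
from itertools import product

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical tiny RBM; small weights so the chain mixes quickly.
D, K = 2, 2
a = [0.2, -0.1]
b = [0.3, -0.5]
W = [[0.7, -0.4],
     [0.1, 0.6]]

def sample_h(v, rng):
    return tuple(int(rng.random() < sigmoid(b[j] + sum(W[i][j] * v[i] for i in range(D))))
                 for j in range(K))

def sample_v(h, rng):
    return tuple(int(rng.random() < sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(K))))
                 for i in range(D))

def energy(v, h):
    e = -sum(a[i] * v[i] for i in range(D)) - sum(b[j] * h[j] for j in range(K))
    return e - sum(v[i] * W[i][j] * h[j] for i in range(D) for j in range(K))

def exact_visible_marginal():
    vs = list(product([0, 1], repeat=D))
    hs = list(product([0, 1], repeat=K))
    unnorm = {v: sum(math.exp(-energy(v, h)) for h in hs) for v in vs}
    Z = sum(unnorm.values())
    return {v: u / Z for v, u in unnorm.items()}

def gibbs_visible_frequencies(steps=20000, burn_in=500, seed=0):
    rng = random.Random(seed)
    v = (0,) * D
    counts = {u: 0 for u in product([0, 1], repeat=D)}
    for t in range(burn_in + steps):
        h = sample_h(v, rng)          # h^(t) ~ p(h | v^(t))
        v = sample_v(h, rng)          # v^(t+1) ~ p(v | h^(t))
        if t >= burn_in:
            counts[v] += 1
    return {u: c / steps for u, c in counts.items()}

exact = exact_visible_marginal()
freq = gibbs_visible_frequencies()
tv = 0.5 * sum(abs(exact[u] - freq[u]) for u in exact)
print("total variation distance:", tv)
```

With larger weights or deeper energy basins the same chain would need far more steps, which is the mixing concern raised in the Persistent CD discussion.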

In Contrastive Divergence, \(k\) controls how far the negative sample moves away from the data:

| Method | Negative sample |
| --- | --- |
| CD-1 | one-step reconstruction |
| CD-k | short Markov chain from data |
| Persistent CD | chain persists across updates |
| Exact ML | true model samples after mixing |

CD-1 behaves much like autoencoder reconstruction pressure; Persistent CD is closer to the true negative phase, but in practice it requires maintaining Markov chains.

Mixing Diagnostics

Whether the Gibbs chain mixes is the most easily overlooked issue in RBM training. Common diagnostics:

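| Diagnostic | What to watch |
| --- | --- |
| reconstruction error | whether CD-1 has only learned local reconstruction |
| fantasy samples | whether the persistent particles are diverse |
| hidden activation rate | whether hidden units are all on or all off |
| free-energy gap | whether data free energy is lower than that of negative samples |
| autocorrelation | whether chains stay stuck in one region for a long time |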

The free-energy gap can be written as

\[ \Delta F = \mathbb{E}_{v\sim\text{data}}F(v) - \mathbb{E}_{v\sim\text{neg}}F(v). \]

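A well-trained model should usually give the data lower free energy, i.e. \(\Delta F<0\). But if the gap keeps growing while the fantasy samples degenerate, the negative chains may not be mixing, and the model is "talking to itself."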

Free Energy

The hidden units of an RBM can be summed out analytically, giving the free energy of a visible state:

\[ F(\mathbf{v}) = -\mathbf{a}^{\top}\mathbf{v} -\sum_j \log \left( 1+\exp(b_j+W_{:j}^{\top}\mathbf{v}) \right). \]

so that

\[ p(\mathbf{v}) = \frac{\exp(-F(\mathbf{v}))}{Z}. \]

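Free energy is a good entry point for understanding RBMs: the hidden units act like a set of soft feature detectors; if some hidden features explain \(\mathbf{v}\) well, the corresponding \(\log(1+\exp(\cdot))\) terms lower \(F(\mathbf{v})\) and raise the probability of the data.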

Proof

The RBM marginal probability is

\[ p(\mathbf{v}) \propto \sum_{\mathbf{h}}\exp(-E(\mathbf{v},\mathbf{h})). \]

Substituting the energy:

\[ -E(\mathbf{v},\mathbf{h}) = \mathbf{a}^\top\mathbf{v} + \sum_j h_j(b_j+W_{:j}^\top\mathbf{v}). \]

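Now sum over the hidden units. Because the exponent is a sum of per-unit terms (the hidden units are conditionally independent), the sum over \(\mathbf{h}\) factorizes into a product over \(j\):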

\[ \sum_{\mathbf{h}} \exp(-E) = \exp(\mathbf{a}^\top\mathbf{v}) \prod_j \sum_{h_j\in\{0,1\}} \exp(h_j(b_j+W_{:j}^\top\mathbf{v})). \]

Each binary sum gives

\[ \sum_{h_j\in\{0,1\}} \exp(h_j z_j) = 1+\exp(z_j). \]

Hence

\[ \sum_{\mathbf{h}}\exp(-E) = \exp\left( \mathbf{a}^\top\mathbf{v} + \sum_j\log(1+\exp(b_j+W_{:j}^\top\mathbf{v})) \right). \]

Defining \(p(\mathbf{v})\propto \exp(-F(\mathbf{v}))\), i.e. \(F(\mathbf{v})=-\log\sum_{\mathbf{h}}\exp(-E(\mathbf{v},\mathbf{h}))\), yields the free-energy formula.
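The closed form can be checked against the definition \(F(\mathbf{v})=-\log\sum_{\mathbf{h}}\exp(-E(\mathbf{v},\mathbf{h}))\) on a tiny model. The sketch below is plain Python with a hypothetical 3-by-2 RBM; parameter values and helper names are assumptions for illustration.

```python
import math
from itertools import product

def softplus(x):
    # Stable log(1 + e^x).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Hypothetical tiny RBM: D = 3 visible units, K = 2 hidden units.
D, K = 3, 2
a = [0.1, -0.2, 0.3]
b = [0.5, -0.4]
W = [[0.8, -0.5],
     [-0.3, 0.6],
     [0.2, 0.9]]

def energy(v, h):
    e = -sum(a[i] * v[i] for i in range(D)) - sum(b[j] * h[j] for j in range(K))
    return e - sum(v[i] * W[i][j] * h[j] for i in range(D) for j in range(K))

def free_energy_closed(v):
    visible_term = sum(a[i] * v[i] for i in range(D))
    hidden_term = sum(softplus(b[j] + sum(W[i][j] * v[i] for i in range(D)))
                      for j in range(K))
    return -visible_term - hidden_term

def free_energy_brute(v):
    # Direct definition: sum out every hidden configuration.
    total = sum(math.exp(-energy(v, h)) for h in product([0, 1], repeat=K))
    return -math.log(total)

max_diff = max(abs(free_energy_closed(v) - free_energy_brute(v))
               for v in product([0, 1], repeat=D))
print("max |closed - brute| =", max_diff)
```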

Stable Free-Energy Implementation

The \(\log(1+\exp x)\) in the free energy should be computed with softplus:

\[ \operatorname{softplus}(x)=\log(1+e^x). \]

import torch.nn.functional as F


def free_energy(v, a, b, w):
    hidden_logits = b + v @ w
    return -(v @ a) - F.softplus(hidden_logits).sum(dim=-1)

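The free_energy function can be used for a two-class style diagnostic that contrasts data with negative samples: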

def free_energy_gap(v_data, v_neg, a, b, w):
    return free_energy(v_data, a, b, w).mean() - free_energy(v_neg, a, b, w).mean()
Definition: Free-Energy Gap

The free-energy gap is the difference between average free energy on data samples and average free energy on negative/model samples.

Pseudo-Likelihood

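Exact likelihood requires \(Z\), but pseudo-likelihood can serve as cheap monitoring. Pick a visible bit \(i\) at random and compare the free energy of the original sample with that of the sample with bit \(i\) flipped: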

\[ \log p(v_i\mid v_{\setminus i}) = \log\sigma(F(v^{\text{flip }i})-F(v)). \]

Intuition: if flipping a real bit raises the free energy, the model considers the original bit more plausible.

def pseudo_likelihood(v, a, b, w, bit_idx):
    v_flip = v.clone()
    v_flip[:, bit_idx] = 1.0 - v_flip[:, bit_idx]
    return F.logsigmoid(free_energy(v_flip, a, b, w) - free_energy(v, a, b, w))

Pseudo-likelihood is not the exact likelihood, but it needs no partition function and is well suited to watching trends during training.
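The identity behind the pseudo-likelihood formula follows from \(p(v_i\mid v_{\setminus i}) = p(v)/(p(v)+p(v^{\text{flip }i}))\), where \(Z\) cancels. The plain-Python sketch below verifies this on a hypothetical 3-by-2 RBM by comparing the free-energy expression with the conditional computed from brute-force marginals; all parameter values and helper names are illustrative assumptions.

```python
import math
from itertools import product

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Hypothetical tiny RBM: D = 3 visible units, K = 2 hidden units.
D, K = 3, 2
a = [0.1, -0.2, 0.3]
b = [0.5, -0.4]
W = [[0.8, -0.5],
     [-0.3, 0.6],
     [0.2, 0.9]]

def free_energy(v):
    hidden = sum(softplus(b[j] + sum(W[i][j] * v[i] for i in range(D)))
                 for j in range(K))
    return -sum(a[i] * v[i] for i in range(D)) - hidden

def unnormalized_marginal(v):
    # sum_h exp(-E(v, h)) by brute force; Z cancels in the conditional below.
    total = 0.0
    for h in product([0, 1], repeat=K):
        e = -sum(a[i] * v[i] for i in range(D)) - sum(b[j] * h[j] for j in range(K))
        e -= sum(v[i] * W[i][j] * h[j] for i in range(D) for j in range(K))
        total += math.exp(-e)
    return total

def flip(v, i):
    return tuple(1 - x if k == i else x for k, x in enumerate(v))

max_diff = 0.0
for v in product([0, 1], repeat=D):
    for i in range(D):
        v_flip = flip(v, i)
        p_v, p_flip = unnormalized_marginal(v), unnormalized_marginal(v_flip)
        exact_conditional = p_v / (p_v + p_flip)
        via_free_energy = sigmoid(free_energy(v_flip) - free_energy(v))
        max_diff = max(max_diff, abs(exact_conditional - via_free_energy))

print("max |exact - free-energy form| =", max_diff)
```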

Temperature and Sampling

The Boltzmann distribution often carries a temperature:

\[ p_T(\mathbf{x}) = \frac{\exp(-E(\mathbf{x})/T)}{Z_T}. \]

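When \(T\) is large the distribution is flatter and sampling is more random; when \(T\) is small the distribution concentrates on low-energy states. Simulated annealing gradually lowers \(T\): explore first, then converge.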

Definition: Temperature in Energy Models

Temperature rescales energy differences. Lower temperature sharpens the distribution around low-energy states; higher temperature smooths the distribution and encourages exploration.

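This is formally similar to the temperature used in LLM decoding today: both rescale logits/energies and thereby change the entropy of the sampling distribution. The difference is that the Boltzmann Machine temperature acts on the energy of the global state, whereas the LLM temperature acts on the next-token logits.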
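The effect of temperature on entropy can be seen on a four-state toy distribution. The energies and temperatures below are hypothetical values chosen for illustration.

```python
import math

def boltzmann(energies, T):
    # Subtracting the minimum energy is a stability trick; it cancels in the ratio.
    m = min(energies)
    weights = [math.exp(-(e - m) / T) for e in energies]
    Z = sum(weights)
    return [w / Z for w in weights]

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0.0)

energies = [0.0, 1.0, 2.0, 4.0]       # hypothetical state energies
entropies = {}
for T in (0.25, 1.0, 4.0):
    p = boltzmann(energies, T)
    entropies[T] = entropy(p)
    print(f"T={T}: p={[round(q, 3) for q in p]}, entropy={entropies[T]:.3f}")
```

Entropy rises with \(T\) and stays below \(\log 4\), the entropy of the uniform distribution over four states.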

Annealed Importance Sampling Intuition

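To estimate \(Z_\theta\), a classic idea is to move smoothly from an easily normalized base distribution \(p_0\) to the target distribution \(p_1\). Define the intermediate distributions: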

\[ p_{\beta}(\mathbf{x}) \propto \exp\left( -(1-\beta)E_0(\mathbf{x})-\beta E_1(\mathbf{x}) \right), \qquad \beta\in[0,1]. \]

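Annealed Importance Sampling (AIS) uses a schedule \(\beta_0=0<\beta_1<\cdots<\beta_M=1\), applies an MCMC transition at each intermediate distribution, and accumulates an importance weight. Intuitively, jumping directly from a simple distribution to a complex RBM distribution is too hard; annealing slowly reduces the variance of the weights.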

Definition: Annealed Importance Sampling

Annealed Importance Sampling estimates a partition-function ratio by moving samples through a sequence of intermediate distributions between an easy base model and the target model.

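AIS has many implementation details, and these notes do not treat it as a core training algorithm. Still, keep in mind that when a paper reports RBM likelihood, it is often not computed exactly but estimated with AIS.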
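To make the weight accumulation concrete, here is a toy AIS sketch in plain Python. It is not an RBM estimator: the target is a hypothetical four-state model with hand-picked energies, the base distribution is uniform (\(E_0=0\), so \(Z_0=4\) exactly), and the MCMC transition is a Metropolis step with a uniform proposal. The number of runs, the linear \(\beta\) schedule, and the seed are arbitrary assumptions.

```python
import math
import random

E1 = [0.0, 0.5, 1.0, 2.0]      # hypothetical target energies over 4 states
n = len(E1)                    # uniform base: E0 = 0, so Z0 = n exactly

def metropolis_step(x, beta, rng):
    # Uniform symmetric proposal; leaves p_beta(x) ∝ exp(-beta * E1[x]) invariant.
    y = rng.randrange(n)
    accept = math.exp(min(0.0, -beta * (E1[y] - E1[x])))
    return y if rng.random() < accept else x

def ais_log_weight(betas, rng):
    x = rng.randrange(n)                      # exact sample from the base
    log_w = 0.0
    for prev, cur in zip(betas[:-1], betas[1:]):
        # Ratio of unnormalized densities f_cur(x) / f_prev(x) at the current state.
        log_w += -(cur - prev) * E1[x]
        x = metropolis_step(x, cur, rng)      # move under the new intermediate
    return log_w

def ais_estimate_Z(num_runs=2000, num_temps=20, seed=0):
    rng = random.Random(seed)
    betas = [m / num_temps for m in range(num_temps + 1)]
    weights = [math.exp(ais_log_weight(betas, rng)) for _ in range(num_runs)]
    ratio = sum(weights) / num_runs           # estimates Z1 / Z0
    return n * ratio

exact_Z = sum(math.exp(-e) for e in E1)
estimate = ais_estimate_Z()
print("exact Z:", exact_Z, "AIS estimate:", estimate)
```

On a state space this small the exact value is available for comparison; for a real RBM the same structure applies, but the transition would be a Gibbs step and the weights are usually accumulated in the log domain.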

RBM as a Shallow Latent Variable Model

An RBM can be viewed as a shallow latent variable model:

\[ p(\mathbf{v}) = \sum_{\mathbf{h}}p(\mathbf{v},\mathbf{h}). \]

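The hidden units learn latent factors that explain correlations among the visible units. Early Deep Belief Networks trained RBMs layer by layer, treating the hidden activations of one layer as the visible data of the next. This is no longer mainstream, but it introduced an important training paradigm:

  1. pretrain a representation with an unsupervised objective;
  2. initialize a deep model with a local learning rule;
  3. then adjust it with supervised fine-tuning.

This path was later inherited by large-scale self-supervised pretraining, except that the model changed from the RBM to the Transformer and the objective changed from contrastive divergence to next-token or denoising objectives.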

Why This Matters for Deep Learning

Boltzmann Machines are no longer a mainstream route for training large models, but they left behind several foundational ideas:

| Idea | Later echo |
| --- | --- |
| Energy function | EBMs, score matching, diffusion score networks |
| Hidden variables | VAE latent variables, representation learning |
| Positive/negative phase | Contrastive learning, noise-contrastive estimation |
| Gibbs sampling | MCMC, Langevin dynamics, denoising chains |
| Free energy | Variational bounds and normalized/unnormalized modeling |

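If the Hopfield network tells us that "memories can be energy minima," the Boltzmann Machine tells us that "a probability distribution can be defined by an energy landscape." Later VAEs, GANs, diffusion models, and flow matching all, in some sense, keep answering the same question: how can a model learn the data distribution without being trapped by an intractable normalization constant?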

Implementation Checklist

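When implementing an RBM / Boltzmann Machine, check:

  1. whether visible/hidden states are {0,1} or {-1,+1}, and whether the energy formula matches;
  2. whether W has shape [D,K] and the batch matrix multiplies line up;
  3. whether conditional probabilities use a stable sigmoid;
  4. whether the free energy uses softplus;
  5. whether the positive phase uses the probabilities \(p(h\mid v)\) rather than adding sampling noise for no reason;
  6. whether the negative phase comes from Gibbs/CD/PCD rather than directly reusing the data;
  7. whether \(k\) in CD-\(k\) is large enough, and whether fantasy samples are inspected;
  8. whether PCD particles actually persist and are not accidentally reinitialized;
  9. whether the learning rate lets the persistent chains keep up with the changing model;
  10. whether hidden activations collapse to all 0 or all 1;
  11. whether free-energy gap, pseudo-likelihood, and reconstruction error are monitored together;
  12. whether a tiny model has been sanity-checked against the exact partition function;
  13. if a likelihood is reported, whether it is stated to be exact, an AIS estimate, or a proxy metric.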