0.2 Boltzmann Machine

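The Boltzmann Machine takes the deterministic dynamical system of the Hopfield network and turns it into a probabilistic model. A Hopfield network is about "descending in energy until the state settles in an attractor"; a Boltzmann Machine is about "energy defines a probability distribution, and learning means placing the low-energy regions over the data".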

From Energy to Probability

Definition: Energy-Based Model

An energy-based model assigns each state \(\mathbf{x}\) an energy \(E_\theta(\mathbf{x})\) and defines a probability distribution \[ p_\theta(\mathbf{x}) =\frac{\exp(-E_\theta(\mathbf{x}))}{Z_\theta}, \qquad Z_\theta=\sum_{\mathbf{x}}\exp(-E_\theta(\mathbf{x})). \] The partition function \(Z_\theta\) normalizes the distribution.

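Here, the smaller \(E_\theta(\mathbf{x})\) is, the larger \(p_\theta(\mathbf{x})\) becomes. The idea closely mirrors the Boltzmann distribution in physics: a system prefers low-energy states, but temperature and random fluctuations let it occasionally visit high-energy ones.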
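
As a minimal sketch of the definition above, the following enumerates a hypothetical three-state system and converts energies into probabilities. The state names and energy values are made up for illustration; only the formula \(p=\exp(-E)/Z\) comes from the text.

```python
import math


def boltzmann_probs(energies):
    """Turn a dict of state -> energy into state -> probability via exp(-E)/Z."""
    # Subtracting the minimum energy rescales the numerator and Z by the same
    # factor, so the probabilities are unchanged while exp() cannot overflow.
    e_min = min(energies.values())
    weights = {s: math.exp(-(e - e_min)) for s, e in energies.items()}
    z = sum(weights.values())
    return {s: w / z for s, w in weights.items()}


# Hypothetical energies, chosen only for illustration.
energies = {"A": 0.0, "B": 1.0, "C": 3.0}
probs = boltzmann_probs(energies)
```

An energy difference of 1 between two states corresponds to a probability ratio of \(e\), independent of what the other states look like; only the absolute probabilities depend on \(Z\).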

A Boltzmann Machine usually contains visible units \(\mathbf{v}\) and hidden units \(\mathbf{h}\):

\[ \mathbf{v}\in\{0,1\}^{D}, \qquad \mathbf{h}\in\{0,1\}^{K}. \]

Visible units correspond to the data, and hidden units correspond to unobserved explanatory factors. The energy function can be written as

\[ E(\mathbf{v},\mathbf{h}) = -\mathbf{a}^{\top}\mathbf{v} -\mathbf{b}^{\top}\mathbf{h} -\mathbf{v}^{\top}W\mathbf{h} -\frac{1}{2}\mathbf{v}^{\top}U\mathbf{v} -\frac{1}{2}\mathbf{h}^{\top}V\mathbf{h}. \]

A Restricted Boltzmann Machine (RBM) removes the visible-visible and hidden-hidden connections, i.e. sets \(U=0\) and \(V=0\), which yields a bipartite graph:

\[ E(\mathbf{v},\mathbf{h}) = -\mathbf{a}^{\top}\mathbf{v} -\mathbf{b}^{\top}\mathbf{h} -\mathbf{v}^{\top}W\mathbf{h}. \]

Parameter Shapes and Sign Convention

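This section uses the binary \(\{0,1\}\) representation rather than the \(\{-1,+1\}\) spin representation from the Hopfield page. With this choice the RBM conditional probabilities come out directly as sigmoids. The tensor shapes used throughout are: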

v: [B, D]
h: [B, K]
a: [D]       visible bias
b: [K]       hidden bias
W: [D, K]

For a batch, the energy can be written as:

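def rbm_energy(v, h, a, b, w):
    # v: [B, D], h: [B, K], a: [D], b: [K], w: [D, K]; returns energies of shape [B]
    vbias = v @ a
    hbias = h @ b
    # Bilinear term v^T W h, computed separately for each row of the batch.
    interaction = ((v @ w) * h).sum(dim=-1)
    return -(vbias + hbias + interaction)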

Pitfall: Energy Sign Convention

Some texts define \(E=+\mathbf{v}^\top W\mathbf{h}-\mathbf{a}^\top\mathbf{v}-\mathbf{b}^\top\mathbf{h}\) or use \(\{-1,+1\}\) units. Always rederive the conditional probability after changing conventions; otherwise the update direction can silently flip.
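
As a worked example of why the rederivation matters, suppose the RBM energy is kept as written in this section but the hidden units are switched to spins \(h_j\in\{-1,+1\}\). This spin variant is an illustration only, not the convention used in these notes. Normalizing over the two spin values gives

\[ p(h_j=+1\mid\mathbf{v}) = \frac{e^{z_j}}{e^{z_j}+e^{-z_j}} = \sigma(2z_j), \qquad z_j=b_j+\sum_i W_{ij}v_i. \]

With \(\{0,1\}\) units the same energy gives \(\sigma(z_j)\), as derived in the next subsection. Reusing the binary formula with spin units would therefore drop the factor 2 and sample from the wrong distribution.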

Conditional Independence in RBM

The RBM matters because it gives up some expressive power in exchange for very clean conditional distributions.

Theorem: RBM Factorized Conditionals

For an RBM with binary units, \[ p(h_j=1\mid \mathbf{v})=\sigma\left(b_j+\sum_i W_{ij}v_i\right), \] and \[ p(v_i=1\mid \mathbf{h})=\sigma\left(a_i+\sum_j W_{ij}h_j\right). \] Thus all hidden units are conditionally independent given \(\mathbf{v}\), and all visible units are conditionally independent given \(\mathbf{h}\).

Proof

With \(\mathbf{v}\) fixed, the RBM energy splits into the terms that depend on \(\mathbf{h}\) and a constant:

\[ E(\mathbf{v},\mathbf{h}) =-\sum_j h_j \left(b_j+\sum_i W_{ij}v_i\right) +\text{const}(\mathbf{v}). \]

Therefore

\[ p(\mathbf{h}\mid \mathbf{v}) \propto \prod_j \exp\left( h_j \left(b_j+\sum_i W_{ij}v_i\right) \right), \]

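which factorizes over \(j\). Normalizing each binary factor gives a sigmoid: with \(z_j=b_j+\sum_i W_{ij}v_i\), \(p(h_j=1\mid\mathbf{v})=e^{z_j}/(1+e^{z_j})=\sigma(z_j)\). The same argument with the roles of \(\mathbf{v}\) and \(\mathbf{h}\) exchanged gives \(p(v_i=1\mid\mathbf{h})\).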
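
The proof can be checked numerically on a tiny RBM. The sketch below, with arbitrary sizes and random parameters chosen as assumptions for illustration, builds \(p(\mathbf{h}\mid\mathbf{v})\) by brute-force enumeration of all hidden states and compares it with the sigmoid formula.

```python
import itertools

import torch

torch.manual_seed(0)
D, K = 3, 2  # tiny sizes so that all 2**K hidden states can be enumerated
a, b, w = torch.randn(D), torch.randn(K), torch.randn(D, K)
v = torch.tensor([1.0, 0.0, 1.0])


def energy(v, h):
    # RBM energy for a single (v, h) pair in the {0,1} convention.
    return -(v @ a) - (h @ b) - (v @ w @ h)


# Enumerate every hidden configuration and normalize exp(-E) over h.
h_states = [torch.tensor(bits) for bits in itertools.product([0.0, 1.0], repeat=K)]
unnorm = torch.stack([torch.exp(-energy(v, h)) for h in h_states])
p_h_given_v = unnorm / unnorm.sum()

# Marginal p(h_j = 1 | v) read off the brute-force table.
brute = torch.stack([
    sum(p for p, h in zip(p_h_given_v, h_states) if h[j] == 1.0)
    for j in range(K)
])
closed_form = torch.sigmoid(b + v @ w)

# Factorization: the joint table equals the product of per-unit Bernoullis.
product = torch.stack([
    torch.prod(torch.where(h == 1.0, closed_form, 1.0 - closed_form))
    for h in h_states
])
```

Both `brute` against `closed_form` and `product` against `p_h_given_v` should agree up to floating-point error.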

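This factorization is also why RBMs were so popular in the early days of deep learning: unlike a general Boltzmann Machine, where sampling is hard enough to be practically unusable, an RBM lets us alternate between sampling \(\mathbf{h}\sim p(\mathbf{h}\mid\mathbf{v})\) and \(\mathbf{v}\sim p(\mathbf{v}\mid\mathbf{h})\).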

Stable Conditional Sampling

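In an implementation, do not hand-write 1 / (1 + exp(-x)): exp can overflow for inputs of large magnitude, and the same applies to a hand-written log(1 + exp(x)). PyTorch's torch.sigmoid and torch.nn.functional.softplus handle these cases more stably.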

import torch


def sample_bernoulli(prob):
    return torch.bernoulli(prob)


def sample_h_given_v(v, b, w):
    logits = b + v @ w
    prob = torch.sigmoid(logits)
    return prob, sample_bernoulli(prob)


def sample_v_given_h(h, a, w):
    logits = a + h @ w.T
    prob = torch.sigmoid(logits)
    return prob, sample_bernoulli(prob)

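Probabilities and binary samples serve different purposes: training statistics often use the probabilities, while the sampler uses binary samples: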

Quantity	Used for
\(p(h_j=1\mid v)\)	positive-phase expectation
sampled \(h_j\)	Gibbs chain transition
\(p(v_i=1\mid h)\)	reconstruction probability
sampled \(v_i\)	negative sample state

Pitfall: Probability and Sample Are Different Objects

For gradients, using probabilities can reduce variance in the positive phase. For Markov-chain transitions, the state must usually be sampled, otherwise the chain becomes deterministic mean-field dynamics.

Maximum Likelihood Learning

For a single data point \(\mathbf{v}\), the marginal probability is

\[ p_\theta(\mathbf{v}) = \sum_{\mathbf{h}} \frac{\exp(-E_\theta(\mathbf{v},\mathbf{h}))}{Z_\theta}. \]

The learning objective is to maximize the log-likelihood:

\[ \mathcal{L}(\theta) =\sum_{n=1}^{N}\log p_\theta(\mathbf{v}^{(n)}). \]

Differentiating with respect to the weight \(W_{ij}\) gives a very important form:

\[ \frac{\partial \log p_\theta(\mathbf{v})}{\partial W_{ij}} = \mathbb{E}_{p_\theta(\mathbf{h}\mid\mathbf{v})}[v_i h_j] - \mathbb{E}_{p_\theta(\mathbf{v},\mathbf{h})}[v_i h_j]. \]

Definition: Positive and Negative Phases

The positive phase raises probability around observed data by increasing correlations under \(p_\theta(\mathbf{h}\mid\mathbf{v})\). The negative phase lowers probability assigned to model-generated samples by subtracting correlations under \(p_\theta(\mathbf{v},\mathbf{h})\).

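The intuition is clear: the first term says "visible-hidden relationships that often occur together in the data should be strengthened"; the second says "samples the model hallucinates on its own should be pulled back if the model is overconfident about them". This is the basic tension in energy-based learning.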

Proof

For a single data point \(\mathbf{v}\),

\[ \log p_\theta(\mathbf{v}) = \log\sum_{\mathbf{h}}\exp(-E_\theta(\mathbf{v},\mathbf{h})) - \log Z_\theta. \]

Differentiating with respect to the parameters \(\theta\):

\[ \frac{\partial}{\partial\theta} \log\sum_{\mathbf{h}}\exp(-E_\theta(\mathbf{v},\mathbf{h})) = - \mathbb{E}_{p_\theta(\mathbf{h}\mid\mathbf{v})} \left[ \frac{\partial E_\theta(\mathbf{v},\mathbf{h})}{\partial\theta} \right]. \]

Likewise,

\[ \frac{\partial}{\partial\theta}\log Z_\theta = - \mathbb{E}_{p_\theta(\mathbf{v},\mathbf{h})} \left[ \frac{\partial E_\theta(\mathbf{v},\mathbf{h})}{\partial\theta} \right]. \]

Therefore

\[ \frac{\partial \log p_\theta(\mathbf{v})}{\partial\theta} = - \mathbb{E}_{p_\theta(\mathbf{h}\mid\mathbf{v})} \left[ \frac{\partial E_\theta}{\partial\theta} \right] + \mathbb{E}_{p_\theta(\mathbf{v},\mathbf{h})} \left[ \frac{\partial E_\theta}{\partial\theta} \right]. \]

For the RBM,

\[ \frac{\partial E}{\partial W_{ij}}=-v_ih_j. \]

Substituting gives

\[ \frac{\partial \log p_\theta(\mathbf{v})}{\partial W_{ij}} = \mathbb{E}_{p_\theta(\mathbf{h}\mid\mathbf{v})}[v_ih_j] - \mathbb{E}_{p_\theta(\mathbf{v},\mathbf{h})}[v_ih_j]. \]
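
On a model small enough to enumerate, the identity above can be compared against automatic differentiation of the exact log-likelihood. The following is a sketch under stated assumptions: the sizes, the random parameters, and the chosen observed state `v_index` are arbitrary, and the helper names are hypothetical.

```python
import itertools

import torch

torch.manual_seed(0)
D, K = 3, 2
a = torch.randn(D)
b = torch.randn(K)
w = torch.randn(D, K, requires_grad=True)


def all_states(n):
    # All 2**n binary vectors of length n, one per row.
    return torch.tensor(list(itertools.product([0.0, 1.0], repeat=n)))


v_all, h_all = all_states(D), all_states(K)  # [2^D, D], [2^K, K]


def neg_energy_table(w):
    # Entry [s, t] is -E(v_all[s], h_all[t]).
    return (v_all @ a)[:, None] + (h_all @ b)[None, :] + v_all @ w @ h_all.T


def log_p(v_index, w):
    table = neg_energy_table(w)
    log_z = torch.logsumexp(table.flatten(), dim=0)
    return torch.logsumexp(table[v_index], dim=0) - log_z


v_index = 5  # an arbitrary observed visible state
(autograd_grad,) = torch.autograd.grad(log_p(v_index, w), w)

with torch.no_grad():
    table = neg_energy_table(w)
    joint = torch.softmax(table.flatten(), dim=0).view_as(table)  # p(v, h)
    cond = torch.softmax(table[v_index], dim=0)                   # p(h | v)
    v = v_all[v_index]
    positive = torch.outer(v, cond @ h_all)  # E_{p(h|v)}[v h^T]
    negative = v_all.T @ joint @ h_all       # E_{p(v,h)}[v h^T]
    formula_grad = positive - negative
```

`autograd_grad` and `formula_grad` should match up to floating-point error. The same table-based check extends to the bias gradients.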

Bias Gradients and Mini-Batch Estimates

By the same argument, for the visible and hidden biases:

\[ \frac{\partial \log p_\theta(\mathbf{v})}{\partial a_i} = \mathbb{E}_{p_\theta(\mathbf{h}\mid\mathbf{v})}[v_i] - \mathbb{E}_{p_\theta(\mathbf{v},\mathbf{h})}[v_i], \]

\[ \frac{\partial \log p_\theta(\mathbf{v})}{\partial b_j} = \mathbb{E}_{p_\theta(\mathbf{h}\mid\mathbf{v})}[h_j] - \mathbb{E}_{p_\theta(\mathbf{v},\mathbf{h})}[h_j]. \]

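For a mini-batch \(\{v^{(n)}\}_{n=1}^B\), stack the data as the rows of a \(B\times D\) matrix \(V^+\) (not to be confused with the hidden-hidden coupling \(V\) of the general Boltzmann Machine). The positive phase can then compute the hidden probabilities exactly: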

\[ P_H^+ = \sigma(\mathbf{1}b^\top + V^+ W), \]

so that

\[ \widehat{\nabla_W^+} = \frac1B (V^+)^\top P_H^+. \]

The negative phase uses a Gibbs chain to obtain \(V^-,H^-\) or the probabilities \(P_H^-\):

\[ \widehat{\nabla_W^-} = \frac1B (V^-)^\top P_H^-. \]

The parameter update direction is

\[ \Delta W \propto \widehat{\nabla_W^+}-\widehat{\nabla_W^-}. \]

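def rbm_gradients(v_pos, v_neg, a, b, w):
    # v_pos, v_neg: [B, D] data batch and negative samples (same batch size).
    # `a` is not needed for these gradients.
    ph_pos = torch.sigmoid(b + v_pos @ w)
    ph_neg = torch.sigmoid(b + v_neg @ w)
    batch = v_pos.shape[0]
    grad_w = (v_pos.T @ ph_pos - v_neg.T @ ph_neg) / batch
    grad_a = (v_pos - v_neg).mean(dim=0)
    grad_b = (ph_pos - ph_neg).mean(dim=0)
    # Ascent directions for the log-likelihood: add them to the parameters.
    return grad_a, grad_b, grad_w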

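Note that this code is a "hand-written learning rule", not a standard autograd loss. The difficulty of the RBM is precisely that the negative phase needs samples from the model distribution, rather than a supervised loss in an ordinary forward graph.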

Partition Function and Intractability

The difficulty lies in \(Z_\theta\):

\[ Z_\theta = \sum_{\mathbf{v},\mathbf{h}} \exp(-E_\theta(\mathbf{v},\mathbf{h})). \]

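If the visible and hidden layers have \(D+K\) binary variables in total, the number of states is \(2^{D+K}\). Exact maximum likelihood therefore quickly becomes infeasible. Many difficulties of modern generative models already appear here in embryonic form: the model distribution can be written down, but normalization, sampling, or the likelihood gradient is hard.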
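
To get a feel for the growth, the sketch below counts the terms in the sum for two layer sizes. Both sizes are assumptions picked for illustration; neither appears in the text above.

```python
def num_joint_states(num_visible, num_hidden):
    """Number of terms in the sum defining Z for binary units."""
    return 2 ** (num_visible + num_hidden)


# Sizes below are illustrative assumptions.
tiny = num_joint_states(4, 3)             # small enough to enumerate
image_sized = num_joint_states(784, 500)  # e.g. a 28x28 binary image layer
digits = len(str(image_sized))            # decimal digits in the state count
```

The tiny model has 128 joint states, while the second count has hundreds of decimal digits, far beyond anything that can be enumerated.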

Exact Partition Function for Tiny RBM

A small model can be enumerated exactly, which is useful for validating an implementation:

def all_binary_states(n, device):
    values = torch.arange(2 ** n, device=device)
    bits = ((values[:, None] >> torch.arange(n, device=device)) & 1).float()
    return bits


def exact_log_partition(a, b, w):
    v_all = all_binary_states(a.numel(), a.device)
    h_all = all_binary_states(b.numel(), b.device)
    energies = []
    for h in h_all:
        h_batch = h.expand(v_all.shape[0], -1)
        energies.append(rbm_energy(v_all, h_batch, a, b, w))
    e = torch.cat(energies)
    return torch.logsumexp(-e, dim=0)

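This function is only usable for very small \(D+K\). Its value is as a test: do the free energy, the Gibbs sampler, and the CD update direction agree with exact-likelihood results at small scale?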

Pitfall: Tiny Exact Checks Do Not Scale

Exact enumeration is a unit test, not a training method. The state count doubles with every additional binary unit.

Contrastive Divergence

Contrastive Divergence (CD-\(k\)), proposed by Hinton, is the classic approximation:

Start from a real data point \(\mathbf{v}^{(0)}\).
Sample \(\mathbf{h}^{(0)}\sim p(\mathbf{h}\mid \mathbf{v}^{(0)})\).
Run \(k\) steps of alternating Gibbs sampling to obtain \(\mathbf{v}^{(k)},\mathbf{h}^{(k)}\).
Approximate the gradient with

\[ \Delta W \propto \mathbf{v}^{(0)}{\mathbf{h}^{(0)}}^\top - \mathbf{v}^{(k)}{\mathbf{h}^{(k)}}^\top. \]

Pitfall: CD Is Not Exact Maximum Likelihood

CD-\(k\) is biased because the negative sample is not drawn from the true model distribution unless the Markov chain has mixed. It works as a practical learning rule, but the objective it optimizes is only an approximation to maximum likelihood.

CD-k Training Step

A hand-written CD-\(k\) step:

def cd_k(v0, a, b, w, k):
    v = v0
    for _ in range(k):
        _, h = sample_h_given_v(v, b, w)
        _, v = sample_v_given_h(h, a, w)
    return v


@torch.no_grad()
def rbm_cd_update(v0, a, b, w, lr, k):
    vk = cd_k(v0, a, b, w, k)
    grad_a, grad_b, grad_w = rbm_gradients(v0, vk, a, b, w)
    a.add_(lr * grad_a)
    b.add_(lr * grad_b)
    w.add_(lr * grad_w)
    return vk

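torch.no_grad() is used here because we execute the RBM learning rule explicitly rather than optimizing a differentiable scalar loss through autograd. Even if the sampling procedure were placed inside autograd, gradients could not be backpropagated directly through a Bernoulli sample.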
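
A minimal usage sketch of a CD-1 training loop follows. It re-declares compact versions of the helpers so that it runs on its own; the function name `cd_update`, the toy data, and every hyperparameter (learning rate, number of epochs, batch size, initialization scale) are assumptions made for illustration, not recommended settings.

```python
import torch


def sample_h_given_v(v, b, w):
    prob = torch.sigmoid(b + v @ w)
    return prob, torch.bernoulli(prob)


def sample_v_given_h(h, a, w):
    prob = torch.sigmoid(a + h @ w.T)
    return prob, torch.bernoulli(prob)


@torch.no_grad()
def cd_update(v0, a, b, w, lr, k):
    # Negative sample: k Gibbs steps starting from the data batch.
    vk = v0
    for _ in range(k):
        _, h = sample_h_given_v(vk, b, w)
        _, vk = sample_v_given_h(h, a, w)
    ph0 = torch.sigmoid(b + v0 @ w)
    phk = torch.sigmoid(b + vk @ w)
    n = v0.shape[0]
    # In-place ascent step: positive statistics minus negative statistics.
    w.add_(lr * (v0.T @ ph0 - vk.T @ phk) / n)
    a.add_(lr * (v0 - vk).mean(dim=0))
    b.add_(lr * (ph0 - phk).mean(dim=0))


torch.manual_seed(0)
D, K = 6, 4
# Toy data (an assumption for illustration): two complementary patterns.
patterns = torch.tensor([[1., 1., 1., 0., 0., 0.],
                         [0., 0., 0., 1., 1., 1.]])
data = patterns[torch.randint(0, 2, (256,))]

a, b = torch.zeros(D), torch.zeros(K)
w = 0.01 * torch.randn(D, K)

for epoch in range(50):
    for batch in data.split(32):
        cd_update(batch, a, b, w, lr=0.1, k=1)

# One-step reconstruction probabilities for the two training patterns.
ph, _ = sample_h_given_v(patterns, b, w)
recon, _ = sample_v_given_h(ph, a, w)
```

Inspecting `recon` next to `patterns` gives a quick qualitative check, although low reconstruction error on its own does not show that the model distribution is good.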

Persistent Contrastive Divergence

CD-\(k\) restarts from the data at every update, so negative samples tend to stay too close to the data. Persistent CD maintains a set of fantasy particles instead:

@torch.no_grad()
def pcd_update(v0, particles, a, b, w, lr, k):
    v_neg = particles
    for _ in range(k):
        _, h_neg = sample_h_given_v(v_neg, b, w)
        _, v_neg = sample_v_given_h(h_neg, a, w)
    grad_a, grad_b, grad_w = rbm_gradients(v0, v_neg, a, b, w)
    a.add_(lr * grad_a)
    b.add_(lr * grad_b)
    w.add_(lr * grad_w)
    particles.copy_(v_neg)

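Persistent chains are closer to the model distribution, but they are also more prone to mixing problems: if the learning rate is too large, the model distribution keeps moving and the chains cannot keep up; if the energy landscape has deep basins, the chains get stuck.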
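
The sketch below shows one way to set up and reuse the particles across mini-batches. It is a self-contained variant, not the pcd_update above: the names `gibbs_step` and `pcd_step` are hypothetical, the data is random placeholder data, and the two phases are averaged separately so that the number of particles does not have to equal the data batch size (pcd_update, through rbm_gradients, assumes the two sizes match).

```python
import torch


def gibbs_step(v, a, b, w):
    # One block Gibbs sweep: sample h given v, then v given h.
    h = torch.bernoulli(torch.sigmoid(b + v @ w))
    return torch.bernoulli(torch.sigmoid(a + h @ w.T))


@torch.no_grad()
def pcd_step(v0, particles, a, b, w, lr, k=1):
    v_neg = particles
    for _ in range(k):
        v_neg = gibbs_step(v_neg, a, b, w)
    ph_pos = torch.sigmoid(b + v0 @ w)
    ph_neg = torch.sigmoid(b + v_neg @ w)
    # Positive and negative batches may differ in size, so average separately.
    w.add_(lr * (v0.T @ ph_pos / v0.shape[0] - v_neg.T @ ph_neg / v_neg.shape[0]))
    a.add_(lr * (v0.mean(dim=0) - v_neg.mean(dim=0)))
    b.add_(lr * (ph_pos.mean(dim=0) - ph_neg.mean(dim=0)))
    particles.copy_(v_neg)  # in-place: the chains survive into the next call


torch.manual_seed(0)
D, K, num_particles = 6, 4, 16
a, b, w = torch.zeros(D), torch.zeros(K), 0.01 * torch.randn(D, K)
data = torch.bernoulli(torch.full((64, D), 0.5))  # placeholder data

# Initialize the fantasy particles once, outside the training loop.
particles = torch.bernoulli(torch.full((num_particles, D), 0.5))
storage_before = particles.data_ptr()

for batch in data.split(32):
    pcd_step(batch, particles, a, b, w, lr=0.05)
```

The key point is that `particles` is created once and updated in place; recreating it inside the loop would silently turn PCD back into a short chain started from noise.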

Gibbs Sampling as Alternating Denoising

Block Gibbs sampling in an RBM is:

\[ \mathbf{h}^{(t)}\sim p(\mathbf{h}\mid\mathbf{v}^{(t)}), \]

\[ \mathbf{v}^{(t+1)}\sim p(\mathbf{v}\mid\mathbf{h}^{(t)}). \]

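This is similar in spirit to many later generative models: start from a corrupted or model-generated state and repeatedly correct it with a conditional distribution. The difference is that the RBM's conditionals come from an energy function, whereas in diffusion/denoising models they are usually predicted directly by a deep network.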

The \(k\) in Contrastive Divergence controls how far the negative sample moves from the data:

Method	Negative sample
CD-1	one-step reconstruction
CD-k	short Markov chain from data
Persistent CD	chain persists across updates
Exact ML	true model samples after mixing

CD-1 behaves much like an autoencoder's reconstruction pressure; Persistent CD is closer to the true negative phase, but the implementation has to maintain Markov chains.

Mixing Diagnostics

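Whether the Gibbs chain mixes is the most easily overlooked problem in RBM training. Common diagnostics:

Diagnostic	What to watch
reconstruction error	whether CD-1 has only learned local reconstruction
fantasy samples	whether the persistent particles are diverse
hidden activation rate	whether hidden units are all on or all off
free-energy gap	whether data free energy is lower than that of negative samples
autocorrelation	whether the chain stays stuck in one region for a long time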

With \(F(\mathbf{v})\) the free energy derived in the Free Energy section below, the free-energy gap can be written as

\[ \Delta F = \mathbb{E}_{v\sim\text{data}}F(v) - \mathbb{E}_{v\sim\text{neg}}F(v). \]

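A well-trained model should usually assign lower free energy to data, i.e. \(\Delta F<0\). But if the gap keeps widening while the fantasy samples degenerate, the negative chains may not be mixing, and the model is "talking to itself".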
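
The autocorrelation diagnostic from the table can be sketched as follows. The function name `autocorrelation` and the two synthetic traces are assumptions for illustration; in practice the trace would be a scalar statistic recorded along a Gibbs chain, such as the free energy or the mean hidden activation at each step.

```python
import torch


def autocorrelation(trace, lag):
    """Sample autocorrelation of a 1-D trace at a positive integer lag."""
    x = trace - trace.mean()
    var = (x * x).mean()
    return (x[:-lag] * x[lag:]).mean() / var


# Two synthetic traces (assumptions for illustration, not real chain output).
alternating = torch.tensor([0.0, 1.0] * 50)             # flips every step
stuck = torch.cat([torch.zeros(50), torch.ones(50)])    # one jump in 100 steps

rho_alternating = autocorrelation(alternating, lag=1)
rho_stuck = autocorrelation(stuck, lag=1)
```

A value near 1 that decays only slowly as the lag grows, as for `stuck`, suggests that the chain stays in one region for a long time and that nearby samples carry little new information.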

Free Energy

The hidden units of an RBM can be summed out analytically, giving the free energy of a visible state:

\[ F(\mathbf{v}) = -\mathbf{a}^{\top}\mathbf{v} -\sum_j \log \left( 1+\exp(b_j+W_{:j}^{\top}\mathbf{v}) \right). \]

so that

\[ p(\mathbf{v}) = \frac{\exp(-F(\mathbf{v}))}{Z}. \]

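Free energy is a good entry point to understanding the RBM: hidden units act like a set of soft feature detectors; if some hidden features explain \(\mathbf{v}\) well, the corresponding softplus terms lower \(F(\mathbf{v})\) and raise the probability of the data.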

Proof

The RBM marginal probability is

\[ p(\mathbf{v}) \propto \sum_{\mathbf{h}}\exp(-E(\mathbf{v},\mathbf{h})). \]

Substituting the energy:

\[ -E(\mathbf{v},\mathbf{h}) = \mathbf{a}^\top\mathbf{v} + \sum_j h_j(b_j+W_{:j}^\top\mathbf{v}). \]

The sum over hidden units factorizes, because they are conditionally independent given \(\mathbf{v}\):

\[ \sum_{\mathbf{h}} \exp(-E) = \exp(\mathbf{a}^\top\mathbf{v}) \prod_j \sum_{h_j\in\{0,1\}} \exp(h_j(b_j+W_{:j}^\top\mathbf{v})). \]

The binary sum gives

\[ \sum_{h_j\in\{0,1\}} \exp(h_j z_j) = 1+\exp(z_j). \]

Hence

\[ \sum_{\mathbf{h}}\exp(-E) = \exp\left( \mathbf{a}^\top\mathbf{v} + \sum_j\log(1+\exp(b_j+W_{:j}^\top\mathbf{v})) \right). \]

Defining \(p(\mathbf{v})\propto \exp(-F(\mathbf{v}))\) yields the free-energy formula.

Stable Free-Energy Implementation

The \(\log(1+\exp x)\) in the free energy should be computed with softplus:

\[ \operatorname{softplus}(x)=\log(1+e^x). \]

import torch.nn.functional as F


def free_energy(v, a, b, w):
    hidden_logits = b + v @ w
    return -(v @ a) - F.softplus(hidden_logits).sum(dim=-1)
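
One way to test a free-energy implementation is to check that it reproduces the exact partition function of a tiny model: summing \(\exp(-F(\mathbf{v}))\) over all visible states must equal summing \(\exp(-E(\mathbf{v},\mathbf{h}))\) over all joint states. The sketch below assumes arbitrary small sizes and random parameters, and redefines the helpers so that it runs on its own.

```python
import itertools

import torch
import torch.nn.functional as F

torch.manual_seed(0)
D, K = 4, 3
a, b, w = torch.randn(D), torch.randn(K), torch.randn(D, K)


def all_states(n):
    # All 2**n binary vectors of length n, one per row.
    return torch.tensor(list(itertools.product([0.0, 1.0], repeat=n)))


def free_energy(v, a, b, w):
    return -(v @ a) - F.softplus(b + v @ w).sum(dim=-1)


v_all, h_all = all_states(D), all_states(K)

# log Z from the joint: sum over all (v, h) pairs of exp(-E(v, h)).
neg_energy = (v_all @ a)[:, None] + (h_all @ b)[None, :] + v_all @ w @ h_all.T
log_z_joint = torch.logsumexp(neg_energy.flatten(), dim=0)

# log Z from the free energy: sum over v only of exp(-F(v)).
log_z_free = torch.logsumexp(-free_energy(v_all, a, b, w), dim=0)
```

The two values should agree up to floating-point error. A mismatch usually points to a sign error or a transposed weight matrix.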

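The free_energy function can also serve as a binary-classification-style diagnostic that contrasts data with negative samples: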

def free_energy_gap(v_data, v_neg, a, b, w):
    return free_energy(v_data, a, b, w).mean() - free_energy(v_neg, a, b, w).mean()

Definition: Free-Energy Gap

The free-energy gap is the difference between average free energy on data samples and average free energy on negative/model samples.

Pseudo-Likelihood

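The exact likelihood requires \(Z\), but pseudo-likelihood can be used for cheap monitoring. Pick a visible bit \(i\) at random and compare the free energy of the original sample with that of the sample with bit \(i\) flipped: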

\[ \log p(v_i\mid v_{\setminus i}) = \log\sigma(F(v^{\text{flip }i})-F(v)). \]

Intuition: if flipping a true bit raises the free energy, the model considers the original bit more plausible.

def pseudo_likelihood(v, a, b, w, bit_idx):
    v_flip = v.clone()
    v_flip[:, bit_idx] = 1.0 - v_flip[:, bit_idx]
    return F.logsigmoid(free_energy(v_flip, a, b, w) - free_energy(v, a, b, w))

Pseudo-likelihood is not the exact likelihood, but it does not need the partition function, so it is well suited to tracking trends during training.
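
A common monitoring variant picks a different random bit for every row and scales the result by \(D\), which gives a stochastic estimate of \(\sum_i \log p(v_i\mid v_{\setminus i})\). The sketch below is one possible way to write it; the function name `stochastic_pseudo_log_likelihood` is hypothetical, and the zero-parameter model is used only because its answer is known in closed form.

```python
import torch
import torch.nn.functional as F


def free_energy(v, a, b, w):
    return -(v @ a) - F.softplus(b + v @ w).sum(dim=-1)


def stochastic_pseudo_log_likelihood(v, a, b, w):
    """Per-row estimate of the sum over bits of log p(v_i | other bits)."""
    batch, num_visible = v.shape
    rows = torch.arange(batch)
    bit_idx = torch.randint(0, num_visible, (batch,))  # one random bit per row
    v_flip = v.clone()
    v_flip[rows, bit_idx] = 1.0 - v_flip[rows, bit_idx]
    log_cond = F.logsigmoid(free_energy(v_flip, a, b, w) - free_energy(v, a, b, w))
    # Scaling by D turns the single-bit term into an estimate of the sum over bits.
    return num_visible * log_cond


torch.manual_seed(0)
D, K = 5, 3
v = torch.bernoulli(torch.full((8, D), 0.5))
# With all parameters zero every conditional is 0.5, so each row gives D * log(0.5).
baseline = stochastic_pseudo_log_likelihood(
    v, torch.zeros(D), torch.zeros(K), torch.zeros(D, K)
)
```

The zero-parameter value \(D\log 0.5\) is a useful reference point: a trained model should score noticeably higher than this on held-out data.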

Temperature and Sampling

The Boltzmann distribution often carries a temperature:

\[ p_T(\mathbf{x}) = \frac{\exp(-E(\mathbf{x})/T)}{Z_T}. \]

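When \(T\) is large the distribution is flatter and sampling is more random; when \(T\) is small the distribution concentrates on low-energy states. Simulated annealing lowers \(T\) gradually: explore first, then converge.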

Definition: Temperature in Energy Models

Temperature rescales energy differences. Lower temperature sharpens the distribution around low-energy states; higher temperature smooths the distribution and encourages exploration.

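Temperature here is formally similar to the temperature used in LLM decoding today: both rescale logits/energies and thereby change the entropy of the sampling distribution. The difference is that the Boltzmann Machine temperature acts on the energy of the global state, whereas the LLM temperature acts on next-token logits.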
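
The effect of temperature on entropy can be seen on a handful of states. In this sketch the energy levels and the three temperatures are hypothetical values chosen for illustration.

```python
import torch


def tempered_probs(energies, temperature):
    """Boltzmann probabilities exp(-E/T)/Z_T over a finite list of states."""
    return torch.softmax(-energies / temperature, dim=0)


def entropy(p):
    # Clamping avoids log(0) for states whose probability underflows.
    return -(p * torch.log(p.clamp_min(1e-30))).sum()


energies = torch.tensor([0.0, 1.0, 2.0, 4.0])  # hypothetical energy levels
cold = tempered_probs(energies, temperature=0.1)
unit = tempered_probs(energies, temperature=1.0)
hot = tempered_probs(energies, temperature=10.0)
```

At low temperature nearly all of the mass sits on the lowest-energy state, while at high temperature the distribution approaches uniform and its entropy approaches \(\log 4\).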

Annealed Importance Sampling Intuition

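To estimate \(Z_\theta\), a classic idea is to move smoothly from a base distribution \(p_0\) that is easy to normalize to the target distribution \(p_1\). Define the intermediate distributions: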

\[ p_{\beta}(\mathbf{x}) \propto \exp\left( -(1-\beta)E_0(\mathbf{x})-\beta E_1(\mathbf{x}) \right), \qquad \beta\in[0,1]. \]

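Annealed Importance Sampling (AIS) uses a schedule \(\beta_0=0<\beta_1<\cdots<\beta_M=1\), performs an MCMC transition near each intermediate distribution, and accumulates importance weights. Intuitively, jumping directly from a simple distribution to a complex RBM distribution is too hard; annealing slowly reduces the variance of the weights.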

Definition: Annealed Importance Sampling

Annealed Importance Sampling estimates a partition-function ratio by moving samples through a sequence of intermediate distributions between an easy base model and the target model.

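AIS has many implementation details, and these notes do not treat it as a core training algorithm. But be aware: when a paper reports RBM likelihoods, they are often AIS estimates rather than exact values.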
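
For intuition only, here is a minimal AIS sketch for a tiny RBM, checked against exact enumeration. It makes several simplifying assumptions that are not from the text: the base model is the uniform distribution (\(E_0=0\)), so the intermediate model at \(\beta\) is the RBM with all parameters scaled by \(\beta\); the schedule is linear; there is one Gibbs sweep per temperature; and the sizes, chain count, and parameter scale are arbitrary. Practical implementations usually choose a better base distribution and schedule.

```python
import itertools
import math

import torch
import torch.nn.functional as F


def free_energy(v, a, b, w, beta=1.0):
    # Free energy of the RBM whose parameters are all scaled by beta.
    return -beta * (v @ a) - F.softplus(beta * (b + v @ w)).sum(dim=-1)


def gibbs_step(v, a, b, w, beta):
    h = torch.bernoulli(torch.sigmoid(beta * (b + v @ w)))
    return torch.bernoulli(torch.sigmoid(beta * (a + h @ w.T)))


@torch.no_grad()
def ais_log_partition(a, b, w, num_chains=512, num_betas=200):
    D, K = w.shape
    betas = torch.linspace(0.0, 1.0, num_betas)
    # beta = 0 is the uniform base model: exact samples, log Z_0 = (D + K) log 2.
    v = torch.bernoulli(torch.full((num_chains, D), 0.5))
    log_weights = torch.zeros(num_chains)
    for beta_prev, beta_next in zip(betas[:-1], betas[1:]):
        # Importance-weight update, evaluated at the current state.
        log_weights += free_energy(v, a, b, w, beta_prev) - free_energy(v, a, b, w, beta_next)
        # One Gibbs sweep that leaves the beta_next distribution invariant.
        v = gibbs_step(v, a, b, w, beta_next)
    log_z0 = (D + K) * math.log(2.0)
    # log Z is approximated by log Z_0 plus the log of the mean importance weight.
    return log_z0 + torch.logsumexp(log_weights, dim=0) - math.log(num_chains)


def exact_log_partition(a, b, w):
    v_all = torch.tensor(list(itertools.product([0.0, 1.0], repeat=w.shape[0])))
    return torch.logsumexp(-free_energy(v_all, a, b, w), dim=0)


torch.manual_seed(0)
D, K = 4, 3
a, b, w = 0.5 * torch.randn(D), 0.5 * torch.randn(K), 0.5 * torch.randn(D, K)
estimate = ais_log_partition(a, b, w)
exact = exact_log_partition(a, b, w)
```

On a model this small the estimate should land close to the exact value. On realistic models there is no exact reference, which is why the schedule length and the number of chains matter.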

RBM as a Shallow Latent Variable Model

An RBM can be viewed as a shallow latent variable model:

\[ p(\mathbf{v}) = \sum_{\mathbf{h}}p(\mathbf{v},\mathbf{h}). \]

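The hidden units learn latent factors that explain the correlations among visible units. Early Deep Belief Networks trained RBMs layer by layer, treating the hidden activations of one layer as the visible data of the next. This is no longer mainstream today, but it introduced an important training paradigm:

pretrain a representation with an unsupervised objective;
initialize a deep model with a local learning rule;
then adjust it with supervised fine-tuning.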

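This route was later inherited by large-scale self-supervised pretraining, except that the model changed from an RBM to a Transformer and the objective from contrastive divergence to next-token or denoising objectives.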

Why This Matters for Deep Learning

The Boltzmann Machine is no longer a mainstream route for training large models, but it left behind several foundational ideas:

Idea	Later echo
Energy function	EBMs, score matching, diffusion score networks
Hidden variables	VAE latent variables, representation learning
Positive/negative phase	Contrastive learning, noise-contrastive estimation
Gibbs sampling	MCMC, Langevin dynamics, denoising chains
Free energy	Variational bounds and normalized/unnormalized modeling

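If the Hopfield network tells us that "memories can be energy minima", the Boltzmann Machine tells us that "a probability distribution can be defined by an energy landscape". Later VAEs, GANs, diffusion models, and flow matching all, in some sense, keep answering the same question: how can a model learn the data distribution without being trapped by an intractable normalization constant?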

Implementation Checklist

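When implementing an RBM / Boltzmann Machine, check:

whether visible/hidden states are {0,1} or {-1,+1}, and whether the energy formula matches;
whether W has shape [D,K] and the batch matrix multiplications line up;
whether conditional probabilities use a stable sigmoid;
whether the free energy uses softplus;
whether the positive phase uses the probabilities \(p(h\mid v)\) rather than adding sampling noise for no benefit;
whether the negative phase comes from Gibbs/CD/PCD rather than directly reusing the data;
whether the \(k\) in CD-\(k\) is large enough, and whether fantasy samples are being inspected;
whether PCD particles are persistent and not accidentally reinitialized;
whether the learning rate lets the persistent chains keep up with the changing model;
whether hidden activations have collapsed to all 0 or all 1;
whether free-energy gap, pseudo-likelihood, and reconstruction error are examined together;
whether a tiny model has been sanity-checked against the exact partition function;
if a likelihood is reported, whether it is stated to be exact, an AIS estimate, or a proxy metric.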