0.1 Hopfield Network
Hopfield network 是早期 neural computation 中最漂亮的一类模型:它把神经元看成二值自旋,把记忆看成能量函数的局部极小点,把联想回忆看成一个离散动力系统向 attractor basin 的收敛过程。它不像今天的深度网络那样强调层级表征,而是强调一个问题:如果一组模式已经写入连接权重,那么一个带噪声的输入能否自动回到最相近的记忆?
Associative Memory
Given \(p\) stored patterns \(\{\boldsymbol{\xi}^{\mu}\}_{\mu=1}^{p}\) with \(\boldsymbol{\xi}^{\mu} \in \{-1,+1\}^{N}\), an associative memory maps a corrupted cue \(\mathbf{s}\) to the stored pattern \(\boldsymbol{\xi}^{\mu}\) that lies in the same basin of attraction under the network dynamics.
这里的关键不只是“分类”,而是 content-addressable memory:地址不是一个显式 index,而是输入内容本身。给你一张残缺的人脸、一个缺字的单词、一个噪声图案,系统应该依靠动态收敛把它补回某个已存模式。
令 \(s_i \in \{-1,+1\}\) 表示第 \(i\) 个 neuron 的状态。常见的 \(0/1\) 表示可以通过
\[ s_i = 2 n_i - 1 \]
转成 spin-like 的 \(\pm 1\) 表示。这个变换的好处是对称、均值更容易接近 \(0\),也让统计物理里的 Ising model 语言可以直接搬过来。
Model
Hopfield network 是一个全连接、无自连接、对称权重的 recurrent network:
\[ w_{ij} = w_{ji}, \qquad w_{ii}=0. \]
给定当前状态 \(\mathbf{s}\),第 \(i\) 个神经元感受到的 local field 为
\[ h_i(\mathbf{s}) = \sum_{j \neq i} w_{ij}s_j - \theta_i, \]
其中 \(\theta_i\) 是 threshold。异步更新规则为
\[ s_i \leftarrow \operatorname{sgn}(h_i(\mathbf{s})). \]
这个模型的审美在于:它没有显式写出“loss”,但它事实上在下降一个能量函数;它没有 optimizer,但异步状态更新就是一个 coordinate descent。
For a Hopfield network with symmetric weights and no self-connections, the energy is \[ E(\mathbf{s}) = -\frac{1}{2}\sum_{i \neq j} w_{ij}s_i s_j + \sum_i \theta_i s_i. \] The stable memories of the network are local minima of \(E\).
State Convention and Tie-Breaking
实现 Hopfield network 时,第一个容易被忽略的细节是 \(\operatorname{sgn}(0)\)。数学上局部场刚好为 \(0\) 时,\(s_i=+1\) 或 \(-1\) 都不增加能量;工程上必须固定约定,否则不同库或不同实验会出现不可复现差异。
常见选择:
| Tie rule | Update when \(h_i=0\) | Consequence |
|---|---|---|
| keep | keep current \(s_i\) | energy non-increasing, least arbitrary |
| positive | set \(s_i=+1\) | deterministic but can bias states |
| random | sample \(\pm1\) | stochastic dynamics, harder to reproduce |
本文后面的实现默认使用 keep rule:
\[ s_i' = \begin{cases} +1,&h_i>0,\\ -1,&h_i<0,\\ s_i,&h_i=0. \end{cases} \]
Changing the tie rule changes the discrete dynamics. If two implementations disagree on \(\operatorname{sgn}(0)\), they may retrieve different attractors from the same cue.
Why Asynchronous Updates Converge
考虑一次只更新 \(s_i\),其他 \(s_j\) 固定。能量中与 \(s_i\) 有关的部分是
\[ E_i(s_i)= -s_i \sum_{j\neq i}w_{ij}s_j+\theta_i s_i = -s_i h_i. \]
如果更新后 \(s_i'=\operatorname{sgn}(h_i)\),能量变化为
\[ \Delta E =E_i(s_i')-E_i(s_i) =-(s_i'-s_i)h_i \leq 0. \]
If \(w_{ij}=w_{ji}\) and \(w_{ii}=0\), every asynchronous Hopfield update weakly decreases the energy \(E(\mathbf{s})\). Since the state space \(\{-1,+1\}^{N}\) is finite, the asynchronous dynamics converges to a fixed point.
Proof
每一步更新只会改变一个 \(s_i\)。如果当前符号已经与 \(h_i\) 一致,则 \(s_i'=s_i\),能量不变;如果不一致,则 \((s_i'-s_i)h_i>0\),于是 \(\Delta E<0\)。由于可取状态只有 \(2^N\) 个,能量不能无限严格下降,因此过程最终停在所有 neuron 都满足 \(s_i=\operatorname{sgn}(h_i)\) 的 fixed point。
这个证明也解释了为什么 synchronous update 不那么干净:如果所有 neuron 同时更新,局部能量下降的 coordinate descent 结构会被打破,系统可能出现 2-cycle。
Fixed Point, Margin, and Bit Stability
固定点条件可以写成
\[ s_i=\operatorname{sgn}(h_i), \qquad \text{equivalently} \qquad s_i h_i\ge 0. \]
更有用的是 margin:
\[ m_i=s_i h_i. \]
如果 \(m_i\) 很大,说明第 \(i\) 位很稳定;如果 \(m_i\) 接近 \(0\),一点 crosstalk noise 就可能翻转它。对整个状态可以看最小 margin:
\[ m_{\min}(\mathbf{s}) = \min_i s_i h_i. \]
The stability margin of bit \(i\) under state \(\mathbf{s}\) is \(m_i=s_i h_i(\mathbf{s})\). A state is a fixed point under keep-on-zero tie-breaking if \(m_i\ge0\) for all \(i\).
这比只问“是否 fixed point”更细。两个 pattern 都是 fixed point,但一个最小 margin 很大、另一个最小 margin 接近 \(0\),后者的 basin 往往更脆弱。
Hebbian Storage Rule
最经典的写入规则是 Hebbian outer product:
\[ w_{ij} = \frac{1}{N}\sum_{\mu=1}^{p}\xi_i^{\mu}\xi_j^{\mu}, \qquad w_{ii}=0. \]
它的直觉非常朴素:如果两个神经元在很多 pattern 中同号,就增加正连接;如果经常异号,就增加负连接。这正是 “neurons that fire together wire together” 的数学化版本。
为了看出它为什么可行,假设我们希望回忆第 \(\nu\) 个 pattern,并把当前状态设成 \(\mathbf{s}=\boldsymbol{\xi}^{\nu}\)。local field 为
\[ \begin{aligned} h_i &= \sum_{j\neq i} w_{ij}\xi_j^\nu \\ &= \frac{1}{N}\sum_{j\neq i} \sum_{\mu=1}^{p}\xi_i^\mu \xi_j^\mu \xi_j^\nu \\ &= \xi_i^\nu + \frac{1}{N}\sum_{\mu\neq \nu}\xi_i^\mu \sum_{j\neq i}\xi_j^\mu \xi_j^\nu. \end{aligned} \]
第一项是 signal,方向正好指向目标 pattern;第二项是其他 pattern 造成的 crosstalk noise。
In Hopfield storage, crosstalk noise is the interference term produced by all non-target patterns: \[ \eta_i^\nu = \frac{1}{N}\sum_{\mu\neq \nu}\xi_i^\mu \sum_{j\neq i}\xi_j^\mu \xi_j^\nu. \] Successful recall requires \(|\eta_i^\nu|\) to be small enough that \(\operatorname{sgn}(h_i)=\xi_i^\nu\).
这就是 Hopfield network 和现代 representation learning 的一个共同主题:capacity 不是免费来的。你把更多模式塞进同一组权重,模型就会开始把不同模式混在一起。
Vectorized Storage and Diagonal Removal
把所有 pattern 堆成矩阵
\[ X = \begin{bmatrix} (\boldsymbol{\xi}^1)^\top\\ \cdots\\ (\boldsymbol{\xi}^p)^\top \end{bmatrix} \in\{-1,+1\}^{p\times N}, \]
Hebbian 权重就是
\[ W=\frac1N X^\top X, \qquad \operatorname{diag}(W)=0. \]
为什么要去掉 diagonal?若保留 \(w_{ii}\),则 local field 包含自反馈 \(w_{ii}s_i\)。在 Hebbian rule 下 \(w_{ii}=p/N\),它会让当前位倾向保持原样,从而把“纠错”变成“自我确认”。去掉 diagonal 后,每一位必须由其他位来支持。
import numpy as np
def hebbian_weights(patterns):
# patterns: [p, n], entries {-1, +1}
p, n = patterns.shape
w = patterns.T @ patterns / n
np.fill_diagonal(w, 0.0)
return wCentered and Sparse Patterns
经典推导常假设 pattern 是独立均匀的 \(\pm1\),因此每一位均值为 \(0\)。如果 pattern 很 sparse,例如大多数位都是 \(-1\),直接 Hebbian outer product 会学到强全局偏置。更稳的写法是 centered Hebbian rule:
\[ w_{ij} = \frac1N \sum_{\mu=1}^p (\xi_i^\mu-a_i)(\xi_j^\mu-a_j), \]
其中 \(a_i\) 是第 \(i\) 位在训练 pattern 上的平均活动:
\[ a_i=\frac1p\sum_{\mu=1}^p \xi_i^\mu. \]
这和现代深度学习里的 centering/normalization 是同一个直觉:如果输入本身带强均值偏置,correlation matrix 会先记住均值,而不是记住真正区分 pattern 的结构。
Capacity
如果 stored patterns 是独立随机的 \(\pm 1\) pattern,那么经典 Hopfield network 的可用容量大约是
\[ p_{\max} \approx 0.138N. \]
这不是说超过 \(0.138N\) 就完全不能用,而是当 \(p/N\) 接近这个常数时,spurious minima、recall error 和 basin shrinkage 会迅速变得严重。
Hopfield capacity is not simply the number of pairwise weights \(O(N^2)\). The actual reliable pattern capacity is controlled by interference under the chosen storage rule and update dynamics. For the classical Hebbian rule with random dense patterns, the practical capacity scales as \(O(N)\).
Signal-to-Noise View of Recall
前面的 local field 分解
\[ h_i=\xi_i^\nu+\eta_i^\nu \]
给出了最重要的分析入口。若其他 patterns 独立随机,则对 \(\mu\ne\nu\),
\[ \sum_{j\neq i}\xi_j^\mu\xi_j^\nu \]
是 \(N-1\) 个独立 \(\pm1\) 随机变量的和,均值约为 \(0\),方差约为 \(N\)。外面再乘 \(\xi_i^\mu/N\),每个非目标 pattern 对 noise 的方差约为
\[ \frac{1}{N^2}N=\frac1N. \]
共有 \(p-1\) 个干扰 pattern,因此
\[ \operatorname{Var}(\eta_i^\nu)\approx\frac{p}{N}. \]
令 load factor
\[ \alpha=\frac{p}{N}. \]
则 noise scale 约为 \(\sqrt{\alpha}\)。成功回忆要求 signal \(1\) 大于 noise fluctuation。更严格的统计物理分析会给出 classical capacity 常数约 \(0.138\),但这个 signal-to-noise 计算已经解释了为什么容量是 \(O(N)\),不是 \(O(N^2)\)。
The load factor of a Hopfield network is \[ \alpha=\frac{p}{N}, \] where \(p\) is the number of stored patterns and \(N\) is the number of neurons. Increasing \(\alpha\) increases crosstalk noise and shrinks attraction basins.
Overlap Order Parameter
为了量化当前状态 \(\mathbf{s}\) 和某个 stored pattern \(\boldsymbol{\xi}^{\nu}\) 的相似度,定义 overlap:
\[ m^\nu(\mathbf{s}) = \frac1N \sum_{i=1}^N \xi_i^\nu s_i. \]
如果 \(\mathbf{s}=\boldsymbol{\xi}^\nu\),则 \(m^\nu=1\);如果完全相反,则 \(m^\nu=-1\);如果近似随机无关,则 \(m^\nu\approx0\)。Hamming distance 和 overlap 的关系是
\[ d_H(\mathbf{s},\boldsymbol{\xi}^{\nu}) = \frac{N}{2}(1-m^\nu). \]
The overlap between a state \(\mathbf{s}\) and a stored pattern \(\boldsymbol{\xi}^{\nu}\) is \(m^\nu(\mathbf{s})=\frac1N\sum_i \xi_i^\nu s_i\).
用 overlap 看 retrieval,比只看最终是否完全相等更平滑。比如图像 pattern 有少量 bit 错误时,exact match 是 \(0\),但 overlap 仍能显示系统已经靠近正确 attractor。
Mean-Field Recall Intuition
若当前状态和目标 pattern 的 overlap 是 \(m^\nu=m\),可以粗略写
\[ s_j\approx m\xi_j^\nu+\text{noise}. \]
代入 local field:
\[ h_i = \frac1N\sum_{\mu,j} \xi_i^\mu\xi_j^\mu s_j \approx \xi_i^\nu m +\eta_i. \]
于是 signal 从干净 pattern 的 \(1\) 变成了 \(m\)。带噪 cue 的 retrieval 能否继续靠近目标,取决于 signal \(m\) 是否压过 crosstalk noise \(\eta_i\)。这解释了 basin:初始 overlap 太小,系统就可能被其他 attractor 或 spurious mixture 捕获。
Basin of Attraction
一个 stored pattern 是否“可用”,不只看它是不是 fixed point,还要看它的 basin 多大。给定目标 pattern \(\boldsymbol{\xi}^\nu\),Hamming distance 为
\[ d_H(\mathbf{s},\boldsymbol{\xi}^{\nu}) = \frac12 \sum_i (1-s_i\xi_i^\nu). \]
如果从所有满足
\[ d_H(\mathbf{s},\boldsymbol{\xi}^{\nu})\le r \]
的 cue 出发都能回到 \(\boldsymbol{\xi}^{\nu}\),那么半径 \(r\) 可以看作一个经验 basin size。pattern 装得越多,crosstalk noise 越大,basin 越小。
实验上可以画 retrieval curve:
| Corruption ratio | Recall accuracy |
|---|---|
| 0% | fixed point test |
| 5% | small-noise correction |
| 20% | basin robustness |
| 40% | near-random cue |
这比只报告“存了多少 pattern”更真实。联想记忆的价值在于从残缺 cue 回忆,而不是只在干净 pattern 上不动。
Empirical Basin Protocol
一个可复现实验至少应固定:
- random seed;
- neuron 数 \(N\);
- pattern 数 \(p\) 或 load factor \(\alpha=p/N\);
- corruption ratio;
- 异步更新顺序;
- convergence criterion;
- 成功标准,是 exact match、最大 overlap,还是 Hamming distance threshold。
def corrupt(pattern, ratio, rng):
s = pattern.copy()
n_flip = int(round(ratio * len(s)))
idx = rng.choice(len(s), size=n_flip, replace=False)
s[idx] *= -1
return s
def overlap(state, pattern):
return float(np.mean(state * pattern))对每个 corruption ratio,重复多次 cue:
\[ \operatorname{success} = \mathbf{1} \left[ \arg\max_\mu m^\mu(\mathbf{s}_{\text{final}}) = \nu \right]. \]
如果要求 exact match,可以再加条件 \(m^\nu=1\)。对高噪声 cue,最大-overlap 成功率和 exact-match 成功率可能差很多;前者衡量“回到正确 basin”,后者衡量“完全清除所有 bit error”。
Spurious Attractors
即使每个 stored pattern 都是局部极小点,能量地形里也会出现额外的 attractors。典型 spurious state 包括若干 pattern 的 mixture:
\[ \mathbf{s} = \operatorname{sgn} \left( \boldsymbol{\xi}^{\mu_1} +\boldsymbol{\xi}^{\mu_2} +\boldsymbol{\xi}^{\mu_3} \right). \]
这些状态不是我们主动写入的记忆,却可能成为稳定点。它们在认知类比上像“混合记忆”,在优化类比上像 non-convex landscape 中的 bad local minima。
Detecting Spurious Retrieval
给定最终状态 \(\mathbf{s}\),可以计算所有 overlaps:
\[ \mathbf{m} = \frac1N X\mathbf{s}. \]
若最大 overlap 不接近 \(1\),但状态已经是 fixed point,则它很可能是 spurious attractor:
def classify_attractor(patterns, state, threshold=0.95):
overlaps = patterns @ state / patterns.shape[1]
best = int(np.argmax(overlaps))
if overlaps[best] >= threshold:
return "stored", best, float(overlaps[best])
return "spurious", best, float(overlaps[best])spurious attractor 的存在提醒我们:能量低不等于语义正确。这个主题在现代模型里也反复出现,比如 likelihood 高但生成质量差、reward 高但行为不对。
Synchronous Updates and 2-Cycles
异步更新像 coordinate descent,所以能量单调下降。同步更新则同时改变许多坐标,能量下降证明不再成立。最简单的问题是 2-cycle:
s(t) -> s(t+1) -> s(t)
这时没有 fixed point,但系统周期震荡。Hopfield 原始模型强调异步更新,正是因为它让离散动力系统和能量函数严格对齐。
The classical Hopfield convergence proof assumes symmetric weights, no self-connections, and asynchronous updates. If these assumptions are changed, energy may no longer decrease monotonically.
Energy Trace as a Debugging Tool
实现 Hopfield recall 时,最直接的 debug 不是看最后结果,而是记录每次异步更新后的 energy:
\[ E(\mathbf{s}) = -\frac12\mathbf{s}^\top W\mathbf{s} +\boldsymbol{\theta}^\top\mathbf{s}. \]
如果权重对称、对角为零、异步更新且 tie rule 正确,energy trace 应该单调不增。
import numpy as np
def energy(w, state, theta=None):
if theta is None:
theta = np.zeros_like(state, dtype=float)
return float(-0.5 * state @ w @ state + theta @ state)
def is_fixed_point(w, state, theta=None):
if theta is None:
theta = np.zeros_like(state, dtype=float)
h = w @ state - theta
return bool(np.all(state * h >= 0))If energy increases during asynchronous recall, first check symmetry, diagonal entries, threshold sign, tie-breaking, and whether the implementation accidentally updated multiple bits at once.
Modern Hopfield and Attention
可以把 retrieval 写成一个连续版本。给定 stored patterns \(x_i\),query \(\xi\),现代 Hopfield update 可以像 softmax attention:
\[ \operatorname{retrieve}(\xi) = \sum_i \operatorname{softmax}_i(\beta x_i^\top \xi)x_i. \]
\(\beta\) 类似 inverse temperature。\(\beta\) 小时,多个 memory 平滑平均;\(\beta\) 大时,接近最近邻检索。这和 Transformer attention
\[ \operatorname{Attention}(q,K,V) = \operatorname{softmax}\left(\frac{qK^\top}{\sqrt{d}}\right)V \]
在形式上非常接近。
差别在于:
| Classical Hopfield | Modern attention-like retrieval |
|---|---|
| binary states | continuous vectors |
| explicit energy descent | differentiable readout |
| symmetric recurrent weights | learned Q/K/V projections |
| attractor convergence | one or few soft retrieval steps |
这个连接解释了为什么 ch0 的 classical neural computation 不是历史摆设:今天的 attention 可以看成 content-addressable memory 的现代可微版本。
Relation to Modern Deep Learning
Hopfield network 对今天仍然有三个启发:
| Classical Hopfield concept | Modern analogy | Technical lesson |
|---|---|---|
| Energy minimum | Learned representation attractor | Good models shape the geometry of state space |
| Hebbian outer product | Low-rank memory / key-value association | Memory can be stored as correlations |
| Crosstalk noise | Interference between examples or tasks | Capacity depends on geometry, not just parameter count |
| Asynchronous update | Coordinate descent / iterative refinement | Inference can itself be an optimization process |
现代 Hopfield network 与 attention 之间也有深层联系:如果把 stored patterns 看成 keys/values,把 current state 看成 query,那么 retrieval 可以写成 softmax attention 的形式。差别在于 classical Hopfield 使用离散 dynamics 和显式能量,而 Transformer attention 使用连续向量空间和可微参数化。
Minimal Implementation Sketch
import numpy as np
def train_hopfield(patterns: np.ndarray) -> np.ndarray:
"""patterns: shape [p, n], values in {-1, +1}."""
p, n = patterns.shape
w = patterns.T @ patterns / n
np.fill_diagonal(w, 0.0)
return w
def recall(w: np.ndarray, state: np.ndarray, steps: int) -> np.ndarray:
s = state.copy()
n = len(s)
for t in range(steps):
i = t % n
h = w[i] @ s
if h > 0:
s[i] = 1
elif h < 0:
s[i] = -1
return s这段代码刻意保持最小形式。真正做实验时,通常还需要随机更新顺序、多次 cue 测试、Hamming distance 曲线和 basin size 评估。
Diagnostic Recall Implementation
更适合实验的版本应该:
- 使用随机异步顺序;
- 在每个 sweep 后检查 fixed point;
- 记录 energy;
- 固定 random seed;
- 显式处理 \(h_i=0\)。
def sign_keep(h, old):
if h > 0:
return 1
if h < 0:
return -1
return old
def recall_async(w, state, *, max_sweeps, rng, theta=None):
s = state.copy()
n = len(s)
if theta is None:
theta = np.zeros(n, dtype=float)
trace = [energy(w, s, theta)]
for _ in range(max_sweeps):
changed = False
for i in rng.permutation(n):
h = float(w[i] @ s - theta[i])
new_value = sign_keep(h, s[i])
changed = changed or (new_value != s[i])
s[i] = new_value
trace.append(energy(w, s, theta))
if not changed or is_fixed_point(w, s, theta):
break
return s, trace调试时可以断言 energy 单调:
def assert_nonincreasing(xs, tol=1e-9):
for a, b in zip(xs, xs[1:]):
assert b <= a + tolCapacity Sweep Sketch
容量不是单个数字,而是一条随 \(\alpha=p/N\) 和 corruption ratio 变化的曲线。实验伪代码:
def random_patterns(p, n, rng):
return rng.choice(np.array([-1, 1]), size=(p, n))
def retrieval_trial(n, p, corruption, rng):
patterns = random_patterns(p, n, rng)
w = hebbian_weights(patterns)
target = int(rng.integers(p))
cue = corrupt(patterns[target], corruption, rng)
final, trace = recall_async(w, cue, max_sweeps=20, rng=rng)
overlaps = patterns @ final / n
predicted = int(np.argmax(overlaps))
return {
"success": predicted == target,
"target_overlap": float(overlaps[target]),
"max_overlap": float(overlaps[predicted]),
"energy_drop": trace[0] - trace[-1],
"sweeps": len(trace) - 1,
}结果表格可以按 load factor 聚合:
| \(\alpha\) | corruption | success | mean target overlap | mean sweeps |
|---|---|---|---|---|
| 0.02 | 0.10 | high | near 1 | few |
| 0.10 | 0.10 | medium-high | lower | more |
| 0.20 | 0.10 | unstable | much lower | variable |
这张表比一句 “capacity is \(0.138N\)” 更像实验科学:它同时展示 pattern 数、噪声强度、吸引域大小和收敛速度。
Implementation Checklist
实现和实验 Hopfield network 时检查:
- pattern 是否是 \(\{-1,+1\}\),不是
{0,1}; - \(W\) 是否对称;
- diagonal 是否清零;
- threshold 符号是否和 energy 定义一致;
- \(\operatorname{sgn}(0)\) tie rule 是否固定;
- 更新是否真正异步;
- energy trace 是否单调不增;
- fixed point 是否用 \(s_i h_i\ge0\) 检查;
- recall success 是 exact match、最大 overlap 还是 thresholded overlap;
- basin curve 是否跨 corruption ratios 和 random seeds;
- 是否统计 spurious attractor,而不只统计失败;
- capacity sweep 是否报告 load factor \(\alpha=p/N\)。