0.3 Hebbian Learning

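Hebbian learning is one of the earliest and most enduring learning principles in the history of neural networks. It is not a specific architecture but a family of local learning rules: the change in connection strength depends only on pre-synaptic activity, post-synaptic activity, and possibly a local modulatory signal.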

The Core Rule

Definition: Hebbian Learning

Hebbian learning updates the synaptic weight from neuron \(j\) to neuron \(i\) according to correlated activity: \[ \Delta w_{ij} = \eta\, y_i x_j, \] where \(x_j\) is pre-synaptic activity, \(y_i\) is post-synaptic activity, and \(\eta\) is the learning rate.

The best-known slogan is:

Neurons that fire together wire together.

In matrix form, if \(\mathbf{x}\) is the input and \(\mathbf{y}\) is the output, then

\[ \Delta W = \eta\, \mathbf{y}\mathbf{x}^{\top}. \]

This is the outer-product memory, and it is the basis of the Hopfield storage rule.
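As a small illustration of the matrix form, the sketch below applies one outer-product update in NumPy and checks that the change has rank 1. The function name `hebbian_outer_update` and all numbers are illustrative assumptions, not part of the original text.

```python
import numpy as np


def hebbian_outer_update(W, x, y, lr):
    """Return W + lr * y x^T for a single (x, y) pair."""
    return W + lr * np.outer(y, x)


W0 = np.zeros((2, 3))             # 2 post-synaptic units, 3 pre-synaptic units
x = np.array([1.0, 0.0, -1.0])    # pre-synaptic activity
y = np.array([2.0, 0.5])          # post-synaptic activity
W1 = hebbian_outer_update(W0, x, y, lr=0.1)
delta = W1 - W0                   # entry (i, j) equals lr * y[i] * x[j]
```

Each entry of `delta` depends only on the activity of the two units it connects, which is the locality property discussed in this section.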

Outer Product as Local Credit Assignment

What makes the Hebbian update distinctive is that it needs no backpropagated error signal. For synapse \(w_{ij}\), the update uses only:

pre-synaptic activity \(x_j\);
post-synaptic activity \(y_i\);
the learning rate or modulatory signal \(\eta\).

This is what makes it a local rule:

\[ \Delta w_{ij}=F(x_j,y_i,\eta). \]

Compare this with backprop, where the update for a linear layer \(y=Wx\) follows the negative gradient:

\[ \Delta W \propto -\frac{\partial \mathcal{L}}{\partial W} = -\delta x^\top, \]

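where \(\delta=\partial \mathcal{L}/\partial y\) is the error signal carried back from the global loss by the chain rule. The Hebbian rule puts \(y\) where backprop has the error term \(-\delta\): it strengthens connections that are already co-active, not the connections that would lower the loss.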

Pitfall: Hebbian Is Local, Not Necessarily Objective-Descending

Hebbian updates are local correlation updates. They do not automatically minimize a supervised loss unless extra constraints, objectives, or modulatory signals are added.

Why Plain Hebbian Learning Explodes

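If \(y_i\) and \(x_j\) keep the same sign over long periods, \(w_{ij}\) keeps increasing and nothing stops the weights from diverging. The plain Hebbian rule therefore needs normalization, decay, or competition.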

Pitfall: Correlation Alone Is Not Stable Learning

The update \(\Delta w_{ij}=\eta y_i x_j\) stores correlations, but without normalization it does not define a stable optimization procedure. Practical Hebbian rules require constraints such as weight decay, norm projection, or competition.

More formally, if \(y=w^\top x\), the plain Hebbian rule is:

\[ \Delta w=\eta yx=\eta xx^\top w. \]

Taking the expectation:

\[ \mathbb{E}[\Delta w] = \eta Cw, \qquad C=\mathbb{E}[xx^\top]. \]

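If the largest eigenvalue of \(C\) satisfies \(\lambda_1>0\), the weights grow exponentially along the leading eigenvector. In the continuous-time approximation, with \(\eta\) absorbed into the time scale: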

\[ \frac{dw}{dt}=Cw. \]

If \(w(0)=\sum_i a_i v_i\), where \(Cv_i=\lambda_i v_i\), then:

\[ w(t)=\sum_i a_i e^{\lambda_i t}v_i. \]

The direction with the largest \(\lambda_i\) comes to dominate, but the norm diverges as well. This is why Oja’s rule adds a normalization term.

Proof

Because \(C\) is symmetric positive semidefinite, it has an orthogonal eigendecomposition:

\[ C=V\Lambda V^\top. \]

The solution of the continuous dynamical system \(dw/dt=Cw\) is:

\[ w(t)=e^{Ct}w(0)=Ve^{\Lambda t}V^\top w(0). \]

Expanding in the eigenvector basis gives

\[ w(t)=\sum_i a_i e^{\lambda_i t}v_i. \]

Whenever some \(\lambda_i>0\) has \(a_i\neq0\), the norm grows without bound, and the direction with the largest eigenvalue grows fastest.
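The closed-form solution can be checked numerically. The sketch below integrates \(dw/dt=Cw\) with a plain Euler scheme and compares the result with the eigenvector expansion; the covariance matrix, starting point, and step size are illustrative assumptions.

```python
import numpy as np

C = np.array([[4.0, 1.2], [1.2, 1.0]])     # illustrative covariance (assumption)
eigvals, eigvecs = np.linalg.eigh(C)       # eigenvalues in ascending order
w0 = np.array([0.6, -0.8])                 # unit-norm starting weight
a = eigvecs.T @ w0                         # coefficients a_i in the eigenbasis


def w_closed_form(t):
    """w(t) = sum_i a_i exp(lambda_i t) v_i."""
    return eigvecs @ (np.exp(eigvals * t) * a)


# Euler integration of dw/dt = C w up to t = 1
dt, steps = 1e-4, 10000
w = w0.copy()
for _ in range(steps):
    w = w + dt * (C @ w)

w_exact = w_closed_form(dt * steps)
rel_err = np.linalg.norm(w - w_exact) / np.linalg.norm(w_exact)
norm_growth = np.linalg.norm(w_exact) / np.linalg.norm(w0)
alignment = abs(w_exact @ eigvecs[:, -1]) / np.linalg.norm(w_exact)
```

Both effects described above show up together: `alignment` approaches 1 because the leading direction dominates, while `norm_growth` keeps increasing with \(t\).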

Oja’s Rule

Oja’s rule is the bridge between Hebbian learning and PCA. Let the output of a single linear neuron be

\[ y = \mathbf{w}^{\top}\mathbf{x}. \]

Oja’s rule reads

\[ \Delta \mathbf{w} = \eta\, y(\mathbf{x}-y\mathbf{w}). \]

It is the Hebbian strengthening term \(\eta y\mathbf{x}\) plus a stabilizing term \(-\eta y^2\mathbf{w}\).

The stabilizing term can be derived from renormalizing after every Hebbian update. First take a plain Hebbian step:

\[ w'=w+\eta yx. \]

Then project \(w'\) back onto the unit sphere:

\[ w_{\text{new}} = \frac{w'}{\|w'\|}. \]

If \(\|w\|=1\) and \(\eta\) is small,

\[ \|w+\eta yx\|^2 = 1+2\eta y w^\top x+O(\eta^2) = 1+2\eta y^2+O(\eta^2). \]

Therefore:

\[ \|w+\eta yx\| \approx 1+\eta y^2. \]

To first order:

\[ w_{\text{new}} \approx (w+\eta yx)(1-\eta y^2) = w+\eta yx-\eta y^2w+O(\eta^2). \]

So:

\[ \Delta w \approx \eta y(x-yw), \]

which is exactly Oja’s rule.
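The first-order argument predicts that a normalized Hebbian step and an Oja step differ only at order \(\eta^2\). The sketch below checks this scaling numerically: shrinking the learning rate by 10 should shrink the gap by roughly 100. Function names and the random test vectors are illustrative assumptions.

```python
import numpy as np


def normalized_hebb_step(w, x, lr):
    """Plain Hebbian step followed by projection onto the unit sphere."""
    w_prime = w + lr * (w @ x) * x
    return w_prime / np.linalg.norm(w_prime)


def oja_step_first_order(w, x, lr):
    """Oja update without any explicit renormalization."""
    y = w @ x
    return w + lr * y * (x - y * w)


rng = np.random.default_rng(0)
w = rng.normal(size=5)
w = w / np.linalg.norm(w)          # the derivation assumes ||w|| = 1
x = rng.normal(size=5)

lrs = (1e-2, 1e-3)
gaps = [
    np.linalg.norm(normalized_hebb_step(w, x, lr) - oja_step_first_order(w, x, lr))
    for lr in lrs
]
ratio = gaps[0] / gaps[1]          # close to 100 if the gap is O(lr^2)
```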

Theorem: Oja’s Rule Learns the First Principal Component

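Under a small learning rate and zero-mean data, Oja’s rule converges to the leading eigenvector of the input covariance matrix, up to sign, provided the largest eigenvalue is simple and the initial weight is not orthogonal to that eigenvector.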

Proof Sketch

Take the expectation and let \(C=\mathbb{E}[\mathbf{x}\mathbf{x}^{\top}]\):

\[ \mathbb{E}[\Delta\mathbf{w}] = \eta \left( C\mathbf{w} - (\mathbf{w}^{\top}C\mathbf{w})\mathbf{w} \right). \]

Fixed points satisfy

\[ C\mathbf{w} = (\mathbf{w}^{\top}C\mathbf{w})\mathbf{w}, \]

that is, a nonzero fixed point \(\mathbf{w}\) is an eigenvector of the covariance matrix. Because the dynamics ascend the Rayleigh quotient, the stable fixed points are the eigenvectors of the largest eigenvalue.

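This shows that early neural computation was not mystical neuron simulation; it was doing concrete statistical learning: extracting the direction of maximum variance.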

Norm Stability

The continuous expected form of Oja’s rule is:

\[ \frac{dw}{dt} = Cw-(w^\top Cw)w. \]

Look at how the norm changes:

\[ \frac{d}{dt}\|w\|^2 = 2w^\top\frac{dw}{dt} = 2w^\top Cw(1-\|w\|^2). \]

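Assuming \(w^\top Cw>0\): if \(\|w\|<1\) the norm grows; if \(\|w\|>1\) it shrinks; \(\|w\|=1\) is stable. In other words, Oja’s rule does not merely add a decay term: it pulls the weight norm toward the unit sphere automatically.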

Proof

Substitute the Oja dynamics:

\[ \frac{d}{dt}\|w\|^2 = 2w^\top(Cw-(w^\top Cw)w). \]

The first term is \(2w^\top Cw\). The second term is:

\[ 2(w^\top Cw)(w^\top w) = 2(w^\top Cw)\|w\|^2. \]

Subtracting gives:

\[ 2w^\top Cw(1-\|w\|^2). \]
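The norm equation says that trajectories starting inside and outside the unit sphere should both end on it. The sketch below integrates the expected Oja dynamics with an Euler scheme from two starting norms; the covariance matrix, step size, and starting points are illustrative assumptions.

```python
import numpy as np

C = np.array([[4.0, 1.2], [1.2, 1.0]])   # illustrative covariance (assumption)


def integrate_oja(w0, dt=1e-3, steps=20000):
    """Euler integration of dw/dt = C w - (w^T C w) w."""
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w = w + dt * (C @ w - (w @ C @ w) * w)
    return w


w_small = integrate_oja([0.3, 0.4])      # starts with norm 0.5
w_large = integrate_oja([1.2, 1.6])      # starts with norm 2.0
norm_small = np.linalg.norm(w_small)
norm_large = np.linalg.norm(w_large)

pc1 = np.linalg.eigh(C)[1][:, -1]        # leading eigenvector of C
alignment_small = abs(w_small @ pc1) / norm_small
```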

Oja’s Rule as Constrained Optimization

The first principal component of PCA can be written as:

\[ \max_{\|\mathbf{w}\|_2=1} J(\mathbf{w}) = \frac12\mathbb{E}[(\mathbf{w}^\top\mathbf{x})^2] = \frac12\mathbf{w}^\top C\mathbf{w}. \]

The Lagrangian of the constrained problem is:

\[ \mathcal{L}(\mathbf{w},\lambda) = \frac12\mathbf{w}^\top C\mathbf{w} - \frac{\lambda}{2}(\mathbf{w}^\top\mathbf{w}-1). \]

First-order condition:

\[ C\mathbf{w}=\lambda\mathbf{w}. \]

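So the stationary points are the covariance eigenvectors. The expected form of Oja’s rule,

\[ \Delta\mathbf{w} \propto C\mathbf{w} - (\mathbf{w}^\top C\mathbf{w})\mathbf{w}, \]

can be read as gradient ascent on the unit sphere: the first term increases the projected variance, and the second term pulls the norm back.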
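One way to make the "gradient ascent on the sphere" reading concrete: for unit-norm \(w\), the Oja drift equals the Euclidean gradient \(Cw\) of \(\frac12 w^\top Cw\) projected onto the tangent space of the sphere by \(I-ww^\top\). The sketch below verifies this identity on a random positive semidefinite matrix, which is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
C = A @ A.T                               # random symmetric PSD matrix
w = rng.normal(size=4)
w = w / np.linalg.norm(w)                 # point on the unit sphere

euclid_grad = C @ w                       # gradient of 0.5 * w^T C w
tangent_grad = (np.eye(4) - np.outer(w, w)) @ euclid_grad
oja_drift = C @ w - (w @ C @ w) * w

radial_component = float(w @ oja_drift)   # zero: the drift is tangent to the sphere
```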

Definition: Local Learning Rule

A local learning rule updates a synapse using only variables available at or near that synapse, such as pre-synaptic activity, post-synaptic activity, and local modulatory signals.

The appeal of Oja’s rule is that it backpropagates no global loss, yet it achieves a clear statistical objective.

Stochastic and Mini-Batch Oja

The derivation above uses the expectation \(\mathbb{E}[xx^\top]\). Real online learning only sees a stream of samples. The single-sample Oja update is:

\[ y_t=w_t^\top x_t, \qquad w_{t+1} = w_t+\eta_t y_t(x_t-y_tw_t). \]

It is a stochastic approximation of the expected dynamical system

\[ \frac{dw}{dt} = Cw-(w^\top Cw)w. \]

The noise comes from estimating \(C\) with \(x_tx_t^\top\). The mini-batch version uses the batch covariance as the estimate:

\[ \hat C_B = \frac1B\sum_{b=1}^{B}x_bx_b^\top. \]

so that

\[ \Delta w = \eta \left( \hat C_Bw - (w^\top\hat C_Bw)w \right). \]

The same update written as matrix code:

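def oja_batch_step(w, x, lr):
    # x: [B, D], w: [D]; written to work with NumPy arrays or torch tensors
    y = x @ w                            # [B]
    hebb = (y[:, None] * x).mean(0)      # batch estimate of E[y x]
    decay = (y ** 2).mean() * w          # batch estimate of E[y^2] w
    w = w + lr * (hebb - decay)
    return w / ((w @ w) ** 0.5 + 1e-12)  # explicit renormalization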

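The explicit normalization at the end is not a required part of Oja’s rule, but it is useful at finite precision and with larger learning rates: it pulls the weights back from norm drift caused by numerical error or an occasional outlier batch.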

Definition: Stochastic Approximation

Stochastic approximation replaces an expected update by noisy sample-based updates whose expectation follows the target dynamical system.

Learning-Rate Stability

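Oja’s rule needs a small learning rate. The discrete update shows why. If the current \(w\) is already close to the first principal component \(v_1\), the output scale is roughly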

\[ y^2\approx \lambda_1. \]

The stabilizing term in the update is then about \(\eta\lambda_1 w\). If

\[ \eta\lambda_1 \gtrsim 1, \]

a single update can overcorrect, causing oscillation or a norm spike. In practice a decreasing step size is common:

\[ \eta_t = \frac{\eta_0}{1+t/\tau}, \]

or one starts with a small constant learning rate to confirm the convergence trend and then lowers it gradually.
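A minimal sketch of the decreasing schedule \(\eta_t=\eta_0/(1+t/\tau)\) applied to the single-sample Oja update, without explicit renormalization. The values of `eta0` and `tau`, the data covariance, and the helper names are illustrative assumptions chosen so that \(\eta_0\lambda_1\) stays well below 1.

```python
import numpy as np


def step_size(t, eta0, tau):
    """Decaying learning rate eta_t = eta0 / (1 + t / tau)."""
    return eta0 / (1.0 + t / tau)


def run_oja_decay(x, eta0=5e-3, tau=2000.0, seed=0):
    """Single-sample Oja updates over the rows of x with a decaying step."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=x.shape[1])
    w = w / np.linalg.norm(w)
    for t, sample in enumerate(x):
        y = w @ sample
        w = w + step_size(t, eta0, tau) * y * (sample - y * w)
    return w


rng = np.random.default_rng(1)
cov = np.array([[4.0, 1.2], [1.2, 1.0]])
x = rng.multivariate_normal([0.0, 0.0], cov, size=10000)

w = run_oja_decay(x)
pc1 = np.linalg.eigh(cov)[1][:, -1]
final_norm = np.linalg.norm(w)
alignment = abs(w @ pc1) / final_norm
```

With this schedule the early steps move quickly toward the principal direction and the later, smaller steps reduce the residual jitter around it.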

Pitfall: Oja Is Stable in Theory, Not for Arbitrary Step Size

The continuous Oja dynamics stabilizes the norm, but the discrete stochastic update can still diverge when the learning rate is too large relative to the top covariance eigenvalue.

Centering and Running Statistics

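If the input mean is not zero, Oja’s rule learns the leading direction of the second moment rather than of the covariance. An online system cannot always compute the full-data mean first, but it can maintain a running mean: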

\[ \mu_t = (1-\alpha)\mu_{t-1}+\alpha x_t, \qquad \tilde x_t=x_t-\mu_t. \]

and then run the Oja update on \(\tilde x_t\). A mini-batch can also use the batch mean, but that introduces a dependence on the batch:

def center_batch(x, running_mean=None, momentum=0.01):
    batch_mean = x.mean(axis=0)
    if running_mean is None:
        running_mean = batch_mean
    else:
        running_mean = (1 - momentum) * running_mean + momentum * batch_mean
    return x - running_mean, running_mean

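This has the same engineering flavor as BatchNorm’s running statistics: the statistics are themselves model state, and you must decide whether they come from the training set, the batch, the stream, or a fixed calibration window.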

Hebbian Rule as Matrix Factorization

Consider a set of patterns \(\mathbf{x}^{(n)}\). Accumulating the Hebbian updates gives

\[ W = \eta\sum_{n}\mathbf{y}^{(n)}{\mathbf{x}^{(n)}}^\top. \]

If \(\mathbf{y}^{(n)}=\mathbf{x}^{(n)}\), and the patterns are stacked as the rows of \(X\), then

\[ W = \eta X^\top X, \]

which records the second-order correlations between input dimensions. The outline of many modern methods is already visible here.

If the data are not centered, \(X^\top X\) records the second moment (up to a factor of the sample count):

\[ \mathbb{E}[xx^\top] = \operatorname{Cov}(x)+\mu\mu^\top. \]

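where \(\mu=\mathbb{E}[x]\). The Hebbian rule is therefore sensitive to the mean: if the input has a strong DC component, the first principal direction may be just “average brightness” or “common background” rather than a meaningful direction of variation.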

The centered Hebbian rule uses:

\[ \Delta W = \eta (y-\bar{y})(x-\bar{x})^\top. \]

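This is closer to covariance learning. The intuitions behind PCA, whitening, and batch normalization all relate to removing the mean first and then learning the correlation structure.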
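The decomposition of the second moment into covariance plus \(\mu\mu^\top\) can be checked directly. In the sketch below the synthetic data has a large mean along a low-variance axis (an illustrative assumption), so the leading direction of the raw second moment follows the mean while the leading direction of the covariance follows the actual variation.

```python
import numpy as np

rng = np.random.default_rng(0)
scales = np.array([0.2, 1.0, 0.5])        # per-axis standard deviations
offset = np.array([5.0, 0.0, 0.0])        # strong DC component on axis 0
x = rng.normal(size=(2000, 3)) * scales + offset

mu = x.mean(axis=0)
second_moment = x.T @ x / len(x)                    # empirical E[x x^T]
cov = (x - mu).T @ (x - mu) / len(x)                # empirical covariance (1/N)

top_raw = np.linalg.eigh(second_moment)[1][:, -1]   # dominated by the mean
top_centered = np.linalg.eigh(cov)[1][:, -1]        # dominated by variance
```

Here `top_raw` points almost entirely along axis 0, where the mean lives, while `top_centered` points along axis 1, where the variance is largest.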

Hebbian object	Modern view
\(\mathbf{y}\mathbf{x}^{\top}\)	Rank-1 update
\(\sum_n \mathbf{x}^{(n)}{\mathbf{x}^{(n)}}^\top\)	Empirical covariance
Competition + normalization	PCA / whitening
Local update	Biologically plausible credit assignment

Whitening and Decorrelation

PCA keeps only the directions; whitening additionally rescales each principal component to unit variance. Let the centered covariance decompose as

\[ C = U\Lambda U^\top, \qquad \Lambda=\operatorname{diag}(\lambda_1,\ldots,\lambda_d). \]

The PCA coordinates are

\[ z=U^\top x. \]

Their covariance is

\[ \mathbb{E}[zz^\top] = U^\top C U = \Lambda. \]

Whitening then multiplies by \(\Lambda^{-1/2}\):

\[ \tilde z = \Lambda^{-1/2}U^\top x. \]

so that

\[ \mathbb{E}[\tilde z\tilde z^\top] = I. \]

Definition: Whitening

Whitening transforms centered data so that its covariance matrix is approximately the identity matrix.

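This explains how Hebbian/Sanger/Oja rules relate to normalization: the Hebbian rule learns correlation structure, the competition term decorrelates, and the normalization term controls scale. BatchNorm, LayerNorm, and RMSNorm in modern networks are not equivalent to whitening, but they all address the same problem: the scale and correlation structure of representations affect learning dynamics.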

A minimal whitening check:

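import numpy as np


def whitening_matrix(x, eps=1e-5):
    # x: [N, D] data. Returns the symmetric (ZCA) whitener
    # U diag(1 / sqrt(lambda + eps)) U^T, a rotated form of Lambda^{-1/2} U^T.
    x = x - x.mean(axis=0, keepdims=True)
    cov = x.T @ x / max(len(x) - 1, 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    inv_sqrt = np.diag(1.0 / np.sqrt(eigvals + eps))
    return eigvecs @ inv_sqrt @ eigvecs.T


# x is assumed to be an [N, D] array supplied by the caller
wmat = whitening_matrix(x)
z = (x - x.mean(axis=0, keepdims=True)) @ wmat
cov_z = z.T @ z / (len(z) - 1)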

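If the diagonal of cov_z is close to 1 and the off-diagonal entries are close to 0, whitening has approximately succeeded. When some eigenvalues are very small, eps determines whether noise gets amplified excessively.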

Pitfall: Whitening Can Amplify Low-Variance Noise

Dividing by small eigenvalues can turn tiny noisy directions into large features. Practical whitening needs an epsilon, cutoff, or dimensionality reduction.

Sanger’s Rule and Multiple Components

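A single Oja neuron learns the first principal component. Learning several components requires competition and decorrelation. In Sanger’s rule, also called the Generalized Hebbian Algorithm (GHA), the \(i\)-th output neuron computes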

\[ y_i=\mathbf{w}_i^\top\mathbf{x}, \]

and updates

\[ \Delta \mathbf{w}_i = \eta y_i \left( \mathbf{x} - \sum_{j\le i}y_j\mathbf{w}_j \right). \]

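The earlier components are subtracted from the input, so later neurons learn the directions of the remaining variance. This is similar in spirit to Gram-Schmidt orthogonalization.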

If all outputs are written as

\[ y=Wx, \qquad W\in\mathbb{R}^{k\times d}, \]

the matrix form of GHA is:

\[ \Delta W = \eta \left( yx^\top - \operatorname{LT}(yy^\top)W \right), \]

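where \(\operatorname{LT}\) keeps the lower-triangular part, including the diagonal. The lower-triangular structure ensures that the \(i\)-th component is suppressed only by itself and by the components already learned before it, which is why the principal components come out ordered.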

Definition: Anti-Hebbian Competition

Anti-Hebbian competition weakens correlations between competing units, often using terms like \(-y_i y_j w_j\) to decorrelate or orthogonalize learned features.

If the decorrelation term uses the full \(yy^\top W\), the rule behaves more like symmetric subspace learning; if it uses only \(j\leq i\), it yields ordered PCA.
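The per-neuron sum over \(j\le i\) and the lower-triangular matrix form describe the same update. The sketch below implements both for a single sample and compares them; the function names and random inputs are illustrative assumptions.

```python
import numpy as np


def gha_update_loop(W, x, lr):
    """Per-neuron Sanger update: dW[i] = lr * y_i * (x - sum_{j<=i} y_j w_j)."""
    y = W @ x
    dW = np.zeros_like(W)
    for i in range(W.shape[0]):
        recon = y[: i + 1] @ W[: i + 1]       # sum over j <= i of y_j * w_j
        dW[i] = lr * y[i] * (x - recon)
    return dW


def gha_update_matrix(W, x, lr):
    """Matrix form: dW = lr * (y x^T - LT(y y^T) W), LT includes the diagonal."""
    y = W @ x
    return lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)


rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))                   # K = 3 components, D = 5 inputs
x = rng.normal(size=5)
dW_loop = gha_update_loop(W, x, lr=0.1)
dW_matrix = gha_update_matrix(W, x, lr=0.1)
```

Replacing `np.tril(...)` with the full matrix would give the symmetric variant, in which every row is also suppressed by the rows after it.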

Rule	Learns	Stabilizer
Hebb	correlation	none
Oja	first PC	self-normalization
Sanger/GHA	ordered PCs	subtract previous components
Competitive Hebbian	prototypes/features	winner-take-all competition

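So Hebbian learning is more than the slogan “strengthen what correlates”; with different constraints it can implement different learning objectives such as PCA, clustering, and sparse coding.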

Batch GHA Implementation

For \(k\) components, let

\[ Y=XW^\top, \qquad X\in\mathbb{R}^{B\times d}, \qquad W\in\mathbb{R}^{k\times d}. \]

The batch Hebbian term is

\[ \frac1B Y^\top X \]

Row by row, this is each component’s \(E[y_i x]\). The competition term needs the lower triangle of \(Y^\top Y\):

\[ G_y = \frac1B Y^\top Y \in\mathbb{R}^{k\times k}. \]

Then keep only the lower triangle:

\[ L=\operatorname{LT}(G_y). \]

In code this is more direct:

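import numpy as np


def gha_batch_step(W, X, lr):
    # W: [K, D], X: [B, D]
    Y = X @ W.T                         # [B, K]
    hebb = Y.T @ X / X.shape[0]          # [K, D], batch estimate of E[y x^T]
    yy = Y.T @ Y / X.shape[0]            # [K, K], batch estimate of E[y y^T]
    lower = np.tril(yy)                  # keep j <= i, diagonal included
    anti = lower @ W                     # [K, D]
    W = W + lr * (hebb - anti)
    # Row renormalization is a numerical safeguard, not part of GHA itself
    W = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-12)
    return W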

np.tril is the heart of Sanger’s rule here: the \(i\)-th component is constrained by itself and the earlier components, and is unaffected by later components that have not yet stabilized.

Orthogonality Diagnostics

Multi-component learning cannot be judged by row norms alone. Check at least two matrices:

\[ WW^\top \approx I_k \]

and the projected covariance

\[ WCW^\top \approx \operatorname{diag}(\lambda_1,\ldots,\lambda_k). \]

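The first checks whether the components are orthonormal; the second checks whether they really align with the principal directions and are decorrelated. A simple diagnostic: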

import numpy as np


def gha_diagnostics(W, X):
    # W: [K, D] learned components, X: [N, D] data
    Xc = X - X.mean(axis=0, keepdims=True)
    cov = Xc.T @ Xc / max(len(Xc) - 1, 1)
    gram = W @ W.T
    proj_cov = W @ cov @ W.T
    offdiag_gram = gram - np.diag(np.diag(gram))
    offdiag_cov = proj_cov - np.diag(np.diag(proj_cov))
    return {
        "max_weight_offdiag": float(np.abs(offdiag_gram).max()),
        "max_projected_cov_offdiag": float(np.abs(offdiag_cov).max()),
        "component_variances": np.diag(proj_cov),
    }

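If max_weight_offdiag is large, the components have not become an orthogonal basis; if the off-diagonal of the projected covariance is large, the representation is still correlated; if the component variances are in the wrong order, the learning rate, the initialization, or the lower-triangular competition may be implemented incorrectly.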

Pitfall: Multiple Hebbian Units Can Collapse

Without competition or orthogonality pressure, multiple Hebbian units may all learn the same top component instead of spreading across principal directions.

Spike-Timing Dependent Plasticity

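In more biologically realistic models, learning depends not only on correlation but also on spike timing. If the pre-synaptic spike precedes the post-synaptic spike, the connection is strengthened; in the reverse order it is weakened: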

\[ \Delta w(\Delta t) = \begin{cases} A_+\exp(-\Delta t/\tau_+), & \Delta t>0,\\ -A_-\exp(\Delta t/\tau_-), & \Delta t<0. \end{cases} \]

where \(\Delta t=t_{\text{post}}-t_{\text{pre}}\).

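The intuition behind STDP is causal credit assignment: if the input neuron fires first and the output neuron fires afterward, that input looks more like a cause of the output.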
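The STDP window can be written as a small vectorized function. This is a sketch under stated assumptions: the amplitudes and the 20 ms time constants are placeholder values, and returning 0 at \(\Delta t=0\) is a convention chosen here because the formula leaves that case undefined.

```python
import numpy as np


def stdp_window(dt, a_plus=1.0, a_minus=1.0, tau_plus=20.0, tau_minus=20.0):
    """Weight change as a function of dt = t_post - t_pre (same units as tau)."""
    dt = np.asarray(dt, dtype=float)
    potentiation = a_plus * np.exp(-np.abs(dt) / tau_plus)     # used when dt > 0
    depression = -a_minus * np.exp(-np.abs(dt) / tau_minus)    # used when dt < 0
    return np.where(dt > 0, potentiation, np.where(dt < 0, depression, 0.0))


example = stdp_window(np.array([-10.0, 0.0, 10.0]))   # [depression, 0, potentiation]
```

Pre-before-post pairs give positive changes that decay with the delay; post-before-pre pairs give negative changes.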

BCM Rule and Sliding Threshold

BCM theory introduces an activity-dependent threshold. If the postsynaptic activity is \(y\), the update can be written as

\[ \Delta w_i = \eta x_i y(y-\theta_M), \]

where \(\theta_M\) is a sliding threshold that tracks long-term activity, for example

\[ \theta_M=\mathbb{E}[y^2]. \]

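For positive input \(x_i\), the synapse is strengthened when \(y>\theta_M\) and weakened when \(0<y<\theta_M\). Compared with plain Hebb this adds a homeostatic mechanism: the neuron cannot strengthen all of its inputs without limit; it adjusts the threshold for strengthening according to its own long-term activity.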

In dynamic form:

\[ \tau_\theta\frac{d\theta_M}{dt} = y^2-\theta_M. \]

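This makes the threshold a slow moving average of the squared postsynaptic activity. If the neuron is overactive for a long time, \(\theta_M\) rises and strengthening becomes harder; if it is silent for a long time, \(\theta_M\) falls and strengthening becomes easier.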
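A minimal sketch of one discrete BCM step, combining the weight update with an Euler step of the threshold dynamics. The function name `bcm_step`, the parameter `dt_over_tau` (standing for \(dt/\tau_\theta\)), and the numbers are illustrative assumptions.

```python
import numpy as np


def bcm_step(w, theta, x, lr, dt_over_tau):
    """One BCM update: fast weight change plus slow threshold tracking of y^2."""
    y = float(w @ x)
    w_new = w + lr * x * y * (y - theta)             # sign set by y - theta
    theta_new = theta + dt_over_tau * (y ** 2 - theta)
    return w_new, theta_new, y


w = np.array([1.0, 1.0])
x = np.array([1.0, 1.0])                             # positive input, y = 2

# Threshold below the activity: potentiation
w_up, theta_up, y = bcm_step(w, theta=1.0, x=x, lr=0.1, dt_over_tau=0.1)
# Threshold above the activity: depression
w_down, theta_down, _ = bcm_step(w, theta=3.0, x=x, lr=0.1, dt_over_tau=0.1)
```

In both cases the threshold moves toward \(y^2=4\), which is what eventually makes further potentiation harder for a persistently active neuron.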

The key to BCM is not a single formula but two time scales:

Variable	Timescale	Role
\(w\)	fast	stores current correlations
\(\theta_M\)	slow	stabilizes long-term activity

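This idea of fast plasticity plus slow homeostasis recurs in many mechanisms for stable training: fast gradient updates need slow variables to constrain them, such as optimizer moments, normalization statistics, a KL penalty, or a moving baseline.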

Definition: Homeostatic Plasticity

Homeostatic plasticity refers to mechanisms that stabilize neural activity by adapting thresholds, gains, or synaptic strengths to keep activity within a functional range.

Connection to Backpropagation

Hebbian learning is local, whereas backpropagation is global credit assignment. The differences can be summarized in a table:

Property	Hebbian learning	Backpropagation
Signal source	Local activity	Global loss gradient
Update form	Correlation-based	Chain-rule derivative
Biological plausibility	Higher	Lower
Optimization target	Often implicit	Explicit objective
Deep networks	Hard without extra mechanisms	Standard method

But this does not mean Hebbian learning is obsolete. Modern self-supervised learning, contrastive learning, attention memory, and the low-rank updates of adapters/LoRA all echo the idea of writing knowledge in through correlation structure.

Hebbian View of Contrastive Learning

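Contrastive learning typically pulls matched pairs together and pushes mismatched pairs apart. If \(q\) is the query representation and \(k^+\) is the positive key, an InfoNCE gradient step increases the similarity between \(q\) and \(k^+\) and decreases the similarity to the negatives. Abstractly: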

\[ \Delta W \sim q{k^+}^\top - \sum_j p_j q{k_j^-}^\top. \]

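The first term is a Hebbian-like positive correlation and the second is anti-Hebbian competition. Inside the complex objectives of modern self-supervised learning there is still a skeleton of correlation strengthening plus competitive normalization.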

Hebbian Updates and Low-Rank Adaptation

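The Hebbian outer product

\[ \Delta W=\eta \mathbf{y}\mathbf{x}^\top \]

is a rank-1 update. Accumulating \(r\) samples,

\[ \Delta W = \sum_{n=1}^{r}\eta_n \mathbf{y}^{(n)}{\mathbf{x}^{(n)}}^\top \]

has rank at most \(r\). Low-rank adaptation uses the form

\[ \Delta W=BA \]

which is structurally related: knowledge or task changes can be written into the weights through low-rank correlation structure. LoRA, of course, is a parameterized low-rank update learned by backprop, not a local Hebbian rule; but both make the point that the matrix does not need a full-rank rewrite every time.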
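The structural link can be made explicit: a sum of \(r\) outer products is exactly a product \(BA\) with the post-synaptic vectors as the columns of \(B\) and the pre-synaptic vectors as the rows of \(A\). The sketch below checks this with random vectors; the dimensions are illustrative assumptions, and nothing here is learned by backprop as in LoRA.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 8, 3
Y = rng.normal(size=(r, d_out))      # r post-synaptic patterns
X = rng.normal(size=(r, d_in))       # r pre-synaptic patterns
etas = np.array([0.1, 0.2, 0.3])     # per-sample learning rates

# Accumulated Hebbian update: sum of r rank-1 outer products
delta_W = sum(etas[n] * np.outer(Y[n], X[n]) for n in range(r))

# Same matrix in factored form Delta W = B A
B = (etas[:, None] * Y).T            # [d_out, r]
A = X                                # [r, d_in]
rank = np.linalg.matrix_rank(delta_W)
```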

Minimal Experiment

import numpy as np


def oja_step(w: np.ndarray, x: np.ndarray, lr: float) -> np.ndarray:
    y = float(w @ x)
    w = w + lr * y * (x - y * w)
    return w / (np.linalg.norm(w) + 1e-12)

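Calling this step repeatedly on two-dimensional Gaussian data gradually aligns \(\mathbf{w}\) with the direction of maximum variance. It is the smallest runnable example of a neuron learning statistical structure.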

A more complete, reproducible experiment:

import numpy as np


def make_data(n: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    cov = np.array([[4.0, 1.2], [1.2, 1.0]])
    x = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
    return x


def train_oja(x: np.ndarray, steps: int, lr: float, seed: int = 1) -> np.ndarray:
    rng = np.random.default_rng(seed)
    w = rng.normal(size=x.shape[1])
    w = w / np.linalg.norm(w)
    for t in range(steps):
        sample = x[t % len(x)]
        w = oja_step(w, sample, lr)
    return w


x = make_data(4096)
w = train_oja(x, steps=20000, lr=1e-3)
eigvals, eigvecs = np.linalg.eigh(np.cov(x.T))
pc1 = eigvecs[:, np.argmax(eigvals)]
alignment = abs(float(w @ pc1))
print(alignment)

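The closer alignment is to 1, the closer the Oja neuron is to the first principal component. The experiment is also a reminder that many Hebbian rules are not meant to replace backprop but to show how local learning can achieve a well-defined statistical objective.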

Hebbian Smoke Tests

These tests need no deep learning framework; NumPy alone catches most implementation errors.

Test 1: Plain Hebbian Norm Grows

Without normalization, a Hebbian update on positively correlated data should make the norm grow:

def plain_hebb_step(w, x, lr):
    y = float(w @ x)
    return w + lr * y * x


w0 = np.array([1.0, 0.0])
x = np.array([1.0, 0.0])
w1 = plain_hebb_step(w0, x, lr=0.1)
assert np.linalg.norm(w1) > np.linalg.norm(w0)

This test confirms that you implemented correlation reinforcement and did not flip the sign by mistake.

Test 2: Oja Norm Stays Near One

rng = np.random.default_rng(0)
x = rng.normal(size=(1024, 2))
w = rng.normal(size=2)
w = w / np.linalg.norm(w)
for sample in x:
    w = oja_step(w, sample, lr=1e-3)
assert abs(np.linalg.norm(w) - 1.0) < 1e-2

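If the norm keeps growing, the \(-y^2w\) term is usually missing; if the norm collapses quickly, the learning rate, a sign, or the centering may be wrong. Note that oja_step already renormalizes explicitly, so this assertion only exercises the self-normalizing term when that final division is removed.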
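To test the self-normalizing term on its own, one option is a variant of the step without the final division. The sketch below starts from norm 0.5 and checks that the norm is drawn to 1; `oja_step_raw`, the isotropic data, and the learning rate are illustrative assumptions.

```python
import numpy as np


def oja_step_raw(w, x, lr):
    """Oja update with no explicit renormalization."""
    y = float(w @ x)
    return w + lr * y * (x - y * w)


rng = np.random.default_rng(0)
x = rng.normal(size=(20000, 2))          # isotropic, unit-variance samples
w = np.array([0.5, 0.0])                 # deliberately start off the unit sphere
initial_norm = np.linalg.norm(w)
for sample in x:
    w = oja_step_raw(w, sample, lr=1e-2)
final_norm = np.linalg.norm(w)
```

Dropping the \(-y^2w\) term from `oja_step_raw` turns this into plain Hebbian learning, and the same check then fails because the norm grows without bound.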

Test 3: Oja Aligns With PCA

x = make_data(4096, seed=0)
w = train_oja(x, steps=20000, lr=1e-3, seed=1)
eigvals, eigvecs = np.linalg.eigh(np.cov(x.T))
pc1 = eigvecs[:, np.argmax(eigvals)]
alignment = abs(float(w @ pc1))
assert alignment > 0.95

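This test turns “learned the first principal component” into a checkable target. The threshold is not sacred; it should be adjusted to the amount of data, the learning rate, and the number of steps.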

Test 4: GHA Learns Orthogonal Components

x = make_data(4096, seed=2)
rng = np.random.default_rng(3)
W = rng.normal(size=(2, 2))
W = W / np.linalg.norm(W, axis=1, keepdims=True)
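# lr * steps should be many multiples of 1 / lambda_2 (about 1.7 for this covariance),
# otherwise the second row may not have settled when the assertion runs
for _ in range(2000):
    idx = rng.choice(len(x), size=64, replace=False)
    W = gha_batch_step(W, x[idx], lr=1e-2)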

diag = gha_diagnostics(W, x)
assert diag["max_weight_offdiag"] < 0.2

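If the two components collapse onto the same direction, this test fails. It checks geometric structure rather than exact values for a particular random seed. The second row settles at a rate set by the smaller eigenvalue, so lr times the number of steps has to be large compared with \(1/\lambda_2\) before the threshold is meaningful.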

Test 5: Whitening Covariance

x = make_data(4096, seed=4)
wmat = whitening_matrix(x, eps=1e-4)
z = (x - x.mean(axis=0, keepdims=True)) @ wmat
cov_z = z.T @ z / (len(z) - 1)
assert np.allclose(cov_z, np.eye(cov_z.shape[0]), atol=5e-2)

This test turns “decorrelation” from a slogan into a matrix condition: diagonal close to 1, off-diagonal close to 0.

Implementation Pattern: Diagnose Geometry, Not Just Loss

For Hebbian/PCA-style rules, useful diagnostics are norm, alignment, covariance off-diagonal mass, and component variance ordering.

Implementation Checklist

When implementing or analyzing Hebbian/Oja/BCM rules, check:

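whether the input is centered;
whether the learning rate is small enough;
whether there is a norm constraint, decay, or homeostasis;
whether multiple neurons have competition / decorrelation;
whether the online update stream has been shuffled to break temporal correlations;
whether second moment and covariance are distinguished;
whether weight norm, output variance, and principal-component alignment are logged;
whether multi-component GHA/Sanger runs check orthogonality and projected covariance;
whether whitening handles numerical amplification from small eigenvalues;
whether smoke tests verify norm, alignment, and decorrelation;
whether you avoid misreading a local correlation rule as a supervised loss gradient.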

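This is the educational value of Hebbian learning: it shows that learning need not start from a global loss and can start from local statistical structure; but without constraints and competition, correlation strengthening quickly runs out of control.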