0.4 Perceptron

Perceptron 是 neural network 史上的第一个真正意义上的可训练判别模型。它简单到几行代码就能实现，却已经包含了今天深度学习仍在使用的核心元素：参数、输入特征、线性打分、非线性决策、误差驱动更新。

Model

Definition: Perceptron

Given \(\mathbf{x}\in\mathbb{R}^{d}\) and label \(y\in\{-1,+1\}\), a perceptron predicts \[ \hat{y} = \operatorname{sgn}(\mathbf{w}^{\top}\mathbf{x}+b). \] The decision boundary is the hyperplane \(\mathbf{w}^{\top}\mathbf{x}+b=0\).

如果预测错误，即

\[ y(\mathbf{w}^{\top}\mathbf{x}+b)\leq 0, \]

则更新

\[ \mathbf{w}\leftarrow \mathbf{w}+\eta y\mathbf{x}, \qquad b\leftarrow b+\eta y. \]

这个 rule 可以理解为：如果正样本被分到负侧，就把 \(\mathbf{w}\) 朝 \(\mathbf{x}\) 推；如果负样本被分到正侧，就把 \(\mathbf{w}\) 远离 \(\mathbf{x}\) 推。

Bias as an Augmented Feature

为了推导简洁，常把 bias 合并进 feature：

\[ \tilde{\mathbf{x}} = \begin{bmatrix} \mathbf{x}\\ 1 \end{bmatrix}, \qquad \tilde{\mathbf{w}} = \begin{bmatrix} \mathbf{w}\\ b \end{bmatrix}. \]

则

\[ \mathbf{w}^\top \mathbf{x}+b = \tilde{\mathbf{w}}^\top\tilde{\mathbf{x}}. \]

Perceptron update 变成一行：

\[ \tilde{\mathbf{w}} \leftarrow \tilde{\mathbf{w}}+\eta y\tilde{\mathbf{x}}. \]

这也提醒我们：bias 的正则化和尺度并不总是和普通 feature 一样。如果把常数 1 当作一个 feature，margin bound 中的 \(R\) 也会变成 \(\|\tilde{x}\|\)。

Geometry

令 signed margin 为

\[ \gamma(\mathbf{x},y) = y(\mathbf{w}^{\top}\mathbf{x}+b). \]

分类正确等价于 \(\gamma>0\)。perceptron 更新只在 \(\gamma\leq 0\) 时发生，因此它是 mistake-driven 的。

这和现代 optimizer 的气质不同：SGD 每个 batch 都按 loss gradient 更新；perceptron 只在犯错时更新。它更像一种在线纠错机制。

Functional and Geometric Margin

对一个样本，functional margin 是：

\[ \hat{\gamma}_n = y_n(\mathbf{w}^\top\mathbf{x}_n+b). \]

它会随参数缩放改变。如果把 \((w,b)\) 乘以 \(c>0\)，functional margin 也乘以 \(c\)，但分类边界不变。

geometric margin 消除了这个缩放：

\[ \gamma_n = \frac{y_n(\mathbf{w}^\top\mathbf{x}_n+b)}{\|\mathbf{w}\|}. \]

它等于样本到超平面的有符号距离。Perceptron convergence theorem 中的 margin 更接近对单位向量 \(u\) 的 functional margin；SVM 最大化的是 normalized geometric margin。

Definition: Geometric Margin

The geometric margin of a labeled example \((x,y)\) with respect to hyperplane \((w,b)\) is \[ \frac{y(w^\top x+b)}{\|w\|}. \] It is invariant to positive rescaling of \((w,b)\).

Convergence Theorem

Theorem: Perceptron Convergence

Assume the data are linearly separable: there exists a unit vector \(\mathbf{u}\) and margin \(\gamma>0\) such that \[ y_n\mathbf{u}^{\top}\mathbf{x}_n \geq \gamma \] for all \(n\), and \(\|\mathbf{x}_n\|\leq R\). Then the perceptron algorithm makes at most \((R/\gamma)^2\) mistakes.

Proof Sketch

忽略 bias，可把 bias 合并进 augmented feature。每次犯错时

\[ \mathbf{w}_{t+1} = \mathbf{w}_{t}+y_t\mathbf{x}_t. \]

一方面，和最优方向 \(\mathbf{u}\) 的投影至少增加 \(\gamma\)：

\[ \mathbf{u}^{\top}\mathbf{w}_{t+1} = \mathbf{u}^{\top}\mathbf{w}_{t} +y_t\mathbf{u}^{\top}\mathbf{x}_t \geq \mathbf{u}^{\top}\mathbf{w}_{t}+\gamma. \]

若发生 \(M\) 次错误，则 \(\mathbf{u}^{\top}\mathbf{w}_{M}\geq M\gamma\)。

另一方面，由于更新只在 \(y_t\mathbf{w}_t^\top\mathbf{x}_t\leq0\) 时发生，

\[ \|\mathbf{w}_{t+1}\|^2 = \|\mathbf{w}_{t}\|^2 +2y_t\mathbf{w}_t^\top\mathbf{x}_t +\|\mathbf{x}_t\|^2 \leq \|\mathbf{w}_{t}\|^2+R^2. \]

所以 \(\|\mathbf{w}_{M}\|\leq R\sqrt{M}\)。结合

\[ M\gamma \leq \mathbf{u}^{\top}\mathbf{w}_M \leq \|\mathbf{w}_M\| \leq R\sqrt{M}, \]

得到 \(M\leq (R/\gamma)^2\)。

这个定理非常重要：它第一次明确告诉我们，只要数据线性可分，简单的局部更新也能在有限步内找到分类器。

Normalized Data and Learning Rate

如果使用学习率 \(\eta\)，更新为：

\[ w_{t+1}=w_t+\eta y_t x_t. \]

mistake bound 不依赖 \(\eta\)，因为 \(\eta\) 同时缩放 \(w\) 的投影和 norm。推导中：

\[ u^\top w_M\geq M\eta\gamma, \qquad \|w_M\|\leq \eta R\sqrt{M}, \]

消去 \(\eta\) 后仍得到：

\[ M\leq(R/\gamma)^2. \]

这说明 perceptron 的学习率主要改变参数尺度和中间路径，不改变可分情形下的 mistake bound 数量级。

Pitfall: Separability Is the Key Assumption

The perceptron convergence theorem does not apply to non-separable data. On noisy or overlapping classes, the algorithm may keep updating forever.

Perceptron as Subgradient Descent

Perceptron update 也可以从 loss 看。定义 perceptron loss：

\[ L(\mathbf{w};\mathbf{x},y) = \max(0,-y\mathbf{w}^\top\mathbf{x}). \]

如果样本分类正确且 margin 为正，loss 为 0，不更新。若

\[ y\mathbf{w}^\top\mathbf{x}\le0, \]

则 loss 的一个 subgradient 是

\[ \partial_{\mathbf{w}}L=-y\mathbf{x}. \]

Subgradient descent:

\[ \mathbf{w}\leftarrow \mathbf{w}-\eta(-y\mathbf{x}) = \mathbf{w}+\eta y\mathbf{x}. \]

这正是 perceptron update。

Definition: Perceptron Loss

The perceptron loss is \[ L(\mathbf{w};\mathbf{x},y)=\max(0,-y\mathbf{w}^\top\mathbf{x}), \] which penalizes misclassified examples but gives zero loss to any correctly classified example, regardless of margin size.

与 hinge loss

\[ L_{\text{hinge}} = \max(0,1-y\mathbf{w}^\top\mathbf{x}) \]

相比，perceptron loss 只要求分对；hinge loss 还要求 margin 至少为 1。这就是 SVM 和 perceptron 的关键差别：SVM 不只想分类正确，还想边界留出安全距离。

Multiclass Perceptron

二分类 perceptron 输出一个 sign。多类分类更接近现代 neural classifier：每个类别有一个 weight vector，模型先算 logits，再取最大值。

设类别集合为 \(\mathcal{Y}=\{1,\ldots,K\}\)，参数矩阵为

\[ W= \begin{bmatrix} w_1^\top\\ \vdots\\ w_K^\top \end{bmatrix} \in\mathbb{R}^{K\times d}. \]

对输入 \(x\)，类别分数为

\[ s_k=w_k^\top x, \qquad \hat y=\arg\max_{k}s_k. \]

若 \(\hat y\ne y\)，multiclass perceptron update 是：

\[ w_y\leftarrow w_y+\eta x, \qquad w_{\hat y}\leftarrow w_{\hat y}-\eta x, \]

其他类别不变。它的含义很直接：提高正确类别 logit，降低错误预测类别 logit。

Definition: Multiclass Perceptron

A multiclass perceptron scores each class by \(w_k^\top x\), predicts the highest-scoring class, and updates only the true class and the predicted wrong class when a mistake occurs.

Multiclass Loss View

Multiclass perceptron loss 可以写成

\[ L(W;x,y) = \max_{k\ne y} \left( s_k-s_y \right)_+ = \max \left( 0, \max_{k\ne y}w_k^\top x-w_y^\top x \right). \]

若错误类别

\[ \hat y=\arg\max_{k\ne y} w_k^\top x \]

超过或追平正确类别，则一个 subgradient 为：

\[ \frac{\partial L}{\partial w_{\hat y}}=x, \qquad \frac{\partial L}{\partial w_y}=-x. \]

Subgradient descent 刚好给出：

\[ w_y\leftarrow w_y+\eta x, \qquad w_{\hat y}\leftarrow w_{\hat y}-\eta x. \]

若加入 margin，得到 multiclass hinge loss：

\[ L_{\text{hinge}}(W;x,y) = \max_{k\ne y} \left( 1+w_k^\top x-w_y^\top x \right)_+. \]

这和 softmax cross entropy 的差异是：perceptron/hinge 只关注最危险的竞争类别；cross entropy 通过 softmax 对所有类别分配概率质量。

Objective	Update signal	Smooth?	Uses all classes?
multiclass perceptron	true class vs argmax wrong class	no	no
multiclass hinge	true class vs margin violator	no	usually max violator
softmax CE	probability gap \(p-e_y\)	yes	yes

Pitfall: Argmax Ties Need a Policy

Multiclass perceptron behavior depends on how ties are broken. Fix the tie-breaking rule for reproducible online training.

Margin Perceptron

可以把 perceptron 改成 margin perceptron：不仅错分时更新，margin 太小时也更新：

\[ \text{update if } y w^\top x \leq \rho. \]

其中 \(\rho>0\) 是希望保留的安全间隔。更新仍然是：

\[ w\leftarrow w+\eta yx. \]

这更接近 hinge loss 的思想。普通 perceptron 只关心符号；margin perceptron 还关心分类是否足够自信。

Non-Separable Data

若数据不可分，perceptron loss 不是强制收敛的优化目标。算法可能在几个冲突样本之间来回震荡。常见补救：

Variant	Idea
averaged perceptron	average historical weights to reduce variance
pocket algorithm	keep the best-performing weight seen so far
margin perceptron	require positive margin, not merely correct sign
logistic regression	optimize smooth probabilistic loss
SVM	optimize regularized hinge loss

因此 perceptron 是理解线性分类的入口，但不是处理 noisy data 的最终工具。

Perceptron as Online Mirror of Separability

Perceptron convergence theorem 是 separable case 的强结论；在 non-separable case 里，更有用的视角是 online learning：算法不断收到样本，先预测，再看真实标签，再更新。它不假设能一次性访问完整数据分布。

设第 \(t\) 步收到 \((x_t,y_t)\)，预测

\[ \hat y_t=\operatorname{sgn}(w_t^\top x_t), \]

然后观察 \(y_t\)。mistake indicator 为

\[ M_t=\mathbf{1}[\hat y_t\ne y_t]. \]

在线算法关心的不是固定训练集 loss，而是累计错误

\[ M_{1:T}=\sum_{t=1}^{T}M_t. \]

在可分情形，perceptron 的 mistake bound 告诉我们 \(M_{1:T}\) 有一个和 \(T\) 无关的上界；在不可分情形，通常只能和某个 comparator \(u\) 比较：

\[ \sum_{t=1}^{T}M_t \quad \text{versus} \quad \sum_{t=1}^{T}\ell(u;x_t,y_t). \]

这就是 regret view 的入口。Perceptron 本身没有像 online gradient descent 那样给出一般 convex regret bound，但它让我们看到现代在线优化的三个组成部分已经出现：

Ingredient	Perceptron form	Modern online optimization form
state	weight vector \(w_t\)	parameters \(\theta_t\)
feedback	mistake signal	loss/subgradient
update	\(w_{t+1}=w_t+y_tx_t\)	\(\theta_{t+1}=\Pi(\theta_t-\eta g_t)\)
comparator	separating vector \(u\)	best fixed parameter in hindsight

Definition: Regret

Regret compares an online learner’s cumulative loss to the cumulative loss of a fixed comparator chosen in hindsight.

这个视角解释了 averaged perceptron 为什么重要：在线更新会受样本顺序影响，最后一个 \(w_T\) 不一定代表整个训练过程的稳定判断；平均权重相当于把历史上的多个 online classifiers ensemble 起来。

XOR and the First Winter

Perceptron 的局限同样清楚：它只能学习线性可分模式。XOR 数据：

\(x_1\)	\(x_2\)	label
0	0	0
0	1	1
1	0	1
1	1	0

没有任何一条直线能把正负样本分开。这一问题在 Minsky 和 Papert 对 perceptron 的批评中成为神经网络早期受挫的重要象征。

Pitfall: XOR Does Not Kill Neural Networks

XOR only shows that a single linear threshold unit is insufficient. A multilayer network with nonlinear hidden units can represent XOR easily. The real historical bottleneck was not representation alone, but how to train multilayer networks effectively before backpropagation became practical.

From Perceptron to Logistic Regression

perceptron 使用 sign 作为 hard decision。若把它替换成 sigmoid：

\[ p(y=1\mid \mathbf{x}) = \sigma(\mathbf{w}^{\top}\mathbf{x}+b), \]

再用 negative log-likelihood，就得到 logistic regression。两者区别是：

Model	Output	Objective	Update signal
Perceptron	hard sign	implicit mistake correction	only wrong examples
Logistic regression	probability	cross entropy	all examples, scaled by confidence

这也是现代深度学习偏好 soft differentiable objectives 的原因：它不仅知道错没错，还知道错得有多自信。

Averaged Perceptron

在线学习中，最后一次迭代的 \(\mathbf{w}\) 可能受样本顺序影响很大。Averaged perceptron 使用所有历史权重的平均：

\[ \bar{\mathbf{w}} = \frac1T\sum_{t=1}^{T}\mathbf{w}_t. \]

直觉是减少 online update 的方差。很多 NLP 早期结构化预测系统使用 averaged perceptron，因为它简单、快，并且对稀疏高维特征很有效。

Definition: Online Learning

Online learning updates the model sequentially as examples arrive, often making one pass or a small number of passes over the data rather than optimizing a full-batch objective.

Perceptron 是 online learning 的经典原型：每看到一个错误就立刻修正。

Kernel Perceptron

如果数据在线性空间不可分，可以把输入映射到 feature space：

\[ \phi(\mathbf{x}). \]

Perceptron 权重可以写成训练样本的线性组合：

\[ \mathbf{w} = \sum_{n}\alpha_n y_n\phi(\mathbf{x}_n). \]

预测：

\[ \operatorname{sgn} \left( \sum_n \alpha_n y_n \phi(\mathbf{x}_n)^\top\phi(\mathbf{x}) \right). \]

用 kernel

\[ K(\mathbf{x}_n,\mathbf{x}) = \phi(\mathbf{x}_n)^\top\phi(\mathbf{x}) \]

即可避免显式构造 \(\phi\)：

\[ \hat{y} = \operatorname{sgn} \left( \sum_n\alpha_ny_nK(\mathbf{x}_n,\mathbf{x}) \right). \]

Theorem: Perceptron Weights Lie in the Span of Mistake Examples

Starting from \(\mathbf{w}_0=0\), every perceptron weight vector can be written as a linear combination of examples on which mistakes were made.

Proof

初始化 \(\mathbf{w}_0=0\)，显然在样本 span 中。每次更新为

\[ \mathbf{w}_{t+1} = \mathbf{w}_t+\eta y_t\mathbf{x}_t. \]

如果 \(\mathbf{w}_t\) 是过去 mistake examples 的线性组合，那么加上当前 mistake example 后，\(\mathbf{w}_{t+1}\) 仍然是 mistake examples 的线性组合。归纳成立。

Kernel perceptron 是从线性模型走向非线性模型的一条路；MLP 则是另一条路：不固定 kernel，而是用数据学习 feature map。

Multilayer Perceptron

MLP 可以看成把 perceptron 的线性阈值扩展成多层可微变换：

\[ \mathbf{h} = \phi(W_1\mathbf{x}+\mathbf{b}_1), \qquad \hat{\mathbf{y}} = W_2\mathbf{h}+\mathbf{b}_2. \]

hidden layer 的作用是学习一个 feature map，把原本线性不可分的数据映射到更容易分割的空间。XOR 的经典解法就是用 hidden units 表示两个中间概念：

\[ x_1 \land \neg x_2, \qquad \neg x_1 \land x_2. \]

然后输出层做 OR。

Explicit XOR Construction

用输入 \(x_1,x_2\in\{0,1\}\)，构造两个 hidden units：

\[ h_1=\mathbf{1}[x_1-x_2-0.5>0], \]

\[ h_2=\mathbf{1}[x_2-x_1-0.5>0]. \]

当 \((x_1,x_2)=(1,0)\)，\(h_1=1,h_2=0\)；当 \((0,1)\)，\(h_1=0,h_2=1\)；其他两个输入 \(h_1=h_2=0\)。输出层：

\[ \hat{y} = \mathbf{1}[h_1+h_2-0.5>0]. \]

这说明 XOR 并不是“神经元不能表示”，而是“单层线性阈值不能表示”。多层结构解决表达力问题，backprop 解决训练 credit assignment 问题。

Minimal Implementation

import numpy as np


def perceptron_train(x: np.ndarray, y: np.ndarray, epochs: int, lr: float):
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(x, y):
            if yi * (w @ xi + b) <= 0:
                w += lr * yi * xi
                b += lr * yi
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

这段代码就是 early neural learning 的骨架。今天我们训练 LLM 时用的 optimizer、loss scaling、distributed training 看上去复杂得多，但最小思想仍然是：定义一个可调参数化函数，用反馈信号把参数往正确行为方向推。

Averaged Perceptron Implementation

朴素 averaged perceptron 每一步保存 \(w_t\) 再求平均，成本高。可以在线累计：

import numpy as np


def averaged_perceptron_train(
    x: np.ndarray,
    y: np.ndarray,
    epochs: int,
    lr: float,
):
    w = np.zeros(x.shape[1])
    b = 0.0
    w_sum = np.zeros_like(w)
    b_sum = 0.0
    count = 0
    for _ in range(epochs):
        for xi, yi in zip(x, y):
            if yi * (w @ xi + b) <= 0:
                w += lr * yi * xi
                b += lr * yi
            w_sum += w
            b_sum += b
            count += 1
    return w_sum / count, b_sum / count

这在稀疏 NLP 特征中很常用，因为它几乎不增加训练复杂度，却能显著降低 online 顺序带来的方差。

Voted Perceptron

另一种方式是保存每段连续正确期间的权重和计数：

\[ \{(w^{(1)},c_1),\ldots,(w^{(K)},c_K)\}. \]

预测时投票：

\[ \hat{y} = \operatorname{sgn} \left( \sum_{k=1}^{K} c_k\operatorname{sgn}((w^{(k)})^\top x) \right). \]

averaged perceptron 可以看成 voted perceptron 的线性化近似：把多个 historical classifiers 合成一个平均权重。

Efficient Averaging With Timestamps

上面的 averaged implementation 每一步都执行 w_sum += w。如果特征是百万维稀疏向量，这种 dense 累加很浪费。经典 averaged perceptron 会用 timestamp trick：只有某个 feature 被更新时，才补齐它从上次更新到当前 step 的贡献。

设第 \(j\) 个权重当前值为 \(w_j\)，上次被 flush 的时间为 \(t_j\)，累计和为 \(a_j\)。当全局 step 从 \(t_j\) 走到 \(t\)，如果 \(w_j\) 中间没变，那么它对平均值的贡献是

\[ (t-t_j)w_j. \]

更新 feature \(j\) 前先执行：

\[ a_j\leftarrow a_j+(t-t_j)w_j, \qquad t_j\leftarrow t. \]

然后再改 \(w_j\)。训练结束时，对所有非零或曾经出现过的 feature 再 flush 一次：

\[ a_j\leftarrow a_j+(T-t_j)w_j, \qquad \bar w_j=\frac{a_j}{T}. \]

这个 trick 把每一步 dense averaging 变成只遍历当前样本的 active features。对稀疏 NLP 特征，它常常是 averaged perceptron 能否实用的关键。

from collections import defaultdict


class SparseAveragedWeights:
    def __init__(self):
        self.w = defaultdict(float)
        self.total = defaultdict(float)
        self.last = defaultdict(int)
        self.t = 0

    def _flush(self, key):
        self.total[key] += (self.t - self.last[key]) * self.w[key]
        self.last[key] = self.t

    def add(self, key, delta):
        self._flush(key)
        self.w[key] += delta

    def tick(self):
        self.t += 1

    def averaged(self):
        keys = set(self.w) | set(self.total)
        out = {}
        for key in keys:
            self._flush(key)
            out[key] = self.total[key] / max(self.t, 1)
        return out

Implementation Pattern: Lazy Averaging

For sparse online models, maintain cumulative weight-time products lazily with per-feature timestamps instead of adding a full dense vector at every step.

Structured Perceptron

Multiclass perceptron 仍然假设输出是一个类别。早期 NLP 很多任务的输出是结构：POS tag sequence、NER spans、dependency tree、translation alignment。Structured perceptron 把 perceptron 的 argmax 扩展到结构空间：

\[ \hat y = \arg\max_{y'\in\mathcal{Y}(x)} w^\top\Phi(x,y'), \]

其中 \(\Phi(x,y)\) 是输入和候选输出的 joint feature map。若预测结构 \(\hat y\) 与真实结构 \(y\) 不同，则更新：

\[ w \leftarrow w+\eta \left( \Phi(x,y)-\Phi(x,\hat y) \right). \]

Definition: Structured Perceptron

Structured perceptron predicts the highest-scoring structured output and updates toward the gold structure’s features and away from the predicted structure’s features.

这和多类 perceptron 是同一个公式：把类别 \(k\) 看成结构 \(y'\)，把 \(w_k^\top x\) 看成 \(w^\top\Phi(x,k)\)。区别在于 \(\mathcal{Y}(x)\) 可能指数大，不能枚举所有输出。

Sequence Tagging Example

对序列标注，输入是 tokens \(x_{1:T}\)，输出是 tags \(y_{1:T}\)。可以定义 feature：

\[ \Phi(x,y) = \sum_{t=1}^{T} \phi_{\text{emit}}(x,t,y_t) + \sum_{t=2}^{T} \phi_{\text{trans}}(y_{t-1},y_t). \]

score 分解为 emission score 与 transition score：

\[ s(x,y) = \sum_{t=1}^{T} E_t(y_t) + \sum_{t=2}^{T} A_{y_{t-1},y_t}. \]

预测需要求

\[ \hat y=\arg\max_y s(x,y), \]

可以用 Viterbi 动态规划，而不是枚举 \(K^T\) 个 tag 序列。

def viterbi(emission, transition):
    # emission: [T, K], transition: [K, K]
    T, K = emission.shape
    dp = emission[0].copy()
    back = []
    for t in range(1, T):
        scores = dp[:, None] + transition
        prev = scores.argmax(axis=0)
        dp = scores[prev, range(K)] + emission[t]
        back.append(prev)

    tag = int(dp.argmax())
    path = [tag]
    for prev in reversed(back):
        tag = int(prev[tag])
        path.append(tag)
    return list(reversed(path))

Structured perceptron 的训练循环就是：

用 Viterbi 找当前最高分结构 \(\hat y\)；
若 \(\hat y\ne y\)，加上 gold features；
减去 predicted features；
对权重做 averaging。

Pitfall: Inference Is Part of the Model

For structured perceptron, the decoder used during training defines the update. Approximate decoding, beam search, or different tie-breaking changes the learning algorithm.

Cost-Augmented Update

如果只在 \(\hat y\ne y\) 时更新，所有错误结构被同等对待。但有些错误更严重：NER 中把整段实体漏掉，比把 B-PER 写成 I-PER 更坏。可以用 cost-augmented decoding：

\[ \hat y = \arg\max_{y'\in\mathcal{Y}(x)} \left[ w^\top\Phi(x,y')+\Delta(y,y') \right], \]

其中 \(\Delta(y,y')\) 是任务损失。这样模型会主动寻找“高分且代价大”的错误结构，再把它压下去。这个思想后来进入 structured SVM、max-margin parsing 和很多 task-specific loss。

Dual and Kernel Implementation Details

Kernel perceptron 不显式维护 \(w\)，而是维护 mistake counts \(\alpha_i\)。若第 \(i\) 个样本被错分：

\[ \alpha_i\leftarrow \alpha_i+1. \]

预测第 \(m\) 个样本：

\[ s_m = \sum_{i:\alpha_i>0} \alpha_i y_i K(x_i,x_m). \]

最小实现：

def kernel_perceptron_train(x, y, kernel, epochs):
    alpha = np.zeros(len(x), dtype=np.int64)
    for _ in range(epochs):
        mistakes = 0
        for m, xm in enumerate(x):
            score = 0.0
            for i, ai in enumerate(alpha):
                if ai:
                    score += ai * y[i] * kernel(x[i], xm)
            if y[m] * score <= 0:
                alpha[m] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

计算成本是一个现实问题：若 support mistakes 很多，单次预测要对许多历史样本算 kernel。kernel perceptron 的表达力更强，但把参数向量换成了样本记忆。

Pitfall: Kernel Methods Move Cost From Parameters to Examples

Kernel perceptron avoids explicit feature maps, but prediction cost grows with the number of stored mistake examples.

Gram Matrix, Cache, and Budget

若训练集有 \(N\) 个样本，预计算 Gram matrix

\[ G_{ij}=K(x_i,x_j) \]

可以把每次 kernel 查询变成数组读取，但内存是 \(O(N^2)\)。当 \(N=100{,}000\) 时，float32 Gram matrix 需要

\[ N^2\cdot4 = 40\text{ GB}. \]

这还没算 alpha、labels、features 和其他缓存。因此 kernel perceptron 的工程选择通常是三选一：

Strategy	Memory	Compute	When useful
no cache	low	high	expensive memory, cheap kernel
full Gram matrix	\(O(N^2)\)	low lookup cost	small datasets
LRU kernel cache	bounded	medium	repeated support queries

support set 大小时，预测复杂度近似为

\[ O(|S|\cdot C_K), \]

其中 \(|S|=|\{i:\alpha_i>0\}|\)，\(C_K\) 是一次 kernel 计算成本。若用 polynomial/RBF kernel，\(C_K\) 通常至少和 feature dimension 成正比；如果 \(|S|\) 接近 \(N\)，预测就变成 memory-based method。

一种 practical budget perceptron 会限制 support 数量：

若新样本犯错，加入 support；
若 support 数超过预算 \(B\)，删除一个 support；
删除策略可以是 oldest、smallest \(\alpha_i\)、或者近似冗余样本。

这会破坏经典 convergence proof，但让模型有固定推理成本：

\[ O(B\cdot C_K). \]

Definition: Budgeted Kernel Perceptron

A budgeted kernel perceptron limits the number of stored support examples so that prediction cost remains bounded.

这条线后来和 support vector machine、online kernel learning、nearest-neighbor memory 都有关系：非参数表达力强，但系统成本会跟训练样本数绑定。

Sparse Feature Engineering

早期 perceptron 在 NLP 中好用，一个重要原因是它适合稀疏离散特征。输入不是 dense float vector，而是 feature map：

\[ x=\{f_1:1,\; f_2:1,\; f_3:2,\ldots\}. \]

score 只需要遍历 active features：

\[ w^\top x = \sum_{f\in\operatorname{active}(x)}w_f x_f. \]

更新也只改 active features：

\[ w_f\leftarrow w_f+\eta yx_f. \]

这让每个样本的成本从 \(O(d)\) 变成 \(O(\operatorname{nnz}(x))\)。如果词表、n-gram、模板特征有几百万维，但每个句子只激活几千个特征，稀疏 perceptron 仍然很快。

def sparse_score(weights, features):
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def sparse_update(weights, features, scale):
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) + scale * value

Feature Hashing

为了避免维护巨大字符串到 id 的词典，可以用 hashing trick：

\[ h(f)\in\{0,\ldots,D-1\}. \]

所有特征写进固定长度数组。若再用 signed hashing：

\[ s(f)\in\{-1,+1\}, \]

则特征贡献为

\[ x_{h(f)}\leftarrow x_{h(f)}+s(f)\cdot v_f. \]

hash collision 会让不同特征共享权重，这是 bias；但它换来固定内存和简单部署。在线模型里这很常见，因为新特征可以不经过 vocabulary rebuild 直接进入模型。

Pitfall: Feature Scale Controls the Mistake Bound

The perceptron mistake bound depends on \(R=\max_n\|x_n\|\). Unnormalized count features, long documents, or high-degree polynomial features can increase \(R\) and make updates unstable.

Implementation Checklist

训练 perceptron 或阅读相关算法时，检查：

label 是否使用 \(\{-1,+1\}\) 而不是 \(\{0,1\}\)；
bias 是否合并进 augmented feature；
feature norm \(R\) 是否被归一化；
数据是否线性可分；
是否需要 margin update 而不是只在错分时更新；
非可分数据是否保存 best/averaged weights；
kernel perceptron 的 support set 是否过大；
多类/结构化任务的 argmax 或 decoder 是否和训练一致；
averaged perceptron 是否用 lazy timestamp 避免 dense 累加；
kernel cache 或 support budget 是否符合推理成本；
sparse/hash features 是否记录了 normalization 和 collision policy；
评估是否受样本顺序影响。

From Perceptron to Modern Classifiers

Perceptron 留下了三条线：

Perceptron idea	Modern continuation
linear score \(\mathbf{w}^\top x\)	logits
mistake-driven update	margin-based losses
kernel trick	similarity models and implicit features
hidden feature map	MLP/deep networks
online update	streaming and continual learning

它的局限也同样重要：hard threshold 不可微，单层模型表达力弱，非可分数据上不会收敛。现代深度学习并不是抛弃 perceptron，而是把它的线性打分器放进可微多层表示中，再用稳定的 loss 和 optimizer 训练。