2.3 Convolutional Neural Networks


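The core of a CNN is not "image models use convolution" but two inductive biases:

  1. locality: nearby pixels are more strongly related than distant ones;
  2. translation equivariance: when the input is shifted, the feature map shifts by the same amount.

Together, these two assumptions compress one enormous fully connected layer into a small shared convolution kernel, so images, speech, and locally structured sequences can be modeled with far fewer parameters.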

From Fully Connected to Convolution

Let the input image be \(X\in\mathbb{R}^{H\times W}\) and the output feature map be \(Y\in\mathbb{R}^{H'\times W'}\). A fully general linear layer can be written as:

\[ Y_{i,j} = b_{i,j} + \sum_{u,v} W_{i,j,u,v}X_{u,v}. \]

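Here every output position \((i,j)\) has its own set of weights, so the weight tensor has about \(H'W'HW\) entries. Re-index \(u,v\) as offsets relative to the output position, \(u=i+a,\ v=j+b\); locality then restricts \((a,b)\) to a small window around zero: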

\[ Y_{i,j} = b_{i,j} + \sum_{a,b} V_{i,j,a,b}X_{i+a,j+b}. \]

Next, impose translation equivariance so that the weights do not depend on absolute position:

\[ V_{i,j,a,b}=K_{a,b}, \qquad b_{i,j}=b. \]

This gives the convolution (strictly, cross-correlation) form:

\[ Y_{i,j} = b+ \sum_{a,b} K_{a,b}X_{i+a,j+b}. \]

Definition: Translation Equivariance

A mapping \(F\) is translation equivariant if translating the input translates the output: \(F(T_\delta x)=T_\delta F(x)\).

Theorem: Shared Convolution Is Translation Equivariant

For infinite or properly padded grids, the cross-correlation operator

\[ (F_KX)_{i,j}=\sum_{a,b}K_{a,b}X_{i+a,j+b} \]

satisfies \(F_K(T_\delta X)=T_\delta(F_KX)\).

Proof

Let the translation operator satisfy \((T_\delta X)_{i,j}=X_{i+\delta_1,j+\delta_2}\). Then

\[ \begin{aligned} (F_K(T_\delta X))_{i,j} &= \sum_{a,b}K_{a,b}(T_\delta X)_{i+a,j+b}\\ &= \sum_{a,b}K_{a,b}X_{i+a+\delta_1,j+b+\delta_2}\\ &= (F_KX)_{i+\delta_1,j+\delta_2}\\ &= (T_\delta F_KX)_{i,j}. \end{aligned} \]

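What matters is not the specific kernel values but that the same \(K\) is shared across all positions. If every position had its own \(K_{i,j}\), the third equality above would generally fail.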
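
The theorem can be checked numerically. The following is a minimal one-dimensional sketch in plain Python, assuming circular (periodic) boundaries so that the finite grid behaves like an infinite one; the helper names `shift` and `corr1d_circular` are hypothetical and exist only for this illustration.

```python
def shift(x, delta):
    """Translation T_delta: (T_delta x)[i] = x[i + delta], with wrap-around."""
    n = len(x)
    return [x[(i + delta) % n] for i in range(n)]


def corr1d_circular(x, k):
    """Cross-correlation y[i] = sum_a k[a] * x[i + a] on a periodic grid."""
    n = len(x)
    return [sum(k[a] * x[(i + a) % n] for a in range(len(k))) for i in range(n)]


x = [1.0, 4.0, 2.0, 7.0, 3.0, 5.0]
k = [0.5, -1.0, 2.0]

lhs = corr1d_circular(shift(x, 2), k)   # F_K(T_delta x)
rhs = shift(corr1d_circular(x, k), 2)   # T_delta(F_K x)
print(lhs == rhs)
```

With zero padding instead of periodic boundaries the two sides would agree only away from the border, which is why the theorem is stated for infinite or properly padded grids.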

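The parameter savings of a CNN come from weight sharing. With \(H'W'\) output positions and a \(k_h\times k_w\) kernel, a locally connected layer without sharing needs about \(H'W'k_hk_w\) parameters, while a shared convolution kernel needs only \(k_hk_w\).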

Cross-Correlation vs. Mathematical Convolution

Conv2d in deep learning libraries usually implements cross-correlation:

\[ Y_{i,j} = \sum_{a,b}K_{a,b}X_{i+a,j+b}, \]

Mathematical convolution flips the kernel:

\[ Y_{i,j} = \sum_{a,b}K_{a,b}X_{i-a,j-b}. \]

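This does not affect what the model can learn: the kernel is a learnable parameter, so training simply learns the flipped kernel if that is what the data requires.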
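
The relationship between the two operations can be made concrete with a small one-dimensional sketch. It assumes "valid" alignment (no padding), so the convolution index is shifted to stay inside the input; the function names are hypothetical.

```python
def corr1d(x, k):
    """Valid cross-correlation: y[i] = sum_a k[a] * x[i + a]."""
    n_out = len(x) - len(k) + 1
    return [sum(k[a] * x[i + a] for a in range(len(k))) for i in range(n_out)]


def conv1d(x, k):
    """Valid mathematical convolution: the kernel is traversed in reverse."""
    m = len(k)
    n_out = len(x) - m + 1
    return [sum(k[a] * x[i + m - 1 - a] for a in range(m)) for i in range(n_out)]


x = [1, 2, 3, 4, 5]
k = [1, 0, -1]

print(corr1d(x, k))         # cross-correlation with k
print(conv1d(x, k))         # convolution with k
print(corr1d(x, k[::-1]))   # cross-correlation with the flipped kernel
```

The last two lines print the same list: convolving with a kernel is cross-correlating with its flipped copy, so a learned kernel can absorb the difference.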

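import torch
import torch.nn.functional as F


def corr2d(x, k):
    """Valid (no padding, stride 1) 2D cross-correlation of x with kernel k."""
    kh, kw = k.shape
    oh = x.shape[0] - kh + 1
    ow = x.shape[1] - kw + 1
    # Allocate the output with the same dtype and device as the input.
    y = x.new_empty(oh, ow)
    for i in range(oh):
        for j in range(ow):
            # Elementwise product of the window and the kernel, then sum.
            y[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return y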

Output Size

For a single spatial dimension, let the input length be \(n\), kernel size \(k\), padding \(p\), stride \(s\), and dilation \(d\). The effective kernel size is:

\[ k_{\text{eff}} = d(k-1)+1. \]

The output length is:

\[ n_{\text{out}} = \left\lfloor \frac{n+2p-k_{\text{eff}}}{s} \right\rfloor +1. \]

Pitfall: Output Size Uses Effective Kernel

Dilation changes the effective kernel size. The output-size formula must use \(d(k-1)+1\), not just \(k\).
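
The formula is short enough to keep as a helper when planning layer shapes. This is a minimal sketch for one spatial dimension; `conv_out_len` is a hypothetical name, and the example sizes are chosen only for illustration.

```python
def conv_out_len(n, k, p=0, s=1, d=1):
    """Output length along one spatial dimension."""
    k_eff = d * (k - 1) + 1          # effective kernel size under dilation
    return (n + 2 * p - k_eff) // s + 1


print(conv_out_len(224, k=3, p=1))         # 3x3, stride 1, padding 1
print(conv_out_len(224, k=7, p=3, s=2))    # 7x7, stride 2, padding 3
print(conv_out_len(32, k=3, p=1, d=2))     # same padding, but dilation 2
```

In the last call the padding that would preserve the length for \(d=1\) is no longer enough, because the effective kernel has grown from 3 to 5.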

Padding Semantics

Padding is not just a shape trick; it implies a boundary condition. Common choices:

| Padding | Boundary assumption | Effect |
|---|---|---|
| zero | image outside the boundary is black / absent | simplest, but border statistics change |
| reflect | boundary mirrors back | common in restoration tasks |
| replicate | boundary value repeats | preserves edge value, can create flat borders |
| circular | periodic boundary | useful for synthetic periodic signals |

For an odd kernel, stride \(1\), and dilation \(d\), keeping \(n_{\text{out}}=n\) usually means choosing

\[ p=\frac{d(k-1)}{2}. \]

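When \(d(k-1)\) is odd, which requires an even kernel size, the padding cannot be split evenly between the two sides and must be asymmetric. PyTorch's Conv2d(padding=p) with a numeric p pads both sides equally, so the asymmetric case needs an explicit F.pad: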

import torch.nn.functional as F


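def conv2d_same_asym(x, conv, pad_left, pad_right, pad_top, pad_bottom):
    # F.pad lists the last dimension first: (left, right) for W, then (top, bottom) for H.
    x = F.pad(x, (pad_left, pad_right, pad_top, pad_bottom))
    # conv is expected to be built with padding=0 so it does not pad again.
    return conv(x)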

Pitfall: Same Padding Is Not Always Causal or Centered

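The label same only fixes the output shape; it says nothing about information flow. A causal convolution for sequence tasks must pad only on the left, while symmetric padding in image tasks is what makes the local window centered on the current pixel.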
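
The difference between centered and causal padding can be sketched as a split of the same total. The helper `same_padding_1d` is hypothetical, and placing the extra element on the right in the uneven case is an assumption made here; frameworks differ on that convention.

```python
def same_padding_1d(k, d=1, causal=False):
    """Return (left, right) padding that keeps n_out == n for stride 1."""
    total = d * (k - 1)              # total padding needed along this dimension
    if causal:
        return total, 0              # all padding on the left: no access to the future
    left = total // 2
    return left, total - left        # assumption: any extra element goes on the right


print(same_padding_1d(3))                   # odd kernel: symmetric
print(same_padding_1d(4))                   # even kernel: asymmetric
print(same_padding_1d(3, d=2, causal=True)) # causal: left only
```

All three calls preserve the output length, yet each gives the output position a different view of its neighbors.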

Channels and Parameter Count

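Real images have channels. Input:

X: [B, C_in, H, W]

Convolution weight:

W: [C_out, C_in, K_h, K_w]

Output:

Y: [B, C_out, H_out, W_out]

Formula, writing \(b_0\) for the batch index so that it does not collide with the kernel offset \(b\):

\[ Y_{b_0,c_o,i,j} = b_{c_o} + \sum_{c_i=1}^{C_{\text{in}}} \sum_{a,b} W_{c_o,c_i,a,b} X_{b_0,c_i,i+a,j+b}. \]

Parameter count: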

\[ N_{\text{param}} = C_{\text{out}}C_{\text{in}}K_hK_w + \mathbf{1}_{\text{bias}}C_{\text{out}}. \]
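
As a quick arithmetic check of this count, here is a minimal sketch for the standard (ungrouped) case; `conv2d_param_count` is a hypothetical helper, and the layer sizes are illustrative.

```python
def conv2d_param_count(c_in, c_out, k_h, k_w, bias=True):
    """Number of learnable parameters in a standard (groups=1) 2D convolution."""
    weights = c_out * c_in * k_h * k_w
    return weights + (c_out if bias else 0)


print(conv2d_param_count(3, 64, 3, 3))                 # 64*3*9 weights + 64 biases
print(conv2d_param_count(64, 128, 1, 1, bias=False))   # 1x1 convolution, no bias
```

Note that the count does not depend on \(H\) or \(W\): the same kernel is reused at every spatial position.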

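A 1x1 convolution is channel mixing shared across every spatial position (with batch index \(b_0\) and bias vector \(\mathbf{b}\)):

\[ Y_{b_0,:,i,j} = WX_{b_0,:,i,j}+\mathbf{b}. \]

It mixes channels but not spatial positions, and is commonly used to change channel width and to build bottleneck blocks.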

Tensor Layout and Memory Format

PyTorch's logical layout is usually written [B, C, H, W], but the underlying memory can be contiguous NCHW or the channels_last (NHWC-like) memory format. The two have the same tensor shape but different strides:

x = torch.randn(8, 64, 56, 56)
x_cl = x.to(memory_format=torch.channels_last)
print(x.stride())
print(x_cl.stride())

The convolution sees the same mathematical object either way; performance differences come from the memory access pattern and the underlying kernel. Rules of thumb:

| Setting | Often faster with |
|---|---|
| CUDA + AMP/BF16/FP16 CNN | channels_last |
| CPU, small tensors | benchmark required |
| custom ops expecting contiguous NCHW | plain contiguous |

Pitfall: Shape Equality Does Not Mean Layout Equality

x.shape == x_cl.shape does not imply the same memory format. A custom CUDA/Triton op may silently assume contiguous NCHW and read the wrong elements, and a .view(...) call that is incompatible with the strides raises an error. Prefer .reshape(...) when the layout may be non-contiguous, and profile before changing the global memory format.
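
To see why the shapes agree while the layouts differ, the element strides of both formats can be written out by hand. This is a sketch in plain Python under the usual dense-storage assumption; the two function names are hypothetical, and the results are expected to match what x.stride() and x_cl.stride() print for the tensor above.

```python
def contiguous_strides(b, c, h, w):
    """Element strides for a contiguous NCHW tensor of logical shape [B, C, H, W]."""
    return (c * h * w, h * w, w, 1)


def channels_last_strides(b, c, h, w):
    """Element strides for the same logical shape stored in NHWC order."""
    return (h * w * c, 1, w * c, c)


shape = (8, 64, 56, 56)
print(contiguous_strides(*shape))
print(channels_last_strides(*shape))
```

In the channels_last case the channel stride is 1, so all channels of one pixel sit next to each other in memory, which is the access pattern many mixed-precision convolution kernels prefer.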

Receptive Field

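The receptive field describes how much of the input a single output position can see. Let layer \(\ell\) have kernel size \(k_\ell\), stride \(s_\ell\), and dilation \(d_\ell\). Define the jump \(j_\ell\) as the spacing, measured on the original input, between adjacent output positions, and the receptive field \(r_\ell\) as the input extent covered by one output position: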

\[ j_\ell=j_{\ell-1}s_\ell, \qquad r_\ell=r_{\ell-1}+(k_\ell-1)d_\ell j_{\ell-1}. \]

Initial values:

\[ j_0=1,\qquad r_0=1. \]

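For example, two stacked \(3\times3\) stride-1 convolutions have a \(5\times5\) receptive field, not \(6\times6\), because the two windows overlap: each extra layer adds only \(k-1=2\) pixels per dimension.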
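
As a worked instance of the recurrence with a stride larger than one (the layer configuration is chosen only for illustration), take a \(3\times3\) stride-2 convolution followed by a \(3\times3\) stride-1 convolution, both with \(d=1\):

\[ \begin{aligned} j_1 &= 1\cdot 2 = 2, & r_1 &= 1+(3-1)\cdot 1\cdot 1 = 3,\\ j_2 &= 2\cdot 1 = 2, & r_2 &= 3+(3-1)\cdot 1\cdot 2 = 7. \end{aligned} \]

The second layer adds \(4\) input pixels rather than \(2\), because each of its steps spans \(j_1=2\) input pixels.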

Receptive Field Center

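Knowing \(r_\ell\) alone is not enough: when debugging detection or segmentation networks you also need to know where each output grid point is centered in the original image. In one dimension, let \(a_\ell\) be the original-input coordinate of the center of output position \(0\) at layer \(\ell\). With padding \(p_\ell\),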

\[ a_\ell = a_{\ell-1} + \left(\frac{k_{\text{eff},\ell}-1}{2}-p_\ell\right)j_{\ell-1}, \qquad k_{\text{eff},\ell}=d_\ell(k_\ell-1)+1. \]

The initial value \(a_0=0.5\) is the center of the first pixel. The full recurrence:

\[ \begin{aligned} j_\ell &= j_{\ell-1}s_\ell,\\ r_\ell &= r_{\ell-1}+(k_{\text{eff},\ell}-1)j_{\ell-1},\\ a_\ell &= a_{\ell-1}+\left(\frac{k_{\text{eff},\ell}-1}{2}-p_\ell\right)j_{\ell-1}. \end{aligned} \]

def rf_table(layers):
    # layers: list of dicts with k, s, p, d
    jump, field, start = 1, 1, 0.5
    rows = []
    for layer in layers:
        k = layer["k"]
        s = layer.get("s", 1)
        p = layer.get("p", 0)
        d = layer.get("d", 1)
        k_eff = d * (k - 1) + 1
        start = start + ((k_eff - 1) / 2 - p) * jump
        field = field + (k_eff - 1) * jump
        jump = jump * s
        rows.append({"jump": jump, "rf": field, "start": start})
    return rows

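Intuitively, jump is how many input pixels one step on the feature map corresponds to, rf is the size of the window seen, and start tells you whether the alignment is off by half a cell. Many dense-prediction bugs are not the model failing to converge but the output being misaligned with the labels after a particular combination of upsampling, padding, and stride.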

Pooling

Pooling is aggregation over a fixed window, either a maximum:

\[ Y_{i,j} = \max_{(a,b)\in\mathcal{W}}X_{i+a,j+b} \]

or an average:

\[ Y_{i,j} = \frac{1}{|\mathcal{W}|} \sum_{(a,b)\in\mathcal{W}}X_{i+a,j+b}. \]

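MaxPool keeps the strongest local response; AvgPool acts more like a low-pass smoother. Modern networks also often replace pooling with strided convolution so that downsampling is learnable too.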

Dilation and Causal CNN

Dilated convolution inserts gaps between kernel elements. It enlarges the receptive field without adding parameters:

k=3, dilation=2 sees positions [i-2, i, i+2]

In sequence modeling, a causal convolution must guarantee that output position \(t\) depends only on \(x_{\leq t}\):

\[ y_t = \sum_{a=0}^{k-1} w_a x_{t-a}. \]

In practice this is usually implemented by left padding with \(d(k-1)\) samples:

class CausalConv1d(torch.nn.Module):
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.pad = dilation * (kernel_size - 1)
        self.conv = torch.nn.Conv1d(
            channels,
            channels,
            kernel_size,
            dilation=dilation,
        )

    def forward(self, x):
        x = F.pad(x, (self.pad, 0))
        return self.conv(x)
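
Causality is easy to test: changing a future input must leave every earlier output untouched. The following plain-Python sketch implements the formula above with an added dilation \(d\) (an extension assumed here, so that the taps are \(x_{t-ad}\)); `causal_conv1d` is a hypothetical name, and out-of-range inputs are treated as zero, which plays the role of the left padding.

```python
def causal_conv1d(x, w, d=1):
    """y[t] = sum_a w[a] * x[t - a*d], treating x at negative indices as zero."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for a, w_a in enumerate(w):
            src = t - a * d
            if src >= 0:             # implicit left padding with zeros
                acc += w_a * x[src]
        y.append(acc)
    return y


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
w = [0.5, 0.25, 0.125]
y = causal_conv1d(x, w, d=2)

# Perturb only the last input; every earlier output must be unchanged.
x_future = x[:-1] + [100.0]
y_future = causal_conv1d(x_future, w, d=2)
print(y[:-1] == y_future[:-1])
print(y[-1] != y_future[-1])
```

The same perturbation test applies to a module-based implementation: if symmetric padding had been used by mistake, earlier outputs would change too.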

Depthwise Separable Convolution

Parameter count of a standard convolution (ignoring bias):

\[ C_{\text{out}}C_{\text{in}}K_hK_w. \]

Depthwise separable convolution splits this into:

  1. depthwise: each input channel gets its own spatial convolution;
  2. pointwise: a \(1\times1\) convolution mixes channels.

Parameter count:

\[ C_{\text{in}}K_hK_w + C_{\text{out}}C_{\text{in}}. \]

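When \(K_hK_w\) and \(C_{\text{out}}\) are both large, the savings are substantial: the ratio to a standard convolution is \(1/C_{\text{out}}+1/(K_hK_w)\). This is the core of lightweight models such as MobileNet.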
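
The two counts can be compared directly. This is a small arithmetic sketch with hypothetical helper names and an illustrative 64-to-128-channel \(3\times3\) layer, ignoring bias terms.

```python
def standard_params(c_in, c_out, k_h, k_w):
    return c_out * c_in * k_h * k_w


def separable_params(c_in, c_out, k_h, k_w):
    depthwise = c_in * k_h * k_w      # one k_h x k_w filter per input channel
    pointwise = c_out * c_in          # 1x1 convolution mixing channels
    return depthwise + pointwise


std = standard_params(64, 128, 3, 3)
sep = separable_params(64, 128, 3, 3)
print(std, sep, sep / std)
```

For this configuration the separable version uses roughly 12% of the parameters, dominated by the \(1/(K_hK_w)=1/9\) term.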

Grouped Convolution as Block-Sparse Connectivity

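groups=g can be seen as splitting the channels into \(g\) groups, each connecting only its own inputs to its own outputs. If both \(C_{\text{in}}\) and \(C_{\text{out}}\) are divisible by \(g\), the parameter count is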

\[ N_{\text{param}} = C_{\text{out}}\frac{C_{\text{in}}}{g}K_hK_w. \]

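From a matrix viewpoint, a standard convolution is dense mixing along the channel dimension, while a grouped convolution is block-diagonal mixing. Depthwise convolution is the extreme case \(g=C_{\text{in}}\): it performs only spatial filtering with no cross-channel mixing, so it is usually followed by a 1x1 pointwise conv.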

Definition: Depthwise Multiplier

In depthwise convolution, a depth multiplier \(m\) means each input channel produces \(m\) output channels. Then \(C_{\text{out}}=mC_{\text{in}}\) and groups=C_in.

depthwise = torch.nn.Conv2d(
    in_channels=64,
    out_channels=64,
    kernel_size=3,
    padding=1,
    groups=64,
)
pointwise = torch.nn.Conv2d(64, 128, kernel_size=1)

Convolution as Matrix Multiplication

A convolution can be turned into a matrix multiplication via unfold/im2col. The input patches and the flattened weight have shapes:

patches: [B, C_in*K_h*K_w, H_out*W_out]
weight:  [C_out, C_in*K_h*K_w]

Then:

\[ Y = W\cdot \operatorname{unfold}(X). \]

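# Small illustrative tensors; bias is disabled so the product alone reproduces conv(x).
x = torch.randn(2, 3, 8, 8)
conv = torch.nn.Conv2d(3, 4, kernel_size=3, padding=1, bias=False)
patches = torch.nn.Unfold(kernel_size=3, padding=1)(x)   # [B, C_in*K_h*K_w, H_out*W_out]
w = conv.weight.reshape(conv.out_channels, -1)           # [C_out, C_in*K_h*K_w]
y = (w @ patches).view(2, conv.out_channels, 8, 8)       # same values as conv(x)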

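Real libraries do not always materialize im2col explicitly, because it can take a lot of memory; cuDNN and Triton kernels choose among direct, implicit GEMM, FFT, Winograd, and other strategies depending on the shape.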

im2col Memory Bill

The temporary tensor produced by an explicit unfold has this many elements:

\[ B\cdot C_{\text{in}}K_hK_w\cdot H_{\text{out}}W_{\text{out}}. \]

For example, with \(B=32,C_{\text{in}}=64,H=W=224,K=3\), stride \(1\), and same padding, the element count is about

\[ 32\cdot64\cdot9\cdot224\cdot224 \approx 9.25\times10^8. \]

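Even in FP16 that is about \(1.85\) GB of temporary memory. This explains why im2col + GEMM is clear in textbooks while production kernels often use implicit GEMM: logically it behaves like the unfolded product, but physically the full patch matrix is never materialized.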
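
The same bill can be computed for any layer before deciding whether an explicit unfold is affordable. A minimal sketch, with `im2col_bytes` as a hypothetical helper and 2 bytes per element standing in for FP16:

```python
def im2col_bytes(b, c_in, k_h, k_w, h_out, w_out, bytes_per_elem=2):
    """Size of the explicitly unfolded patch tensor, in elements and bytes."""
    elems = b * c_in * k_h * k_w * h_out * w_out
    return elems, elems * bytes_per_elem


elems, nbytes = im2col_bytes(32, 64, 3, 3, 224, 224)   # the example above, in FP16
print(elems, nbytes / 1e9)
```

The unfolded tensor is \(K_hK_w=9\) times larger than the input activation itself, which is where the cost comes from.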

Backpropagation Through Convolution

Let the gradient of the loss with respect to the output be

\[ G_{b,c_o,i,j} = \frac{\partial L}{\partial Y_{b,c_o,i,j}}. \]

The gradient with respect to the weights is:

\[ \frac{\partial L}{\partial W_{c_o,c_i,a,b}} = \sum_{b_0,i,j} G_{b_0,c_o,i,j} X_{b_0,c_i,i+a,j+b}. \]

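This is itself a cross-correlation between input patches and output gradients. The gradient with respect to the input (written for stride 1 and no dilation, with out-of-range terms taken as zero) is: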

\[ \frac{\partial L}{\partial X_{b_0,c_i,u,v}} = \sum_{c_o,a,b} G_{b_0,c_o,u-a,v-b} W_{c_o,c_i,a,b}, \]

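That is, the output gradient is propagated back to the input through the transpose of the kernel's linear map. A transposed convolution is exactly this linear-algebra relation used as a forward operation.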

Definition: Conv2d Backward Operators

grad_weight correlates input patches with output gradients. grad_input applies the transpose of the convolution’s linear map to output gradients.

Autograd makes a convenient sanity check: a hand-written unfold version and Conv2d should give nearly identical forward outputs and gradients.

import torch
import torch.nn.functional as F


def conv2d_unfold(x, weight, bias=None, padding=0, stride=1):
    bsz = x.shape[0]
    cout = weight.shape[0]
    kh, kw = weight.shape[-2:]
    patches = F.unfold(x, kernel_size=(kh, kw), padding=padding, stride=stride)
    y = weight.reshape(cout, -1) @ patches
    if bias is not None:
        y = y + bias.view(1, -1, 1)
    # Floor division matches the output-size formula for integer padding and stride.
    h_out = (x.shape[-2] + 2 * padding - kh) // stride + 1
    w_out = (x.shape[-1] + 2 * padding - kw) // stride + 1
    return y.view(bsz, cout, h_out, w_out)

Pitfall: Backward Shapes Follow the Forward Choices

Stride, dilation, padding, and groups all affect backward. If a custom convolution forward silently changes padding or channel grouping, gradients can have correct shapes but wrong semantics.
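
One way to run the sanity check described above is sketched here. It assumes PyTorch is installed, redefines a compact unfold-based forward (named `unfold_conv` for this example) so the block is self-contained, and uses float64 with an arbitrary stride-2 configuration so the comparison can be tight.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# float64 keeps rounding error small so default allclose tolerances are safe.
x = torch.randn(2, 3, 8, 8, dtype=torch.float64, requires_grad=True)
w = torch.randn(4, 3, 3, 3, dtype=torch.float64, requires_grad=True)
b = torch.randn(4, dtype=torch.float64, requires_grad=True)


def unfold_conv(x, w, b, padding, stride):
    # im2col + matmul, then fold the flattened positions back into a grid.
    kh, kw = w.shape[-2], w.shape[-1]
    patches = F.unfold(x, kernel_size=(kh, kw), padding=padding, stride=stride)
    y = w.reshape(w.shape[0], -1) @ patches + b.view(1, -1, 1)
    h_out = (x.shape[-2] + 2 * padding - kh) // stride + 1
    w_out = (x.shape[-1] + 2 * padding - kw) // stride + 1
    return y.view(x.shape[0], w.shape[0], h_out, w_out)


y_ref = F.conv2d(x, w, b, padding=1, stride=2)
y_unf = unfold_conv(x, w, b, padding=1, stride=2)

# Feed the same random upstream gradient through both versions.
g = torch.randn_like(y_ref)
grads_ref = torch.autograd.grad(y_ref, (x, w, b), g)
grads_unf = torch.autograd.grad(y_unf, (x, w, b), g)

forward_ok = torch.allclose(y_ref, y_unf)
backward_ok = all(torch.allclose(r, u) for r, u in zip(grads_ref, grads_unf))
print(forward_ok, backward_ok)
```

Using a random upstream gradient rather than a plain sum exercises every output position with a different weight, which makes silent indexing mistakes more likely to show up.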

PyTorch Conv2d

conv = torch.nn.Conv2d(
    in_channels=3,
    out_channels=64,
    kernel_size=3,
    stride=1,
    padding=1,
    dilation=1,
    groups=1,
    bias=True,
)

groups controls channel connectivity:

| groups | Meaning |
|---|---|
| 1 | normal convolution |
| C_in (with C_out = C_in) | depthwise convolution |
| between 1 and C_in | grouped convolution |

Whatever the value of groups, the weight shape is:

[C_out, C_in / groups, K_h, K_w]

This is easy to get wrong when debugging grouped convolutions.
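
A small helper makes the shape rule and the divisibility requirement explicit. This is a sketch with hypothetical names; it only mirrors the shape stated above and does not call into any library.

```python
def grouped_conv_weight_shape(c_in, c_out, k_h, k_w, groups=1):
    """Shape of the weight tensor: [C_out, C_in / groups, K_h, K_w]."""
    if c_in % groups or c_out % groups:
        raise ValueError("c_in and c_out must both be divisible by groups")
    return (c_out, c_in // groups, k_h, k_w)


def num_elements(shape):
    total = 1
    for dim in shape:
        total *= dim
    return total


for g in (1, 4, 64):
    shape = grouped_conv_weight_shape(64, 64, 3, 3, groups=g)
    print(g, shape, num_elements(shape))
```

The first dimension stays at \(C_{\text{out}}\) for every setting; only the second dimension shrinks, down to 1 in the depthwise case.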

Initialization for Convolution

The fan-in of a convolution layer is not \(C_{\text{in}}\) but the number of inputs that one output position actually aggregates:

\[ \operatorname{fan\_in} = \frac{C_{\text{in}}}{g}K_hK_w, \qquad \operatorname{fan\_out} = \frac{C_{\text{out}}}{g}K_hK_w. \]

For ReLU-like activations, Kaiming initialization commonly uses

\[ \operatorname{Var}(W) \approx \frac{2}{\operatorname{fan\_in}}. \]

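The derivation is the same as for an MLP: the goal is to keep the forward activation variance from exploding or vanishing layer by layer, except that the input dimension of each output position is now the size of the local window.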

conv = torch.nn.Conv2d(64, 128, kernel_size=3, padding=1, groups=1)
torch.nn.init.kaiming_normal_(conv.weight, mode="fan_out", nonlinearity="relu")

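fan_in emphasizes forward activation variance, while fan_out emphasizes backward gradient variance; with mode="fan_out" the target variance becomes \(2/\operatorname{fan\_out}\). Many ResNet implementations for CNN classification use fan_out, because gradient propagation in deep residual networks is more sensitive.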
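
To get a feel for the numbers, the fan formulas above can be evaluated for the layer in the snippet. This is a plain-Python sketch with hypothetical helper names; it follows the formulas in the text, and a library's own fan computation may differ from them, particularly when groups > 1, so treat the values as illustrative.

```python
import math


def conv_fans(c_in, c_out, k_h, k_w, groups=1):
    """fan_in and fan_out as defined by the formulas above."""
    fan_in = (c_in // groups) * k_h * k_w
    fan_out = (c_out // groups) * k_h * k_w
    return fan_in, fan_out


def kaiming_std(fan):
    """Standard deviation for Var(W) = 2 / fan (ReLU gain)."""
    return math.sqrt(2.0 / fan)


fan_in, fan_out = conv_fans(64, 128, 3, 3)
print(fan_in, kaiming_std(fan_in))
print(fan_out, kaiming_std(fan_out))
```

Because this layer widens from 64 to 128 channels, fan_out is twice fan_in, and the fan_out mode gives a smaller initial weight scale.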

BatchNorm and Conv-BN Folding

BatchNorm2d normalizes each channel independently. During training:

\[ \hat{x}_{b,c,i,j} = \frac{x_{b,c,i,j}-\mu_c}{\sqrt{\sigma_c^2+\epsilon}}, \qquad y_{b,c,i,j} = \gamma_c\hat{x}_{b,c,i,j}+\beta_c. \]

where \(\mu_c,\sigma_c^2\) are computed over the batch and spatial dimensions:

\[ \mu_c = \frac{1}{BHW}\sum_{b,i,j}x_{b,c,i,j}. \]

Inference uses the running mean and variance. If the preceding layer is a convolution:

\[ z=W*x+b, \qquad y_c= \gamma_c\frac{z_c-\mu_c}{\sqrt{\sigma_c^2+\epsilon}}+\beta_c, \]

then BN can be folded into the convolution:

\[ W'_c = \frac{\gamma_c}{\sqrt{\sigma_c^2+\epsilon}}W_c, \qquad b'_c = \frac{\gamma_c}{\sqrt{\sigma_c^2+\epsilon}}(b_c-\mu_c)+\beta_c. \]

Definition: Conv-BN Folding

Conv-BN folding replaces a Conv2d -> BatchNorm2d pair by one equivalent Conv2d during inference, using BatchNorm running statistics.

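This matters for deployment: after folding there is one fewer memory read/write, and the layer is easier for inference engines to fuse. Folding cannot be applied naively during training, because BN's batch statistics depend on the current mini-batch.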

def fold_conv_bn(conv, bn):
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    w = conv.weight * scale.view(-1, 1, 1, 1)
    if conv.bias is None:
        bias = torch.zeros_like(bn.running_mean)
    else:
        bias = conv.bias
    b = (bias - bn.running_mean) * scale + bn.bias
    return w, b

Pitfall: BatchNorm Depends on Mode

model.train() uses batch statistics and updates running statistics. model.eval() uses stored running statistics. Many validation accuracy jumps are just BatchNorm/Dropout mode bugs.
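
The folding identity can be verified on a single output channel at a single spatial position, where the convolution reduces to a dot product. The sketch below is plain Python with made-up numbers and hypothetical helper names; it checks the eval-mode formula only.

```python
import math


def bn_eval(z, gamma, beta, mean, var, eps=1e-5):
    """BatchNorm in eval mode for one channel, using running statistics."""
    return gamma * (z - mean) / math.sqrt(var + eps) + beta


def fold(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold one channel's BatchNorm into that channel's weights and bias."""
    scale = gamma / math.sqrt(var + eps)
    return [scale * w_i for w_i in w], scale * (b - mean) + beta


# One output channel at one spatial position: z = w . patch + b.
w, b = [0.2, -0.5, 1.0], 0.1
patch = [1.0, 2.0, -3.0]
gamma, beta, mean, var = 1.5, -0.3, 0.4, 2.0

z = sum(w_i * x_i for w_i, x_i in zip(w, patch)) + b
y_two_step = bn_eval(z, gamma, beta, mean, var)

w_f, b_f = fold(w, b, gamma, beta, mean, var)
y_folded = sum(w_i * x_i for w_i, x_i in zip(w_f, patch)) + b_f

print(abs(y_two_step - y_folded) < 1e-12)
```

The check passes only because mean and var are fixed numbers; in train mode they would be recomputed from each mini-batch, and no single set of folded weights could reproduce that.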

Residual Blocks

A basic residual block is written as

\[ y=x+F(x). \]

In backpropagation,

\[ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} + \left(\frac{\partial F}{\partial x}\right)^\top \frac{\partial L}{\partial y}. \]

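Even when the Jacobian of \(F\) is temporarily ill-conditioned, the identity path gives the gradient a direct route. This is the core reason ResNets are easier to train than plain deep CNNs.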

When the spatial size or channel count changes, the two branches cannot be added directly and a projection shortcut is needed:

class BasicBlock(torch.nn.Module):
    def __init__(self, cin, cout, stride):
        super().__init__()
        self.conv1 = torch.nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False)
        self.bn1 = torch.nn.BatchNorm2d(cout)
        self.conv2 = torch.nn.Conv2d(cout, cout, 3, padding=1, bias=False)
        self.bn2 = torch.nn.BatchNorm2d(cout)
        if cin == cout and stride == 1:
            self.proj = torch.nn.Identity()
        else:
            self.proj = torch.nn.Sequential(
                torch.nn.Conv2d(cin, cout, 1, stride=stride, bias=False),
                torch.nn.BatchNorm2d(cout),
            )

    def forward(self, x):
        h = F.relu(self.bn1(self.conv1(x)))
        h = self.bn2(self.conv2(h))
        return F.relu(h + self.proj(x))

Pre-activation ResNet moves BN/ReLU in front of the convolutions, so that the result of the addition is no longer clipped by a final ReLU:

post-activation: conv -> bn -> relu -> conv -> bn -> add -> relu
pre-activation:  bn -> relu -> conv -> bn -> relu -> conv -> add

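For very deep networks, pre-activation is closer to a design in which the identity path is completely clean; for shallower networks the difference may be small, but understanding the structure helps when reading modern vision backbones.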

Upsampling and Transposed Convolution

Common upsampling methods:

| Method | Learnable? | Risk / behavior |
|---|---|---|
| nearest/bilinear interpolate | no | blurry or blocky |
| resize + conv | conv is learnable | more stable |
| transposed conv | yes | checkerboard artifacts |

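Transposed convolution is not a true "deconvolution"; it is the transpose of the matrix of the linear map that an ordinary convolution applies to its input. It can learn to upsample, but a poorly chosen stride/kernel combination causes uneven overlap.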

Transposed Convolution Output Size

In one dimension, the output length of ConvTranspose is

\[ n_{\text{out}} = (n_{\text{in}}-1)s -2p +d(k-1) +\operatorname{output\_padding} +1. \]

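Here output_padding does not pad the output with zeros at the end; it selects one of several possible output shapes. It only resolves the shape ambiguity and does nothing about checkerboard artifacts.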
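
The shape ambiguity is easiest to see by pairing the forward and transposed formulas. A minimal sketch with hypothetical helper names and an illustrative stride-2 configuration:

```python
def conv_out_len(n, k, p=0, s=1, d=1):
    return (n + 2 * p - (d * (k - 1) + 1)) // s + 1


def conv_transpose_out_len(n_in, k, p=0, s=1, d=1, output_padding=0):
    return (n_in - 1) * s - 2 * p + d * (k - 1) + output_padding + 1


# Two different input lengths collapse to the same output length under stride 2 ...
print(conv_out_len(7, k=3, p=1, s=2), conv_out_len(8, k=3, p=1, s=2))

# ... so the transposed layer needs output_padding to say which one to recover.
print(conv_transpose_out_len(4, k=3, p=1, s=2, output_padding=0))
print(conv_transpose_out_len(4, k=3, p=1, s=2, output_padding=1))
```

The floor in the forward formula discards information about the input length, and output_padding is the extra argument that restores it.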

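Why do checkerboard artifacts appear? With stride \(s>1\), a transposed convolution can be understood as inserting \(s-1\) zeros between input positions and then applying an ordinary convolution. If the kernel size is not divisible by the stride, different output positions are covered a different number of times: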

stride=2, kernel=3
coverage pattern: 1,2,1,2,1,2,...

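As a result, some pixels inherently receive more accumulated terms than others. Common mitigations:

  1. interpolate(..., mode="nearest" or "bilinear") + Conv2d;
  2. choose a kernel size that is divisible by the stride;
  3. in generative models, combine with normalization and anti-aliasing design;
  4. use pixel shuffle / sub-pixel convolution.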
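
The uneven coverage discussed above can be counted directly by viewing the transposed convolution as a scatter: each input element writes to \(k\) consecutive output positions starting at index \(i\cdot s\). The sketch below assumes one dimension, no padding, and dilation 1; `coverage_1d` is a hypothetical name.

```python
def coverage_1d(n_in, k, s):
    """How many (input, kernel-tap) pairs land on each output position, no padding."""
    n_out = (n_in - 1) * s + k
    counts = [0] * n_out
    for i in range(n_in):            # each input element ...
        for a in range(k):           # ... is scattered to k consecutive outputs
            counts[i * s + a] += 1
    return counts


print(coverage_1d(4, k=3, s=2))      # kernel size not divisible by stride
print(coverage_1d(4, k=4, s=2))      # kernel size divisible by stride
```

With \(k=3, s=2\) the interior alternates between one and two contributions, while \(k=4, s=2\) gives a uniform interior; the border positions are covered less in both cases.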

Pixel Shuffle

Pixel shuffle first uses a convolution to produce \(r^2C\) channels, then rearranges the channels into spatial resolution:

[B, C*r*r, H, W] -> [B, C, H*r, W*r]

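It involves no zero insertion and is commonly used for super-resolution. Its key assumption is that the channel dimension has already learned the content of each sub-pixel.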
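
The rearrangement itself is pure indexing. Here is a sketch on nested Python lists for a single image; the channel ordering (sub-pixel row, then sub-pixel column, within each output channel) is an assumption meant to follow the usual sub-pixel convolution convention, and `pixel_shuffle` is a hypothetical stand-in rather than a library call.

```python
def pixel_shuffle(x, r):
    """Rearrange one image from [C*r*r, H, W] to [C, H*r, W*r] (nested lists)."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):           # sub-pixel row offset
            for j in range(r):       # sub-pixel column offset
                src = x[c * r * r + i * r + j]
                for y in range(h):
                    for z in range(w):
                        out[c][y * r + i][z * r + j] = src[y][z]
    return out


# Four 1x1 channels become one 2x2 image.
x = [[[10]], [[20]], [[30]], [[40]]]
print(pixel_shuffle(x, 2))
```

Every output pixel is copied from exactly one input element, so coverage is uniform by construction.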

up = torch.nn.Sequential(
    torch.nn.Conv2d(64, 64 * 4, kernel_size=3, padding=1),
    torch.nn.PixelShuffle(upscale_factor=2),
)

Pitfall: Upsampling Must Match the Label Grid

For segmentation and restoration, a one-pixel alignment error can look like poor model quality. Track stride, padding, and crop conventions from input image to label grid.

Implementation Checklist

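When writing a CNN, check:

  1. whether the input layout is [B,C,H,W];
  2. whether Conv2d.weight is read as [C_out,C_in/groups,K_h,K_w];
  3. whether the output size uses the effective kernel after dilation;
  4. whether the padding matches the task's boundary assumption;
  5. whether stride/pooling discards spatial resolution too early;
  6. whether the receptive field covers the context the task needs;
  7. whether the channel counts for grouped/depthwise conv are divisible by groups;
  8. whether causal conv pads only on the left;
  9. whether upsampling avoids checkerboard artifacts;
  10. whether BatchNorm has the right semantics in train/eval mode;
  11. whether Conv-BN folding is used only for inference;
  12. whether both sides of a residual add agree in shape, stride, and dtype;
  13. whether channels_last, cuDNN benchmark, and AMP have been validated by profiling;
  14. whether custom convolution/upsampling has been checked numerically against a reference for both forward and gradient.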