2.3 Convolutional Neural Networks
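The core of a CNN is not that "image models use convolution" but two inductive biases:
- locality: nearby pixels are more strongly correlated than distant ones;
- translation equivariance: shifting the input shifts the feature map by the same amount.
Together, these two assumptions compress one huge fully connected layer into a small shared convolution kernel, so images, speech, local sequences, and similar data can be modeled with far fewer parameters.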
From Fully Connected to Convolution
Let the input image be \(X\in\mathbb{R}^{H\times W}\) and the output feature map be \(Y\in\mathbb{R}^{H'\times W'}\). A fully general linear layer can be written as:
\[ Y_{i,j} = b_{i,j} + \sum_{u,v} W_{i,j,u,v}X_{u,v}. \]
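Here every output position \((i,j)\) has its own set of weights, so the parameter count is on the order of \(H'W'HW\). To introduce locality, reindex with relative offsets \(u=i+a\), \(v=j+b\) and keep only a small window of offsets \((a,b)\):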
\[ Y_{i,j} = b_{i,j} + \sum_{a,b} V_{i,j,a,b}X_{i+a,j+b}. \]
Next, impose translation equivariance by making the weights independent of absolute position:
\[ V_{i,j,a,b}=K_{a,b}, \qquad b_{i,j}=b. \]
This gives the convolution (strictly speaking, cross-correlation) form:
\[ Y_{i,j} = b+ \sum_{a,b} K_{a,b}X_{i+a,j+b}. \]
A mapping \(F\) is translation equivariant if translating the input translates the output: \(F(T_\delta x)=T_\delta F(x)\).
For infinite or properly padded grids, the cross-correlation operator
\[ (F_KX)_{i,j}=\sum_{a,b}K_{a,b}X_{i+a,j+b} \]
satisfies \(F_K(T_\delta X)=T_\delta(F_KX)\).
Proof
Let the translation operator satisfy \((T_\delta X)_{i,j}=X_{i+\delta_1,j+\delta_2}\). Then
\[ \begin{aligned} (F_K(T_\delta X))_{i,j} &= \sum_{a,b}K_{a,b}(T_\delta X)_{i+a,j+b}\\ &= \sum_{a,b}K_{a,b}X_{i+a+\delta_1,j+b+\delta_2}\\ &= (F_KX)_{i+\delta_1,j+\delta_2}\\ &= (T_\delta F_KX)_{i,j}. \end{aligned} \]
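What matters is not the particular values in the kernel but that the same \(K\) is shared across all positions. If each position had its own \(K_{i,j}\), the equality above would generally fail.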
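The equivariance property can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses circular padding so the grid is periodic (one way to realize the "properly padded" condition), and the helper names circ_corr and shift are hypothetical, not part of any library.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 8, 8)   # [B, C, H, W]
k = torch.randn(1, 1, 3, 3)   # one kernel shared by every position

def circ_corr(x, k):
    # Circular padding makes the grid periodic, so there are no border effects.
    xp = F.pad(x, (1, 1, 1, 1), mode="circular")
    return F.conv2d(xp, k)

def shift(t, d1, d2):
    # (T_delta X)[i, j] = X[i + d1, j + d2], with wrap-around.
    return torch.roll(t, shifts=(-d1, -d2), dims=(-2, -1))

lhs = circ_corr(shift(x, 2, 3), k)   # F_K(T_delta X)
rhs = shift(circ_corr(x, k), 2, 3)   # T_delta(F_K X)
equivariant = torch.allclose(lhs, rhs, atol=1e-6)
```

With zero padding instead of circular padding, the two sides would agree only away from the border.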
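The parameter savings of a CNN come from weight sharing. If the output has \(H'W'\) positions and the kernel has size \(k_h\times k_w\), a locally connected layer with unshared weights needs about \(H'W'k_hk_w\) parameters, while a shared convolution kernel needs only \(k_hk_w\).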
Cross-Correlation vs. Mathematical Convolution
Conv2d in deep learning libraries usually implements cross-correlation:
\[ Y_{i,j} = \sum_{a,b}K_{a,b}X_{i+a,j+b}, \]
Mathematical convolution flips the kernel:
\[ Y_{i,j} = \sum_{a,b}K_{a,b}X_{i-a,j-b}. \]
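This does not affect what the layer can learn: the kernel is a learnable parameter, so training simply learns the flipped kernel if that is what the task needs.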
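A small numerical comparison makes the distinction concrete. This is a sketch with arbitrary tensor sizes; the reference loops are written out by hand and are only meant to mirror the two formulas above in valid mode (no padding), where the flipped version is indexed from the far corner of each window.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 6, 6)
k = torch.randn(1, 1, 3, 3)

# F.conv2d slides the kernel without flipping it (cross-correlation).
y_corr = F.conv2d(x, k)
# Flipping both spatial axes of the kernel gives mathematical convolution.
y_conv = F.conv2d(x, torch.flip(k, dims=(-2, -1)))

corr_ref = torch.zeros(4, 4)
conv_ref = torch.zeros(4, 4)
for i in range(4):
    for j in range(4):
        patch = x[0, 0, i:i + 3, j:j + 3]
        # Cross-correlation: sum_{a,b} K[a, b] * X[i + a, j + b].
        corr_ref[i, j] = (patch * k[0, 0]).sum()
        # Convolution: sum_{a,b} K[a, b] * X[i + 2 - a, j + 2 - b].
        conv_ref[i, j] = (torch.flip(patch, dims=(0, 1)) * k[0, 0]).sum()

matches_corr = torch.allclose(y_corr[0, 0], corr_ref, atol=1e-5)
matches_conv = torch.allclose(y_conv[0, 0], conv_ref, atol=1e-5)
```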

    import torch
    import torch.nn.functional as F

    def corr2d(x, k):
        # Valid-mode 2D cross-correlation of a single-channel image x with kernel k.
        kh, kw = k.shape
        oh = x.shape[0] - kh + 1
        ow = x.shape[1] - kw + 1
        y = torch.empty(oh, ow)
        for i in range(oh):
            for j in range(ow):
                y[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
        return y

Output Size
For a single spatial dimension, let the input length be \(n\), the kernel size \(k\), the padding \(p\), the stride \(s\), and the dilation \(d\). The effective kernel size is:
\[ k_{\text{eff}} = d(k-1)+1. \]
The output length is:
\[ n_{\text{out}} = \left\lfloor \frac{n+2p-k_{\text{eff}}}{s} \right\rfloor +1. \]
Dilation changes the effective kernel size. The output-size formula must use \(d(k-1)+1\), not just \(k\).
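The formula translates directly into a small helper. The function name conv_out_len and the example configurations are illustrative choices, not taken from any library.

```python
def conv_out_len(n, k, p=0, s=1, d=1):
    """Output length of a 1D convolution along one spatial dimension."""
    k_eff = d * (k - 1) + 1              # dilation widens the kernel footprint
    return (n + 2 * p - k_eff) // s + 1  # floor division implements the floor

stem = conv_out_len(224, k=7, p=3, s=2)           # typical 7x7 stride-2 stem
dilated_same = conv_out_len(32, k=3, p=2, d=2)    # p = d(k-1)/2 keeps the length
dilated_short = conv_out_len(32, k=3, p=1, d=2)   # padding sized for d=1 loses 2
```

Under these settings the three values come out as 112, 32, and 30, which shows how forgetting the dilation in the padding silently shrinks the output.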
Padding Semantics
Padding is not purely a shape trick; it implies a boundary condition. Common choices:
| Padding | Boundary assumption | Effect |
|---|---|---|
| zero | image outside boundary is black / absent | simplest, but border statistics change |
| reflect | boundary mirrors back | common in restoration tasks |
| replicate | boundary value repeats | preserves edge value, can create flat borders |
| circular | periodic boundary | useful for synthetic periodic signals |
For an odd kernel with stride \(1\) and dilation \(d\), keeping \(n_{\text{out}}=n\) usually means taking
\[ p=\frac{d(k-1)}{2}. \]
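When \(d(k-1)\) is odd (even \(k\) with odd \(d\)), the padding cannot be split evenly between the two sides, so asymmetric padding is required. An integer Conv2d(padding=p) in PyTorch can only express symmetric padding; for the asymmetric case, call F.pad explicitly: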
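
    import torch.nn.functional as F

    def conv2d_same_asym(x, conv, pad_left, pad_right, pad_top, pad_bottom):
        # F.pad takes the last dimension first: (left, right, top, bottom).
        x = F.pad(x, (pad_left, pad_right, pad_top, pad_bottom))
        return conv(x)

The label "same" only describes the output shape; it says nothing about information flow. A causal convolution for sequence tasks must pad on the left only, whereas symmetric padding in image tasks is what corresponds to a local window centered on the current pixel.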
Channels and Parameter Count
Real images have channels. Input:
X: [B, C_in, H, W]
Convolution weight:
W: [C_out, C_in, K_h, K_w]
Output:
Y: [B, C_out, H_out, W_out]
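Formula (here \(b_0\) indexes the sample in the batch, \(b_{c_o}\) is the bias, and \(a,b\) are kernel offsets):
\[ Y_{b_0,c_o,i,j} = b_{c_o} + \sum_{c_i=1}^{C_{\text{in}}} \sum_{a,b} W_{c_o,c_i,a,b} X_{b_0,c_i,i+a,j+b}. \]
Parameter count: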
\[ N_{\text{param}} = C_{\text{out}}C_{\text{in}}K_hK_w + \mathbf{1}_{\text{bias}}C_{\text{out}}. \]
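A 1x1 convolution is a channel-mixing map shared across every spatial position. With \(W\in\mathbb{R}^{C_{\text{out}}\times C_{\text{in}}}\), bias vector \(b\in\mathbb{R}^{C_{\text{out}}}\), and \(b_0\) indexing the batch:
\[ Y_{b_0,:,i,j} = WX_{b_0,:,i,j}+b. \]
It mixes channels but not spatial positions, and is commonly used to widen or narrow the channel dimension and in bottleneck blocks.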
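Both statements can be checked against PyTorch. The layer sizes below are arbitrary illustrations; the einsum is just one way to write "the same matrix applied at every pixel".

```python
import torch

torch.manual_seed(0)

# Parameter count: C_out * C_in * K_h * K_w + C_out (bias).
conv = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=True)
n_param = sum(p.numel() for p in conv.parameters())   # 64*3*3*3 + 64

# A 1x1 convolution applies one [C_out, C_in] matrix at every spatial position.
pw = torch.nn.Conv2d(16, 32, kernel_size=1)
x = torch.randn(2, 16, 5, 5)
w = pw.weight.view(32, 16)
y_lin = torch.einsum("oc,bchw->bohw", w, x) + pw.bias.view(1, -1, 1, 1)
same = torch.allclose(pw(x), y_lin, atol=1e-5)
```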
Tensor Layout and Memory Format
PyTorch's logical layout is usually written [B, C, H, W], but the underlying memory can be contiguous NCHW or the channels_last (NHWC-like) memory format. The two tensors have the same shape but different strides:
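
    import torch

    x = torch.randn(8, 64, 56, 56)
    x_cl = x.to(memory_format=torch.channels_last)
    print(x.stride())     # (200704, 3136, 56, 1)
    print(x_cl.stride())  # (200704, 1, 3584, 64)

The convolution sees the same mathematical object either way; performance differences come from the memory access pattern and the underlying kernel. Rules of thumb: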
| Setting | Often faster with |
|---|---|
| CUDA + AMP/BF16/FP16 CNN | channels_last |
| CPU small tensors | benchmark required |
| custom ops expecting contiguous NCHW | plain contiguous |
x.shape == x_cl.shape does not imply the same memory format. A custom CUDA/Triton op that reads raw memory may silently assume contiguous NCHW, while a .view(...) call raises an error when the strides are incompatible. Prefer .reshape(...) when the layout may be non-contiguous, and profile before changing the global memory format.
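The following sketch illustrates the pitfall on a tiny tensor (the shape is an arbitrary choice so the strides are easy to read). It relies only on standard tensor methods; the variable names are illustrative.

```python
import torch

x = torch.randn(2, 3, 4, 5)                      # contiguous NCHW
x_cl = x.to(memory_format=torch.channels_last)   # same shape, NHWC-like strides

strides = (x.stride(), x_cl.stride())
is_nchw = x_cl.is_contiguous()
is_nhwc = x_cl.is_contiguous(memory_format=torch.channels_last)

try:
    x_cl.view(2, -1)      # flattening C, H, W needs mergeable strides
    view_ok = True
except RuntimeError:
    view_ok = False

flat = x_cl.reshape(2, -1)   # falls back to a copy when a view is impossible
```

Here reshape returns the values in logical [B, C, H, W] order, so downstream code sees the same numbers regardless of memory format.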
Receptive Field
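The receptive field describes how large a region of the input an output position can see. Let layer \(\ell\) have kernel size \(k_\ell\), stride \(s_\ell\), and dilation \(d_\ell\). Define the jump \(j_\ell\) as the spacing, measured in input pixels, between adjacent output positions, and the receptive field \(r_\ell\) as the extent of input covered by a single output position: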
\[ j_\ell=j_{\ell-1}s_\ell, \qquad r_\ell=r_{\ell-1}+(k_\ell-1)d_\ell j_{\ell-1}. \]
Initial values:
\[ j_0=1,\qquad r_0=1. \]
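For example, two stacked \(3\times3\) stride-1 convolutions have a \(5\times5\) receptive field, not \(6\times6\): the windows overlap, so the second layer adds only \(k-1=2\) pixels per dimension.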
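The same number can be measured empirically with autograd. This is a one-dimensional sketch under simple assumptions: all weights are set to one so that no contributions cancel, and the input length and the chosen output index are arbitrary.

```python
import torch

# Two stride-1, kernel-3 Conv1d layers with all-ones weights.
net = torch.nn.Sequential(
    torch.nn.Conv1d(1, 1, 3, bias=False),
    torch.nn.Conv1d(1, 1, 3, bias=False),
)
for p in net.parameters():
    torch.nn.init.ones_(p)

x = torch.zeros(1, 1, 11, requires_grad=True)
y = net(x)                 # length 11 - 2 - 2 = 7
y[0, 0, 3].backward()      # gradient of a single output position

# Inputs with non-zero gradient are exactly the ones this output can see.
rf = int((x.grad != 0).sum())
```

The measured value agrees with the recurrence: \(r_1=1+2=3\), \(r_2=3+2=5\).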
Receptive Field Center
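Knowing \(r_\ell\) alone is not enough; when debugging detection or segmentation networks you also need to know where each output grid point is centered in the original image. In one dimension, let \(a_\ell\) be the input coordinate of the center of output position \(0\) at layer \(\ell\). With padding \(p_\ell\),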
\[ a_\ell = a_{\ell-1} + \left(\frac{k_{\text{eff},\ell}-1}{2}-p_\ell\right)j_{\ell-1}, \qquad k_{\text{eff},\ell}=d_\ell(k_\ell-1)+1. \]
The initial value \(a_0=0.5\) denotes the center of the first pixel. The full recurrence is:
\[ \begin{aligned} j_\ell &= j_{\ell-1}s_\ell,\\ r_\ell &= r_{\ell-1}+(k_{\text{eff},\ell}-1)j_{\ell-1},\\ a_\ell &= a_{\ell-1}+\left(\frac{k_{\text{eff},\ell}-1}{2}-p_\ell\right)j_{\ell-1}. \end{aligned} \]
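
    def rf_table(layers):
        # layers: list of dicts with key k (required) and s, p, d (optional)
        jump, field, start = 1, 1, 0.5
        rows = []
        for layer in layers:
            k = layer["k"]
            s = layer.get("s", 1)
            p = layer.get("p", 0)
            d = layer.get("d", 1)
            k_eff = d * (k - 1) + 1
            # start and field use the previous layer's jump, so jump is updated last.
            start = start + ((k_eff - 1) / 2 - p) * jump
            field = field + (k_eff - 1) * jump
            jump = jump * s
            rows.append({"jump": jump, "rf": field, "start": start})
        return rows

Intuitively, jump is how many input pixels one step on the feature map corresponds to, rf is the size of the window each cell sees, and start shows whether the grid is offset by half a cell. Many dense-prediction bugs are not the model "failing to converge" but the output and the label being misaligned after some combination of upsampling, padding, and stride.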
Pooling
Pooling aggregates over a fixed window:
\[ Y_{i,j} = \max_{(a,b)\in\mathcal{W}}X_{i+a,j+b} \]
or, for average pooling:
\[ Y_{i,j} = \frac{1}{|\mathcal{W}|} \sum_{(a,b)\in\mathcal{W}}X_{i+a,j+b}. \]
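MaxPool keeps the strongest local response; AvgPool acts more like a low-pass smoother. Modern networks also often replace pooling with strided convolution so that the downsampling is learnable as well.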
Dilation and Causal CNN
Dilated convolution inserts gaps between kernel elements. It enlarges the receptive field without adding parameters:
k=3, dilation=2 sees positions [i-2, i, i+2]
In sequence modeling, a causal convolution must ensure that output position \(t\) depends only on \(x_{\leq t}\):
\[ y_t = \sum_{a=0}^{k-1} w_a x_{t-a}. \]
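Implementations usually pad on the left only; with dilation \(d\) the taps become \(x_{t-da}\) and the padding length is \(d(k-1)\):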
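
    import torch
    import torch.nn.functional as F

    class CausalConv1d(torch.nn.Module):
        def __init__(self, channels, kernel_size, dilation):
            super().__init__()
            # Left padding of d * (k - 1) keeps the output as long as the input.
            self.pad = dilation * (kernel_size - 1)
            self.conv = torch.nn.Conv1d(
                channels,
                channels,
                kernel_size,
                dilation=dilation,
            )

        def forward(self, x):
            # Pad on the left only, so position t never sees x[t+1:].
            x = F.pad(x, (self.pad, 0))
            return self.conv(x)

Depthwise Separable Convolution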
Parameter count of a standard convolution (ignoring bias):
\[ C_{\text{out}}C_{\text{in}}K_hK_w. \]
A depthwise separable convolution splits this into:
- depthwise: each input channel gets its own spatial convolution;
- pointwise: a \(1\times1\) convolution mixes channels.
Parameter count:
\[ C_{\text{in}}K_hK_w + C_{\text{out}}C_{\text{in}}. \]
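When both \(K_hK_w\) and \(C_{\text{out}}\) are reasonably large, the savings are substantial: the ratio to a standard convolution is \(1/C_{\text{out}}+1/(K_hK_w)\). This is the core building block of lightweight models such as MobileNet.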
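As a concrete instance of the two formulas, the sketch below counts parameters for one arbitrary configuration (64 input channels, 128 output channels, 3x3 kernel, no bias). The helper count is a hypothetical convenience function.

```python
import torch

cin, cout, k = 64, 128, 3

standard = torch.nn.Conv2d(cin, cout, k, padding=1, bias=False)
depthwise = torch.nn.Conv2d(cin, cin, k, padding=1, groups=cin, bias=False)
pointwise = torch.nn.Conv2d(cin, cout, 1, bias=False)

def count(*mods):
    # Total number of learnable scalars across the given modules.
    return sum(p.numel() for m in mods for p in m.parameters())

n_standard = count(standard)               # C_out * C_in * K * K
n_separable = count(depthwise, pointwise)  # C_in * K * K + C_out * C_in
ratio = n_separable / n_standard           # equals 1/C_out + 1/K^2
```

For this configuration the separable pair uses 8768 parameters against 73728, roughly 12% of the standard layer.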
Grouped Convolution as Block-Sparse Connectivity
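groups=g can be viewed as splitting the channels into \(g\) groups, each connected only to its own inputs and outputs. If \(g\) divides both \(C_{\text{in}}\) and \(C_{\text{out}}\), the parameter count (ignoring bias) is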
\[ N_{\text{param}} = C_{\text{out}}\frac{C_{\text{in}}}{g}K_hK_w. \]
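In matrix terms, a standard convolution is dense mixing along the channel dimension, while a grouped convolution is block-diagonal mixing. Depthwise convolution is the extreme case \(g=C_{\text{in}}\); it only filters spatially and does no cross-channel mixing, so it is usually followed by a 1x1 pointwise conv.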
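The block-diagonal view can be verified by running each group as its own small convolution and concatenating the results. The sizes (2 groups, 4 input and 6 output channels) are arbitrary illustrations.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
g, cin, cout = 2, 4, 6
conv = torch.nn.Conv2d(cin, cout, 3, padding=1, groups=g, bias=False)
x = torch.randn(1, cin, 5, 5)

# Weight shape is [C_out, C_in / g, K_h, K_w]: each output channel only
# holds filters for the input channels of its own group.
weight_shape = tuple(conv.weight.shape)

# One independent convolution per group, concatenated along channels.
xs = x.chunk(g, dim=1)
ws = conv.weight.chunk(g, dim=0)
y_blocks = torch.cat(
    [F.conv2d(xi, wi, padding=1) for xi, wi in zip(xs, ws)], dim=1
)
same = torch.allclose(conv(x), y_blocks, atol=1e-5)
```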
In depthwise convolution, a depth multiplier \(m\) means each input channel produces \(m\) output channels. Then \(C_{\text{out}}=mC_{\text{in}}\) and groups=C_in.
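
    import torch

    # Depthwise: one 3x3 filter per channel (groups == in_channels).
    depthwise = torch.nn.Conv2d(
        in_channels=64,
        out_channels=64,
        kernel_size=3,
        padding=1,
        groups=64,
    )
    # Pointwise: 1x1 convolution that mixes the 64 channels into 128.
    pointwise = torch.nn.Conv2d(64, 128, kernel_size=1)

Convolution as Matrix Multiplication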
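Convolution can be turned into a matrix multiplication via unfold/im2col. The unfolded input patches and the flattened weight have shapes:
patches: [B, C_in*K_h*K_w, H_out*W_out]
weight: [C_out, C_in*K_h*K_w]
Then: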
\[ Y = W\cdot \operatorname{unfold}(X). \]
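
    import torch

    x = torch.randn(2, 3, 16, 16)                            # example input (assumed shape)
    conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1)   # example layer (assumed sizes)
    patches = torch.nn.Unfold(kernel_size=3, padding=1)(x)   # [B, C_in*K_h*K_w, H_out*W_out]
    w = conv.weight.reshape(conv.out_channels, -1)           # [C_out, C_in*K_h*K_w]
    y = w @ patches                                          # [B, C_out, H_out*W_out], bias not added

Real libraries do not always materialize im2col explicitly, because it can take a lot of memory; cuDNN and Triton kernels choose among strategies such as direct convolution, implicit GEMM, FFT, and Winograd depending on the shape.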
im2col Memory Bill
The temporary tensor produced by an explicit unfold has size
\[ B\cdot C_{\text{in}}K_hK_w\cdot H_{\text{out}}W_{\text{out}}. \]
For example, with \(B=32\), \(C_{\text{in}}=64\), \(H=W=224\), \(K=3\), stride \(1\), and same padding, the element count is about
\[ 32\cdot64\cdot9\cdot224\cdot224 \approx 9.25\times10^8. \]
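Even in FP16 that is about \(1.85\) GB of temporary memory. This explains why im2col + GEMM is the clear textbook presentation while production kernels often use implicit GEMM: logically the input is unfolded, but the full patch matrix is never physically materialized.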
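The arithmetic is easy to reproduce. The sketch assumes 2 bytes per FP16 element and decimal gigabytes (\(10^9\) bytes); the variable names are illustrative.

```python
B, C_in, K = 32, 64, 3
H_out = W_out = 224            # stride 1 with same padding keeps 224 x 224

n_elem = B * C_in * K * K * H_out * W_out   # elements in the patch matrix
gb_fp16 = n_elem * 2 / 1e9                  # 2 bytes per FP16 element

# The unfolded tensor repeats every input element K * K times.
input_elem = B * C_in * 224 * 224
blowup = n_elem / input_elem
```

The ninefold blow-up relative to the input is the cost that implicit GEMM avoids.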
Backpropagation Through Convolution
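Let the gradient of the loss with respect to the output be
\[ G_{b_0,c_o,i,j} = \frac{\partial L}{\partial Y_{b_0,c_o,i,j}}, \]
where \(b_0\) indexes the batch so that \(b\) stays free for the kernel offset. The gradient with respect to the weights is: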
\[ \frac{\partial L}{\partial W_{c_o,c_i,a,b}} = \sum_{b_0,i,j} G_{b_0,c_o,i,j} X_{b_0,c_i,i+a,j+b}. \]
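This is itself a cross-correlation: the input patches are correlated with the output gradient. The gradient with respect to the input (for stride \(1\), with out-of-range entries of \(G\) taken as zero) is: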
\[ \frac{\partial L}{\partial X_{b_0,c_i,u,v}} = \sum_{c_o,a,b} G_{b_0,c_o,u-a,v-b} W_{c_o,c_i,a,b}, \]
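that is, the output gradient is propagated back to the input through the transpose of the linear map defined by the kernel. A so-called transposed convolution is exactly the forward version of this linear-algebra relationship.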
grad_weight correlates input patches with output gradients. grad_input applies the transpose of the convolution’s linear map to output gradients.
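Both gradient formulas can be compared against autograd. The sketch below assumes stride 1, no padding, and groups 1, with arbitrary tensor sizes; the axis-swapping trick for the weight gradient is one possible way to express the correlation with existing functional calls.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 3, 8, 8, requires_grad=True)
w = torch.randn(4, 3, 3, 3, requires_grad=True)
y = F.conv2d(x, w)                     # stride 1, no padding
g = torch.randn_like(y)                # stand-in for dL/dY
gx, gw = torch.autograd.grad(y, (x, w), grad_outputs=g)

# grad_input: transposed convolution of G with the same weights.
gx_manual = F.conv_transpose2d(g, w)

# grad_weight: correlate the input with G. Swapping the batch and channel
# axes makes conv2d sum over the batch and slide G over the input.
gw_manual = F.conv2d(
    x.detach().transpose(0, 1), g.transpose(0, 1)
).transpose(0, 1)

gx_ok = torch.allclose(gx, gx_manual, atol=1e-4)
gw_ok = torch.allclose(gw, gw_manual, atol=1e-4)
```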
Autograd gives a sanity check: a hand-written unfold version and Conv2d should produce nearly identical forward outputs and gradients.

    import torch
    import torch.nn.functional as F

    def conv2d_unfold(x, weight, bias=None, padding=0, stride=1):
        # Assumes integer padding/stride, dilation 1, and groups 1.
        bsz = x.shape[0]
        cout = weight.shape[0]
        kh, kw = weight.shape[-2:]
        patches = F.unfold(x, kernel_size=(kh, kw), padding=padding, stride=stride)
        y = weight.reshape(cout, -1) @ patches
        if bias is not None:
            y = y + bias.view(1, -1, 1)
        # Floor division matches the output-size formula.
        h_out = (x.shape[-2] + 2 * padding - kh) // stride + 1
        w_out = (x.shape[-1] + 2 * padding - kw) // stride + 1
        return y.view(bsz, cout, h_out, w_out)

Stride, dilation, padding, and groups all affect backward. If a custom convolution forward silently changes padding or channel grouping, gradients can have correct shapes but wrong semantics.
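A self-contained version of that sanity check might look as follows. The unfold path is written inline so the block runs on its own; the layer sizes, the stride of 2, and the random upstream gradient g are arbitrary choices for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
conv = torch.nn.Conv2d(3, 5, kernel_size=3, padding=1, stride=2)
x1 = torch.randn(2, 3, 9, 9, requires_grad=True)
x2 = x1.detach().clone().requires_grad_(True)

y_ref = conv(x1)                                   # [2, 5, 5, 5]

# Unfold path: patches [B, C_in*K_h*K_w, L], then a broadcast matmul.
patches = F.unfold(x2, kernel_size=3, padding=1, stride=2)
y_unf = conv.weight.reshape(5, -1) @ patches + conv.bias.view(1, -1, 1)
y_unf = y_unf.view_as(y_ref)

# Use a random upstream gradient so the comparison is not trivially symmetric.
g = torch.randn_like(y_ref)
gx_ref, gw_ref = torch.autograd.grad((y_ref * g).sum(), (x1, conv.weight))
gx_unf, gw_unf = torch.autograd.grad((y_unf * g).sum(), (x2, conv.weight))

forward_ok = torch.allclose(y_ref, y_unf, atol=1e-5)
grad_ok = (torch.allclose(gx_ref, gx_unf, atol=1e-4)
           and torch.allclose(gw_ref, gw_unf, atol=1e-4))
```

Comparing with a tolerance rather than exact equality matters here, since the two paths accumulate floating-point sums in different orders.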
PyTorch Conv2d

    import torch

    conv = torch.nn.Conv2d(
        in_channels=3,
        out_channels=64,
        kernel_size=3,
        stride=1,
        padding=1,
        dilation=1,
        groups=1,
        bias=True,
    )

groups controls how channels are connected:
| groups | Meaning |
|---|---|
| 1 | normal convolution |
| C_in with C_out=C_in | depthwise convolution |
| between | grouped convolution |
The weight shape is always:
[C_out, C_in / groups, K_h, K_w]
This is easy to get wrong when debugging grouped convolutions.
Initialization for Convolution
The fan-in of a convolution layer is not \(C_{\text{in}}\) but the number of inputs that one output position actually aggregates:
\[ \operatorname{fan\_in} = \frac{C_{\text{in}}}{g}K_hK_w, \qquad \operatorname{fan\_out} = \frac{C_{\text{out}}}{g}K_hK_w. \]
For ReLU-like activations, Kaiming initialization commonly uses
\[ \operatorname{Var}(W) \approx \frac{2}{\operatorname{fan\_in}}. \]
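The derivation is the same as for an MLP: the goal is to keep the forward activation variance from exploding or vanishing layer by layer, except that the input dimension of each output position is now the local window size times the number of input channels in its group.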
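
    import torch

    conv = torch.nn.Conv2d(64, 128, kernel_size=3, padding=1, groups=1)
    torch.nn.init.kaiming_normal_(conv.weight, mode="fan_out", nonlinearity="relu")

fan_in puts the emphasis on forward activation variance, while fan_out puts it on backward gradient variance. Many ResNet implementations for image classification use fan_out, on the grounds that gradient propagation in deep residual networks is the more sensitive of the two.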
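The effect of the initializer can be checked empirically. This sketch uses groups=1, where the fan formulas above are unambiguous, and compares the sample standard deviation of the weights with \(\sqrt{2/\operatorname{fan\_out}}\); the 5% tolerance mentioned below is an arbitrary choice.

```python
import math
import torch

torch.manual_seed(0)
conv = torch.nn.Conv2d(64, 128, kernel_size=3, padding=1)
torch.nn.init.kaiming_normal_(conv.weight, mode="fan_out", nonlinearity="relu")

fan_in = 64 * 3 * 3      # C_in / g * K_h * K_w with g = 1
fan_out = 128 * 3 * 3    # C_out / g * K_h * K_w with g = 1

target_std = math.sqrt(2.0 / fan_out)
empirical_std = conv.weight.std().item()
rel_err = abs(empirical_std / target_std - 1.0)
```

With about 74k samples the relative error should be well under 5%. One caveat worth verifying for grouped layers: PyTorch's initializers appear to derive both fans from the weight tensor's shape, so with groups > 1 the fan_out they use is \(C_{\text{out}}K_hK_w\) rather than \((C_{\text{out}}/g)K_hK_w\).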
BatchNorm and Conv-BN Folding
BatchNorm2d normalizes each channel independently. During training:
\[ \hat{x}_{b,c,i,j} = \frac{x_{b,c,i,j}-\mu_c}{\sqrt{\sigma_c^2+\epsilon}}, \qquad y_{b,c,i,j} = \gamma_c\hat{x}_{b,c,i,j}+\beta_c. \]
where \(\mu_c\) and \(\sigma_c^2\) are computed over the batch and spatial dimensions:
\[ \mu_c = \frac{1}{BHW}\sum_{b,i,j}x_{b,c,i,j}. \]
At inference, the running mean and variance are used instead. If the preceding layer is a convolution:
\[ z=W*x+b, \qquad y_c= \gamma_c\frac{z_c-\mu_c}{\sqrt{\sigma_c^2+\epsilon}}+\beta_c, \]
then the BN can be folded into the convolution:
\[ W'_c = \frac{\gamma_c}{\sqrt{\sigma_c^2+\epsilon}}W_c, \qquad b'_c = \frac{\gamma_c}{\sqrt{\sigma_c^2+\epsilon}}(b_c-\mu_c)+\beta_c. \]
Conv-BN folding replaces a Conv2d -> BatchNorm2d pair by one equivalent Conv2d during inference, using BatchNorm running statistics.
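This matters for deployment: after folding there is one fewer memory read/write pass, and the layer is easier for inference engines to fuse. Folding cannot be applied naively during training, because BN's batch statistics depend on the current mini-batch.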

    import torch

    def fold_conv_bn(conv, bn):
        # Per-output-channel scale: gamma / sqrt(running_var + eps).
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
        w = conv.weight * scale.view(-1, 1, 1, 1)
        if conv.bias is None:
            bias = torch.zeros_like(bn.running_mean)
        else:
            bias = conv.bias
        b = (bias - bn.running_mean) * scale + bn.bias
        return w, b

model.train() uses batch statistics and updates running statistics. model.eval() uses stored running statistics. Many unexplained jumps in validation accuracy turn out to be BatchNorm/Dropout mode bugs.
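Folding is easy to validate numerically in eval mode. The sketch below repeats the folding arithmetic inline so it runs on its own; the layer sizes and the randomized BN statistics are assumptions made only to give the check non-trivial numbers.

```python
import torch

torch.manual_seed(0)
conv = torch.nn.Conv2d(3, 8, 3, padding=1, bias=True)
bn = torch.nn.BatchNorm2d(8)
with torch.no_grad():
    # Give BN non-trivial running statistics and affine parameters.
    bn.running_mean.uniform_(-1, 1)
    bn.running_var.uniform_(0.5, 2.0)
    bn.weight.uniform_(0.5, 1.5)
    bn.bias.uniform_(-1, 1)
conv.eval()
bn.eval()   # eval mode: BN uses running statistics

fused = torch.nn.Conv2d(3, 8, 3, padding=1, bias=True)
with torch.no_grad():
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.copy_(conv.weight * scale.view(-1, 1, 1, 1))
    fused.bias.copy_((conv.bias - bn.running_mean) * scale + bn.bias)

x = torch.randn(4, 3, 16, 16)
with torch.no_grad():
    max_err = (bn(conv(x)) - fused(x)).abs().max().item()
```

Running the same comparison with bn.train() would show a mismatch, since batch statistics then replace the running ones.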
Residual Blocks
A basic residual block can be written as
\[ y=x+F(x). \]
In backpropagation,
\[ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} + \left(\frac{\partial F}{\partial x}\right)^\top \frac{\partial L}{\partial y}. \]
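Even if the Jacobian of \(F\) is temporarily badly behaved, the identity path gives the gradient a direct route. This is the core reason ResNets are easier to train than plain deep CNNs.
When the spatial size or channel count changes, the two branches cannot be added directly and a projection shortcut is needed: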

    import torch
    import torch.nn.functional as F

    class BasicBlock(torch.nn.Module):
        def __init__(self, cin, cout, stride):
            super().__init__()
            self.conv1 = torch.nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False)
            self.bn1 = torch.nn.BatchNorm2d(cout)
            self.conv2 = torch.nn.Conv2d(cout, cout, 3, padding=1, bias=False)
            self.bn2 = torch.nn.BatchNorm2d(cout)
            if cin == cout and stride == 1:
                self.proj = torch.nn.Identity()
            else:
                # 1x1 projection matches both channel count and spatial stride.
                self.proj = torch.nn.Sequential(
                    torch.nn.Conv2d(cin, cout, 1, stride=stride, bias=False),
                    torch.nn.BatchNorm2d(cout),
                )

        def forward(self, x):
            h = F.relu(self.bn1(self.conv1(x)))
            h = self.bn2(self.conv2(h))
            return F.relu(h + self.proj(x))

Pre-activation ResNet moves BN/ReLU in front of the convolutions, so the sum leaving the block is no longer clipped by a ReLU:
post-activation: conv -> bn -> relu -> conv -> bn -> add -> relu
pre-activation: bn -> relu -> conv -> bn -> relu -> conv -> add
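For very deep networks, pre-activation comes closer to a design in which the identity path is completely clean; for shallower networks the difference may be negligible, but understanding the structure helps when reading modern vision backbones.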
Upsampling and Transposed Convolution
Common upsampling methods:
| Method | Learnable? | Risk |
|---|---|---|
| nearest/bilinear interpolate | no | blurry or blocky |
| resize + conv | conv learnable | more stable |
| transposed conv | yes | checkerboard artifacts |
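A transposed convolution is not a true "deconvolution" (inverse); it is the transpose of the matrix of the linear map that an ordinary convolution applies to its input. It can learn to upsample, but a poorly chosen stride/kernel combination causes uneven overlap.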
Transposed Convolution Output Size
In one dimension, the ConvTranspose output length is
\[ n_{\text{out}} = (n_{\text{in}}-1)s -2p +d(k-1) +\operatorname{output\_padding} +1. \]
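Here output_padding does not append zeros to the end of the output; it selects one of several possible output shapes. It only resolves the shape ambiguity and does nothing about checkerboard artifacts.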
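The shape ambiguity is easy to see with two small helpers. The function names are illustrative; the formulas are the forward output-size formula and the ConvTranspose formula stated in this section.

```python
def conv_out_len(n, k, s=1, p=0, d=1):
    # Forward convolution output length.
    return (n + 2 * p - (d * (k - 1) + 1)) // s + 1

def conv_transpose_out_len(n_in, k, s=1, p=0, d=1, output_padding=0):
    # Transposed convolution output length.
    return (n_in - 1) * s - 2 * p + d * (k - 1) + output_padding + 1

# With k=3, s=2, p=1 the floor makes two input lengths collapse to one output.
down_from_7 = conv_out_len(7, k=3, s=2, p=1)
down_from_8 = conv_out_len(8, k=3, s=2, p=1)

# Going back up, output_padding picks which of the two lengths to restore.
up_default = conv_transpose_out_len(4, k=3, s=2, p=1)
up_padded = conv_transpose_out_len(4, k=3, s=2, p=1, output_padding=1)
```

Both 7 and 8 map to length 4 on the way down, and output_padding of 0 or 1 selects 7 or 8 on the way back.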
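Why do checkerboard artifacts appear? With stride \(s>1\), a transposed convolution can be understood as inserting \(s-1\) zeros between input positions and then applying an ordinary convolution. If the kernel size is not divisible by the stride, different output positions are covered a different number of times: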
stride=2, kernel=3
coverage pattern: 1,2,1,2,1,2,...
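As a result, some pixels inherently receive more accumulated terms than others. Common mitigations:
- use interpolate(..., mode="nearest" or "bilinear") + Conv2d;
- choose a configuration whose kernel size is divisible by the stride;
- in generative models, combine with normalization and anti-aliasing design;
- use pixel shuffle / sub-pixel convolution.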
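The coverage pattern can be counted directly. The sketch below is a plain-Python illustration for one dimension with no padding; the function name coverage is hypothetical.

```python
def coverage(n_in, k, s):
    """How many (input, kernel tap) pairs land on each output position."""
    n_out = (n_in - 1) * s + k
    counts = [0] * n_out
    for i in range(n_in):          # each input position ...
        for a in range(k):         # ... scatters into k consecutive outputs
            counts[i * s + a] += 1
    return counts

uneven = coverage(5, k=3, s=2)   # kernel size not divisible by stride
even = coverage(5, k=4, s=2)     # kernel size divisible by stride
```

Away from the borders, k=3 alternates between 1 and 2 contributions per output, while k=4 gives a constant 2, which is why divisible kernel sizes reduce the checkerboard tendency.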
Pixel Shuffle
Pixel shuffle first uses a convolution to produce \(r^2C\) channels, then rearranges those channels into spatial resolution:
[B, C*r*r, H, W] -> [B, C, H*r, W*r]
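It involves no zero insertion and is widely used in super-resolution. Its key assumption is that the channel dimension has already learned the content of each sub-pixel.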
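The rearrangement itself is just a view and a permute. The sketch below reproduces torch.nn.PixelShuffle by hand on a small tensor with arbitrary sizes, which makes the channel-to-sub-pixel mapping explicit.

```python
import torch

r, C, H, W = 2, 3, 4, 5
x = torch.arange(C * r * r * H * W, dtype=torch.float32).view(1, C * r * r, H, W)

y = torch.nn.PixelShuffle(r)(x)          # [1, C, H*r, W*r]

# Same rearrangement by hand:
# out[b, c, h*r + i, w*r + j] = in[b, c*r*r + i*r + j, h, w]
y_manual = (
    x.view(1, C, r, r, H, W)
    .permute(0, 1, 4, 2, 5, 3)           # -> [1, C, H, r, W, r]
    .reshape(1, C, H * r, W * r)
)
same = torch.equal(y, y_manual)
```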

    import torch

    # 64 -> 64 * r^2 channels with r = 2, then rearranged to 2x spatial resolution.
    up = torch.nn.Sequential(
        torch.nn.Conv2d(64, 64 * 4, kernel_size=3, padding=1),
        torch.nn.PixelShuffle(upscale_factor=2),
    )

For segmentation and restoration, a one-pixel alignment error can look like poor model quality. Track stride, padding, and crop conventions from input image to label grid.
Implementation Checklist
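When writing a CNN, check:
- whether the input layout is [B,C,H,W];
- whether the Conv2d.weight shape is read as [C_out,C_in/groups,K_h,K_w];
- whether the output size uses the dilated effective kernel;
- whether the padding matches the boundary assumption of the task;
- whether stride/pooling discards spatial resolution too early;
- whether the receptive field covers the context the task needs;
- whether channel counts are divisible by groups for grouped/depthwise conv;
- whether causal conv pads on the left only;
- whether upsampling avoids checkerboard artifacts;
- whether BatchNorm has the right semantics in train/eval mode;
- whether Conv-BN folding is used only for inference;
- whether shape, stride, and dtype agree on both sides of a residual add;
- whether channels_last, cuDNN benchmark, and AMP have been validated by profiling;
- whether custom convolution/upsampling ops have passed numerical forward and gradient checks.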