1.3 Computation Performance

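Optimizing deep learning performance is not as simple as "moving the code to the GPU." A single training step is a pipeline with several stages: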

storage -> CPU decode/augment -> host batch -> H2D copy
        -> GPU forward -> GPU backward -> optimizer -> logging/checkpoint

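If any stage is slow, the rest of the hardware waits on it. This section breaks PyTorch performance problems into five layers: memory, asynchrony, kernels, the data pipeline, and multi-GPU communication.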

Hardware Mental Model

A modern training machine has several tiers of resources:

Component	Strength	Bottleneck symptom
CPU cores	parsing, augmentation, dataloader workers	GPU utilization low
host RAM	dataset cache, dataloader queue	swapping, worker killed
PCIe/NVLink	CPU-GPU or GPU-GPU transfer	copy time dominates
GPU SMs	matrix/conv compute	low occupancy or tiny kernels
GPU HBM	high-bandwidth tensor memory	memory-bound kernels
disk/network	dataset/checkpoint IO	dataloader stalls

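The most common misdiagnosis is looking only at GPU memory usage. Full memory does not mean the compute units are saturated, and a high GPU utilization reading does not rule out a data pipeline bottleneck.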

Roofline: Compute-Bound or Memory-Bound

The speed ceiling of a kernel is usually set by two resources: peak compute and memory bandwidth. Define the arithmetic intensity:

\[ I = \frac{\text{FLOPs}}{\text{bytes moved}}. \]

If the hardware's peak compute is \(P_{\max}\) FLOP/s and its memory bandwidth is \(B_{\max}\) byte/s, the roofline approximation is

\[ \operatorname{throughput} \le \min(P_{\max}, I B_{\max}). \]

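When \(I\) is small, performance is limited by memory bandwidth; only when \(I\) is large enough can performance approach the peak throughput of the tensor cores / CUDA cores. The crossover sits at \(I = P_{\max}/B_{\max}\).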
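The roofline bound can be sketched in a few lines of Python. The hardware numbers below are hypothetical, chosen only to make the arithmetic easy to follow; substitute the peak compute and bandwidth of your own device.

```python
def roofline_flops_per_s(intensity, peak_flops, bandwidth):
    """Attainable FLOP/s under the roofline model: min(P_max, I * B_max)."""
    return min(peak_flops, intensity * bandwidth)


def ridge_point(peak_flops, bandwidth):
    """Arithmetic intensity (FLOP/byte) at which the two roofs meet."""
    return peak_flops / bandwidth


# Hypothetical accelerator, not a real product spec.
PEAK = 100e12        # 100 TFLOP/s
BANDWIDTH = 1e12     # 1 TB/s

ridge = ridge_point(PEAK, BANDWIDTH)
low = roofline_flops_per_s(1 / 12, PEAK, BANDWIDTH)    # far left of the ridge: memory-bound
high = roofline_flops_per_s(1000.0, PEAK, BANDWIDTH)   # right of the ridge: compute-bound
print(ridge, low / PEAK, high / PEAK)
```

Under these assumed numbers, a kernel with intensity well below the ridge point can reach only a small fraction of peak compute, no matter how well its arithmetic is tuned.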

Definition: Arithmetic Intensity

Arithmetic intensity is the ratio between floating-point work and memory traffic: \(I=\text{FLOPs}/\text{bytes moved}\).

Examples:

Operation	Typical bottleneck
large matmul	compute-bound if shape is tensor-core friendly
elementwise add/relu	memory-bound
layernorm/softmax	often memory/reduction-bound
small matrix ops in Python loop	launch overhead + poor occupancy

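This explains a common observation: switching an elementwise op to FP16 does not necessarily make it twice as fast, because it may already be limited by memory bandwidth, so any gain comes only from moving fewer bytes. A large matrix multiply that hits the tensor cores benefits far more from low precision.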

Concrete Intensity Estimates

The value of the roofline model is that it tells you in advance whether an optimization direction makes sense. Start with an elementwise add:

\[ y_i=x_i+z_i. \]

Each element costs about 1 FLOP. In FP32, reading \(x_i\) and \(z_i\) takes 4 bytes each and writing \(y_i\) takes 4 bytes, for about 12 bytes of memory traffic in total, so

\[ I_{\text{add}} \approx \frac{1}{12} \text{ FLOP/byte}. \]

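This is very low, so the op is almost certainly memory-bound. Rewriting it as more elaborate Python code will not help; what does help is reducing the number of reads and writes, fusing it with neighboring ops, and, where in-place semantics are safe, eliminating intermediate tensors.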

Now consider a matrix multiply:

\[ C_{M\times N}=A_{M\times K}B_{K\times N}. \]

The FLOP count is approximately

\[ 2MKN. \]

Counting, roughly, only one read of \(A\) and \(B\) and one write of \(C\), the bytes moved in BF16 (2 bytes per element) are approximately

\[ 2(MK+KN+MN). \]

So the arithmetic intensity is approximately

\[ I_{\text{matmul}} \approx \frac{2MKN}{2(MK+KN+MN)} = \frac{MKN}{MK+KN+MN}. \]

With \(M=N=K=4096\),

\[ I_{\text{matmul}} \approx \frac{4096}{3} \approx 1365 \text{ FLOP/byte}. \]

Such a kernel is very likely compute-bound, and tensor core usage, shape alignment, and kernel selection matter more than raw memory bandwidth.
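The two estimates above can be reproduced with a short script. This is a sketch under the same simplification as the text (each operand is read once and the result is written once); the function names are illustrative.

```python
def add_intensity(bytes_per_elem=4):
    """Elementwise add: 1 FLOP per element, 2 reads + 1 write."""
    return 1 / (3 * bytes_per_elem)


def matmul_intensity(m, k, n, bytes_per_elem=2):
    """C[m,n] = A[m,k] @ B[k,n], counting one read of A and B and one write of C."""
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


print(add_intensity())                       # FP32 add
print(matmul_intensity(4096, 4096, 4096))    # large BF16 matmul
print(matmul_intensity(4096, 4096, 1))       # matrix-vector product
```

The last line is worth noting: with \(N=1\) the "matmul" degenerates into a matrix-vector product whose intensity is below 1 FLOP/byte, so under this model it behaves like a memory-bound op even though it is written as a matrix multiply.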

Definition: Memory-Bound Kernel

A memory-bound kernel is limited primarily by memory traffic rather than arithmetic throughput, so reducing bytes moved often matters more than reducing FLOPs.

Throughput Units

Do not report training performance only as seconds/step. Different tasks call for different units:

Unit	Best for	Caveat
samples/s	image/classification	sample cost must be similar
tokens/s	language modeling	depends on padding/packing
optimizer steps/s	training loop overhead	hides batch-size changes
TFLOP/s	compute utilization	requires correct FLOP estimate
MFU	LLM pretraining	model-specific FLOP convention

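For language models, tokens/s should count effective (non-padding) tokens rather than the padded sequence length. With \(m_{bt}\in\{0,1\}\) marking whether position \(t\) of sequence \(b\) holds a real token:

\[ \text{tokens/s} = \frac{\sum_{b,t} m_{bt}}{\text{step time}}. \]

If an optimization raises padded tokens/s but leaves effective tokens/s unchanged, it may only have changed the padding distribution or batch packing rather than improving real model throughput.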
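A minimal sketch of the difference between the two counts, using a plain nested list as the attention mask. The batch and the step time are hypothetical.

```python
def token_throughput(mask, step_time_s):
    """mask[b][t] is 1 for a real token and 0 for padding."""
    effective = sum(sum(row) for row in mask)
    padded = sum(len(row) for row in mask)
    return {
        "effective_tokens_per_s": effective / step_time_s,
        "padded_tokens_per_s": padded / step_time_s,
        "padding_ratio": 1 - effective / padded,
    }


# Hypothetical batch: 3 sequences padded to length 8, one 0.5 s step.
mask = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
]
stats = token_throughput(mask, step_time_s=0.5)
print(stats)
```

Reporting all three numbers together makes it visible when a throughput gain comes from processing more padding rather than more real tokens.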

Pitfall: Throughput Must Match the Objective Unit

For sequence models, report effective tokens/s together with padding ratio. Padded-token throughput can make inefficient batching look faster than it is.

Tensor Layout and Memory Locality

The core structure of a tensor is still:

\[ \text{Tensor} = (\text{storage},\text{shape},\text{stride},\text{dtype},\text{device}). \]

The addressing formula, where \(s_k\) is the stride of dimension \(k\), is:

\[ \operatorname{idx}(i_0,\ldots,i_{n-1}) = \operatorname{storage\_offset} + \sum_{k=0}^{n-1}i_ks_k. \]

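For performance, strides determine whether accesses are contiguous. On a GPU, if neighboring threads in a warp access neighboring addresses, the accesses coalesce into fewer, more efficient memory transactions; if the accesses are widely strided, bandwidth utilization drops.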
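The addressing formula can be checked without a GPU. The sketch below reimplements row-major strides in pure Python (the helper names are illustrative, not PyTorch APIs) and shows why a transposed view is non-contiguous.

```python
def contiguous_strides(shape):
    """Row-major strides, in elements, for a densely packed tensor."""
    strides, step = [], 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def flat_index(index, strides, storage_offset=0):
    """idx = storage_offset + sum_k i_k * s_k."""
    return storage_offset + sum(i * s for i, s in zip(index, strides))


shape = (32, 128, 768)
strides = contiguous_strides(shape)

# Transposing dims 1 and 2 swaps the shape and stride entries; storage is untouched.
t_shape = (shape[0], shape[2], shape[1])
t_strides = (strides[0], strides[2], strides[1])

print(strides, t_strides)
# A view is contiguous only if its strides match the dense strides of its shape.
print(t_strides == contiguous_strides(t_shape))
# One step along the last dim of the transposed view jumps this many elements in storage:
print(flat_index((0, 0, 1), t_strides) - flat_index((0, 0, 0), t_strides))
```

In the transposed view, neighboring elements along the last dimension are a full row apart in storage, which is the access pattern that defeats coalescing.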

Definition: Coalesced Access

Coalesced access means neighboring GPU threads access neighboring memory addresses, allowing hardware to combine memory transactions efficiently.

This is why a permute or transpose is often followed by contiguous():

x = torch.randn(32, 128, 768, device="cuda")
y = x.transpose(1, 2)       # non-contiguous
z = y.contiguous()          # physical reorder

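On a non-contiguous tensor, contiguous() performs a real copy, so do not call it casually inside an inner loop. It is better to settle the data layout before entering the hot path.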

Memory Allocation and Peak Tracking

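PyTorch's CUDA caching allocator holds on to freed GPU memory, so the usage shown by nvidia-smi reflects memory reserved by the allocator rather than the actual size of the currently live tensors. Two quantities are commonly tracked during training: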

torch.cuda.reset_peak_memory_stats()
loss = train_step(batch)
peak = torch.cuda.max_memory_allocated()
reserved = torch.cuda.max_memory_reserved()

Metric	Meaning
allocated	live tensors currently held by PyTorch
max allocated	peak live tensor memory since reset
reserved	memory reserved by caching allocator
max reserved	peak reserved memory

Pitfall: Reserved Memory Is Not a Leak by Itself

CUDA reserved memory can stay high because PyTorch caches blocks for reuse. A real leak usually shows max_memory_allocated() or live tensor references growing over steps.

Common sources of peak memory:

activations saved for backward;
gradients;
optimizer states;
temporary tensors from non-fused ops;
hidden copies from contiguous(), reshape, advanced indexing, dtype casts;
logging code holding graph-attached tensors in a Python list.

Training Memory Ledger

Training memory can be split into a static part and a dynamic part. The static part depends mainly on the parameter count \(P\):

Item	BF16/FP16 bytes per param	FP32 bytes per param
parameter	2	4
gradient	2	4
Adam first moment	4	4
Adam second moment	4	4
master weight if used	4	0 or 4

In typical AdamW mixed-precision training, each parameter may need:

\[ 2\text{ bytes param} +2\text{ bytes grad} +4\text{ bytes }m +4\text{ bytes }v +4\text{ bytes master} =16\text{ bytes}. \]

So for a model with \(P=1\)B parameters, these states alone may take about

\[ 16P\approx16\text{ GB}. \]

This does not yet include activations, temporary tensors, CUDA workspaces, or fragmentation. Inference has no gradients or optimizer states, so its memory ledger looks completely different.
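The static part of the ledger is simple enough to script. This is a sketch using the per-parameter byte counts from the table; the function names and keyword arguments are illustrative.

```python
def static_bytes_per_param(param=2, grad=2, adam_m=4, adam_v=4, master=4):
    """Bytes of static training state per parameter (defaults: the 16-byte mixed-precision case)."""
    return param + grad + adam_m + adam_v + master


def static_memory_gb(num_params, **bytes_per_item):
    """Static state in GB, using 1 GB = 1e9 bytes."""
    return num_params * static_bytes_per_param(**bytes_per_item) / 1e9


mixed = static_memory_gb(1e9)                                        # mixed-precision AdamW
fp32 = static_memory_gb(1e9, param=4, grad=4, master=0)              # pure FP32 AdamW
infer = static_memory_gb(1e9, grad=0, adam_m=0, adam_v=0, master=0)  # BF16 weights only
print(mixed, fp32, infer)
```

Under these byte counts, mixed-precision AdamW with FP32 master weights uses the same static memory as pure FP32 AdamW; the memory savings of mixed precision come mainly from activations, while inference needs only the weights.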

Definition: Memory Ledger

A memory ledger decomposes peak memory into parameters, gradients, optimizer states, activations, temporary tensors, workspaces, and allocator fragmentation.

Activation Scaling

Activation memory usually grows with the batch size \(B\), sequence length \(T\), hidden size \(d\), and number of layers \(L\). Roughly:

\[ M_{\text{act}} \propto B\cdot T\cdot d\cdot L\cdot \text{bytes}. \]

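A real Transformer also saves attention probabilities, intermediate MLP activations, norm inputs, dropout masks, and so on. Activation checkpointing stores fewer activations during training and recomputes them in the backward pass. Common techniques and their trade-offs: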

Technique	Saves	Pays
activation checkpointing	activation memory	extra forward compute
gradient accumulation	activation per micro-batch	more steps per update
sequence packing	padding activation waste	packing complexity
FlashAttention	attention matrix memory	kernel constraints

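If the OOM threshold tracks \(B\) and \(T\) roughly linearly, the cause is usually activations; if it depends strongly on the choice of optimizer, it is usually optimizer states; if it appears only for certain shapes, it may be a workspace or temporary tensor.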

Leak Versus Legitimate Growth

A common misdiagnosis is treating allocator warmup as a leak. A real leak usually shows up as live tensor memory that keeps growing step after step:

torch.cuda.reset_peak_memory_stats()
for step, batch in enumerate(loader):
    loss = train_step(batch)
    if step % 10 == 0:
        torch.cuda.synchronize()
        print(
            step,
            torch.cuda.memory_allocated(),
            torch.cuda.max_memory_allocated(),
        )

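If memory_allocated() keeps growing, check:

whether loss, logits, or activations are appended directly to a Python list;
whether a hook stores outputs without .detach();
whether validation forgot torch.no_grad();
whether gradient accumulation forgot to call zero_grad on schedule;
whether an exception path skips the cleanup logic.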
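To separate warmup from sustained growth, one option is to fit a slope to the logged readings after discarding the first few. The sketch below is pure Python; the sample values, the warmup cutoff, and the tolerance are all hypothetical and would need tuning for a real run.

```python
def growth_per_step(samples):
    """Least-squares slope of (step, bytes) samples, in bytes per step."""
    n = len(samples)
    mean_x = sum(s for s, _ in samples) / n
    mean_y = sum(b for _, b in samples) / n
    cov = sum((s - mean_x) * (b - mean_y) for s, b in samples)
    var = sum((s - mean_x) ** 2 for s, _ in samples)
    return cov / var


def looks_like_leak(samples, skip_warmup=2, tolerance_bytes_per_step=1024):
    """Ignore the first few samples (allocator warmup), then test for steady growth."""
    steady = samples[skip_warmup:]
    return growth_per_step(steady) > tolerance_bytes_per_step


# Hypothetical readings of torch.cuda.memory_allocated(), taken every 10 steps.
warmup_then_flat = [(0, 1_000), (10, 5_000_000), (20, 5_000_000), (30, 5_000_000), (40, 5_000_000)]
steady_growth = [(0, 1_000), (10, 5_000_000), (20, 6_000_000), (30, 7_000_000), (40, 8_000_000)]

print(looks_like_leak(warmup_then_flat))
print(looks_like_leak(steady_growth))
```

The first series jumps once and then stays flat, which is typical of warmup; the second keeps climbing at a constant rate, which is the signature worth investigating.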

Pitfall: Logging Can Keep Graphs Alive

Appending graph-attached tensors for later logging keeps their whole autograd history. Log Python scalars or detached CPU tensors.

CUDA Asynchrony

PyTorch CUDA ops are usually enqueued asynchronously:

y = x @ w
z = y.relu()

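The Python thread does not necessarily wait for the GPU to finish before moving on. Synchronization happens at points such as:

torch.cuda.synchronize();
reading a Python scalar from a GPU tensor, e.g. loss.item();
printing a GPU tensor;
a GPU -> CPU copy.

A side effect of asynchrony is that some errors are deferred and only surface at a later synchronization point.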

Pitfall: Naive Timing Lies on CUDA

Wall-clock timing around CUDA ops without synchronization measures enqueue time, not actual GPU execution time.

A correct benchmark:

import time
import torch


def time_cuda(fn, warmup=10, repeat=50):
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeat):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeat

Or use CUDA events (elapsed_time returns milliseconds):

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
fn()
end.record()
torch.cuda.synchronize()
ms = start.elapsed_time(end)

Benchmark Hygiene

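A stable benchmark must control at least the following:

warmup: let CUDA context creation, kernel caching, and compile/fusion overhead happen first;
synchronization: measure GPU execution time, not enqueue time;
fixed shapes: avoid extra code paths introduced by dynamic shapes;
fixed dtype/layout: avoid accidentally timing casts or copies;
enough repeats: small kernels are very noisy;
no logging in hot path: item() and print force synchronization.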

def benchmark_step(train_step, batch, warmup=5, repeat=20):
    for _ in range(warmup):
        train_step(batch)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(repeat):
        train_step(batch)
    end.record()
    torch.cuda.synchronize()

    ms = start.elapsed_time(end) / repeat
    peak = torch.cuda.max_memory_allocated()
    return {"ms": ms, "peak_bytes": peak}

Pitfall: First Step Is Not Representative

The first few iterations often include CUDA context creation, kernel selection, memory-pool growth, and graph compilation. Report steady-state time after warmup.

CPU-GPU Transfer

In GPU training, the host-to-device (H2D) copy can become a hidden bottleneck. A typical setup:

loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,
    pin_memory=True,
    persistent_workers=True,
)

for batch in loader:
    batch = {
        k: v.to("cuda", non_blocking=True)
        for k, v in batch.items()
    }

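pin_memory=True makes the DataLoader place each batch in page-locked (pinned) host memory, which makes it easier for the GPU's DMA copy to run asynchronously. non_blocking=True only has a real effect when conditions such as a pinned source and a CUDA destination are met.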

If a batch contains many small tensors, per-copy launch overhead adds up. During collation, try to merge fields into a few large tensors.

Overlapping Copy and Compute

Ideally, the H2D copy of the next batch overlaps with GPU compute on the current batch. A simple prefetcher issues the copy on a separate CUDA stream:

class CudaPrefetcher:
    def __init__(self, loader, device):
        self.loader = iter(loader)
        self.device = device
        self.stream = torch.cuda.Stream()
        self.next_batch = None
        self.preload()

    def preload(self):
        try:
            batch = next(self.loader)
        except StopIteration:
            self.next_batch = None
            return
        with torch.cuda.stream(self.stream):
            self.next_batch = {
                k: v.to(self.device, non_blocking=True)
                for k, v in batch.items()
            }

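    def next(self):
        # Make the compute stream wait for the pending copy on the side stream.
        torch.cuda.current_stream().wait_stream(self.stream)
        batch = self.next_batch
        if batch is not None:
            # The tensors were allocated on the side stream; tell the caching
            # allocator that the current stream now uses them.
            for v in batch.values():
                v.record_stream(torch.cuda.current_stream())
        self.preload()
        return batch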

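This code only pays off when the DataLoader keeps up, the batches sit in pinned memory, each batch consists of a small number of tensors, and GPU compute is long enough to hide the copy. If the copy itself is short, the prefetcher may only add complexity.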

Pitfall: Async Copy Needs a Lifetime Contract

When using a separate CUDA stream, the source batch must stay alive until the copy finishes. Prematurely freeing or mutating host tensors can cause subtle bugs.

Kernel Launch and Fusion

GPUs are well suited to large matrix multiplies and convolutions, and poorly suited to large numbers of tiny Python-level ops:

# slow pattern: many tiny kernels
for i in range(x.shape[0]):
    y[i] = torch.relu(x[i] @ w)

It is better to batch the work:

y = torch.relu(x @ w)

Each PyTorch op may map to one or more kernel launches. Every kernel launch has a fixed overhead, so many elementwise ops on small tensors end up dominated by launch overhead and memory bandwidth.

torch.compile, TorchScript, nvFuser, Triton kernels, and fused optimizers all aim to reduce Python dispatch and the number of kernels:

compiled_model = torch.compile(model)

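But compilation is no free lunch: dynamic shapes, Python control flow, data-dependent branches, and non-standard ops can all cause graph breaks. When using it, compare:

compile time;
steady-state step time;
the number of graph breaks;
whether the numerics still match.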

`torch.compile` Cost Model

The benefit of torch.compile comes from turning a sequence of Python + eager ops into larger graphs, so the compiler can do fusion, layout planning, and kernel selection. But it has three kinds of cost:

Cost	When it appears	Symptom
compile latency	first time seeing a graph/shape	first step very slow
graph break	unsupported Python/op boundary	many small compiled regions
recompilation	new dynamic shape/control path	periodic long steps

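So when benchmarking compilation, report separately the first-step time \(t_{\text{first}}\), the steady-state step time \(t_{\text{steady}}\), and the number of compiled graphs \(N_{\text{graphs}}\):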

\[ t_{\text{first}}, \qquad t_{\text{steady}}, \qquad N_{\text{graphs}}. \]

If a training run lasts only a few dozen steps, the compile latency may never be amortized; for long runs, steady-state time is what matters.
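Whether compile latency is amortized is a one-line break-even calculation. The sketch below assumes you have already measured the three times yourself; the numbers shown are hypothetical, and recompilation is ignored.

```python
import math


def breakeven_steps(compile_latency_ms, eager_step_ms, compiled_step_ms):
    """Steps needed before compilation pays for itself; None if it never does."""
    saving_per_step = eager_step_ms - compiled_step_ms
    if saving_per_step <= 0:
        return None
    return math.ceil(compile_latency_ms / saving_per_step)


# Hypothetical measurements: 60 s of compile latency, 200 ms eager, 150 ms compiled.
steps_needed = breakeven_steps(60_000, 200, 150)
print(steps_needed)
```

A run shorter than the break-even step count is slower overall with compilation, even though its steady-state step time looks better.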

A common diagnostic workflow:

model = torch.compile(model, fullgraph=False)

for _ in range(warmup_compile_steps):
    train_step(batch)

result = benchmark_step(train_step, batch)

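Then enable graph-break logging or an explanation tool to locate the Python statements that break the graph. Typical sources of graph breaks:

a tensor-dependent Python if;
dynamically creating tensors of different shapes;
.item() pulling a GPU value back into Python;
list/dict control flow that depends on tensor values;
a custom op without a decomposition.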

Pitfall: Compile Speedup Must Be Steady-State

Do not report a compile optimization without separating compile latency from steady-state step time and checking for recompilation under real input shapes.

Dynamic Shapes

NLP training often involves variable-length sequences. If every batch has a different \(T\), the compiler may see many shapes. Three common strategies:

Strategy	Benefit	Cost
pad to fixed length	fewer graphs, stable kernels	padding waste
bucket by length	fewer shapes, less waste	sampler/collate complexity
dynamic-shape compile	fewer recompiles	may reduce optimization opportunities

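So padding and packing are not only data pipeline concerns; they also affect kernel selection and the compile cache. Performance experiments should record the sequence length distribution as well, otherwise step times from different runs are hard to compare.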
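The padding cost of the first two strategies can be estimated offline from the length distribution alone. This sketch pads each batch to its own longest sequence and uses a global sort as a stand-in for bucketing; the lengths are hypothetical.

```python
def padded_tokens(lengths, batch_size):
    """Total token slots when each batch is padded to its own longest sequence."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += max(batch) * len(batch)
    return total


def padding_waste(lengths, batch_size):
    """Fraction of token slots that hold padding."""
    return 1 - sum(lengths) / padded_tokens(lengths, batch_size)


# Hypothetical sequence lengths in arrival order.
lengths = [12, 500, 30, 480, 16, 510, 25, 490]

unsorted_waste = padding_waste(lengths, batch_size=4)
bucketed_waste = padding_waste(sorted(lengths), batch_size=4)   # group similar lengths
print(round(unsorted_waste, 3), round(bucketed_waste, 3))
```

A full sort removes most of the padding here but also removes shuffling, so real samplers usually bucket within shuffled chunks; that refinement is left out of this sketch.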

Shape Matters for Tensor Cores

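Low-precision matrix multiplies are fast only when the shapes can use the tensor cores effectively. Large, regular matrices generally saturate the hardware more easily: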

[B*T, d] @ [d, 4d]

which is better than looping over many small matrices:

for token in T:
    [B, d] @ [d, 4d]

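In practice, aligning the hidden size, head dimension, and batch-token product to hardware-friendly multiples often matters more than having "the same theoretical FLOPs." For example, merging multiple tokens or samples into a single GEMM raises occupancy and lowers launch overhead.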
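Rounding a dimension up to a chosen multiple is a small helper. The multiple of 64 below is an assumption for illustration only; the right value depends on the dtype, GPU generation, and library, so check the vendor guidance for your hardware.

```python
def round_up(value, multiple):
    """Smallest multiple of `multiple` that is >= value."""
    return ((value + multiple - 1) // multiple) * multiple


# Hypothetical model width chosen without alignment in mind.
hidden_size = 1000
aligned_hidden = round_up(hidden_size, 64)   # 64 is an assumed alignment, not a rule
print(aligned_hidden, aligned_hidden - hidden_size)
```

This kind of rounding is typically applied when choosing model dimensions, since changing them later changes the parameter shapes.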

Fusion: What It Actually Saves

Fusion usually saves three kinds of cost:

Python dispatch;
kernel launches;
HBM reads and writes of intermediate tensors.

For example, consider

\[ y=\operatorname{dropout}(\operatorname{gelu}(xW+b)) \]

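If every step materializes an intermediate tensor, the data is written back to HBM and read again at each step. A fused kernel can carry out more of the steps in registers or shared memory. The biggest gains usually come not from fewer FLOPs but from less memory traffic.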
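A rough byte count makes the saving concrete. The sketch below models only the elementwise tail of the expression (bias add, GELU, dropout) as a chain of unary ops; the tensor shape is hypothetical, and reads of the bias vector and dropout mask are ignored to keep the model simple.

```python
def elementwise_chain_bytes(num_elems, num_ops, bytes_per_elem=2, fused=False):
    """HBM bytes for a chain of unary elementwise ops over one tensor.

    Unfused: every op reads its input and writes its output.
    Fused: the chain reads the input once and writes the final output once.
    """
    passes = 1 if fused else num_ops
    return passes * 2 * num_elems * bytes_per_elem


# Hypothetical activation of shape [8192, 4096] in BF16, 3 elementwise ops.
n = 8192 * 4096
unfused = elementwise_chain_bytes(n, num_ops=3)
fused = elementwise_chain_bytes(n, num_ops=3, fused=True)
print(unfused / 1e6, fused / 1e6, unfused / fused)
```

Under this model the FLOP count is identical in both cases; only the memory traffic changes, by a factor equal to the number of fused ops.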

Definition: Kernel Fusion

Kernel fusion combines multiple tensor operations into fewer kernels, reducing launch overhead and intermediate memory traffic.

Mixed Precision

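The performance benefit of FP16/BF16 comes from two sources:

tensor cores run low-precision matrix multiplies faster;
activations and gradients take less memory and bandwidth.

Typical AMP usage: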

scaler = torch.cuda.amp.GradScaler()

for batch in loader:
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast("cuda", dtype=torch.float16):
        loss = compute_loss(model, batch)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

BF16 usually does not need a GradScaler, because its exponent range is much larger than FP16's:

with torch.autocast("cuda", dtype=torch.bfloat16):
    loss = compute_loss(model, batch)
loss.backward()
optimizer.step()

Mixed precision does not mean simply casting everything to half. Common conventions:

State	Common dtype
model weights	FP16/BF16 or FP32 master
optimizer moments	FP32
loss reductions	FP32
logits softmax	often FP32 internally
labels/indices	int64

Gradient Scaling and Overflow

FP16 has a small exponent range, so gradients can underflow; loss scaling multiplies the loss by a scale factor \(s\) before the backward pass:

\[ \tilde{L}=sL, \qquad \nabla\tilde{L}=s\nabla L. \]

The gradients are divided by \(s\) again before the optimizer step. If the gradients contain inf/NaN, GradScaler skips the step and lowers the scale.

scale_before = scaler.get_scale()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
scale_after = scaler.get_scale()
overflow = scale_after < scale_before

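If the scale drops frequently, the numerics are unstable. Common causes include a learning rate that is too large, an unsuitable loss reduction, softmax/logit overflow, and norm layers running in an unstable dtype.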
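The underflow that loss scaling protects against can be reproduced with the standard library, which can round a float to IEEE half precision via the struct format "e". The gradient value below is hypothetical, and the scale of 2**16 is used only as an example.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE half precision and back (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8          # hypothetical true gradient value, below FP16's smallest subnormal
scale = 65536.0      # 2**16, an example loss scale

unscaled = to_fp16(grad)                     # flushes to zero in FP16
recovered = to_fp16(grad * scale) / scale    # representable once scaled, then unscaled in float64

print(unscaled, recovered)
```

Without scaling the gradient is lost entirely; with scaling it survives the FP16 round trip with only a small relative rounding error.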

Mixed Precision Audit

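After enabling mixed precision, check at least:

whether the loss curve stays close to the FP32 baseline;
whether throughput actually improves;
whether peak memory goes down;
whether GradScaler overflows frequently;
whether norm/softmax/loss keep a stable dtype;
whether labels, indices, and masks have not been cast by mistake.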

DataLoader Bottlenecks

DataLoader performance depends on the dataset, the collate function, the workers, and IO. A simple measurement:

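import time

import torch

end = time.perf_counter()
for step, batch in enumerate(loader):
    data_time = time.perf_counter() - end   # time spent waiting for the batch
    loss = train_step(batch)
    torch.cuda.synchronize()                # otherwise step_time only measures enqueue time
    step_time = time.perf_counter() - end   # data wait + compute for this step
    end = time.perf_counter()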

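If data_time takes a large share of step_time, check first:

whether num_workers is too low or too high;
whether image decoding or text tokenization runs online;
whether collate uses Python loops and frequent small-tensor allocations;
whether data is read repeatedly from slow disk or network storage;
whether the batch contains many Python objects that cannot be pinned or copied.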

Cache Expensive Deterministic Preprocessing

If tokenization, resizing, or feature extraction is deterministic and reused across epochs, cache it offline or in an indexed binary format instead of recomputing inside every worker.

Worker Tuning and Queueing

A DataLoader can be viewed as a producer-consumer queue: CPU workers produce batches and the GPU consumes them. If the average time to produce a batch \(t_{\text{data}}\) exceeds the GPU step time \(t_{\text{gpu}}\), the GPU waits for data:

\[ t_{\text{step}} \approx \max(t_{\text{data}},t_{\text{gpu}}) \]

With a deep enough prefetch queue and enough workers, \(t_{\text{data}}\) is ideally hidden behind the previous step's GPU compute. Commonly tuned knobs:

Knob	Effect	Failure mode
`num_workers`	parallel CPU loading	too many workers cause context overhead/RAM pressure
`prefetch_factor`	queue depth per worker	too high increases RAM
`persistent_workers`	avoids worker restart each epoch	stale state if dataset mutates
`pin_memory`	faster async H2D path	only helps tensor batches
custom `collate_fn`	controls padding/stacking	Python loops can dominate
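The producer-consumer picture can be turned into a back-of-the-envelope model. The sketch below assumes perfectly parallel workers and a queue deep enough to hide loading, which real pipelines do not achieve; all timings are hypothetical.

```python
def steady_state_step_time(batch_load_s, num_workers, gpu_step_s):
    """t_step ~ max(t_data, t_gpu), with t_data = per-batch load time / workers.

    Assumes ideal worker scaling and a prefetch queue deep enough to overlap
    loading with GPU compute.
    """
    t_data = batch_load_s / num_workers
    return max(t_data, gpu_step_s)


def gpu_idle_fraction(batch_load_s, num_workers, gpu_step_s):
    """Share of each step during which the GPU waits for data."""
    step = steady_state_step_time(batch_load_s, num_workers, gpu_step_s)
    return 1 - gpu_step_s / step


# Hypothetical numbers: one worker needs 0.8 s per batch, the GPU needs 0.1 s per step.
for workers in (1, 4, 8, 16):
    print(workers, round(gpu_idle_fraction(0.8, workers, 0.1), 3))
```

The model is deliberately optimistic: it ignores CPU, RAM, and IO contention, so it gives an upper bound on what adding workers can achieve rather than a prediction.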

Pitfall: More Workers Is Not Monotonic

Increasing num_workers can slow training when CPU cores, RAM bandwidth, disk IO, or Python serialization become bottlenecks.

Collate Cost

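If collate performs many small Python operations, such as per-sample tokenization, element-by-element appends, or repeated creation of small tensors, it slows down the whole pipeline. Better strategies:

do deterministic work in offline preprocessing;
return NumPy arrays or already-shaped tensors from the dataset;
pad/stack once inside collate;
avoid putting complex Python objects in the batch;
for NLP, prefer a batched tokenizer or a pre-tokenized cache.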

Parallel Training

The basic steps of data parallelism:

split batch across GPUs
-> each GPU forward/backward
-> all-reduce gradients
-> each GPU optimizer step

With \(N\) GPUs, a per-GPU micro-batch of \(B\), and \(K\) gradient accumulation steps, the global batch is:

\[ B_{\text{global}} = N\times B\times K. \]

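In DDP, parameter gradients are grouped into buckets and each bucket is all-reduced. The communication volume therefore scales with the number of parameters, not with the amount of activations. Common optimizations in large-model training: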

Method	Saves	Cost
gradient accumulation	communication frequency	longer optimizer interval
bucket tuning	overlap comm/compute	tuning complexity
FSDP/ZeRO	optimizer/grad/param memory	more communication
tensor parallelism	per-device matmul size	collective inside layers
pipeline parallelism	layer memory	bubbles and schedule complexity

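The local machines used in this course are not suited to large-scale distributed training, but understanding DDP's synchronization semantics helps when reading about modern LLM training systems.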

Gradient Accumulation Timing

Gradient accumulation changes not only the batch size but also the timing of communication and optimizer updates:

optimizer.zero_grad(set_to_none=True)
for micro in range(accum_steps):
    loss = compute_loss(model, batch[micro]) / accum_steps
    loss.backward()
optimizer.step()

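Under DDP, all-reducing on every micro-step makes communication very frequent. Real training commonly uses no_sync() so that the earlier micro-steps skip synchronization and only the last micro-step synchronizes: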

from contextlib import nullcontext


for micro in range(accum_steps):
    ctx = model.no_sync() if micro < accum_steps - 1 else nullcontext()
    with ctx:
        loss = compute_loss(model, batch[micro]) / accum_steps
        loss.backward()
optimizer.step()

Pitfall: Scheduler Must See the Optimizer Step

When using gradient accumulation, LR scheduler, logging, and global-step counters should usually advance on optimizer steps, not micro-steps.
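The bookkeeping can be sketched without any framework: a generator that tells each micro-step whether it is the one that synchronizes gradients and advances the optimizer-step counter. The function names are illustrative.

```python
def accumulation_plan(num_micro_steps, accum_steps):
    """Yield (micro_step, sync_gradients, optimizer_step) for each micro-step.

    sync_gradients is True only on the last micro-step of each group, which is
    where a DDP all-reduce and the optimizer/scheduler step would happen.
    """
    optimizer_step = 0
    for micro in range(num_micro_steps):
        sync = (micro + 1) % accum_steps == 0
        yield micro, sync, optimizer_step
        if sync:
            optimizer_step += 1


def global_batch(num_gpus, micro_batch, accum_steps):
    """B_global = N * B * K."""
    return num_gpus * micro_batch * accum_steps


plan = list(accumulation_plan(num_micro_steps=8, accum_steps=4))
print(plan)
print(global_batch(num_gpus=8, micro_batch=4, accum_steps=4))
```

Keying the scheduler and logging on the third field rather than the first keeps learning-rate curves comparable across runs with different accumulation settings.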

Communication/Computation Overlap

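DDP packs gradients into buckets. Once all gradients in a bucket are ready, its all-reduce can start while the remaining layers are still running their backward pass. This is called overlapping communication with computation. If buckets are too small, launch and communication overhead grows; if they are too large, the overlap gets worse.

When reading a profile, a long all-reduce tail trailing the backward pass means communication is not well hidden; many small communication ops interleaved with GPU compute may also indicate poorly chosen buckets or model partitioning.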

Profiling Workflow

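Do not start by guessing the bottleneck. A recommended order:

run a correctness smoke test first;
fix the batch and the seed;
measure step time, data time, GPU utilization, and peak memory;
use torch.profiler to inspect CPU/GPU time;
change only one performance variable at a time;
compare throughput and numerical results.

A minimal profiler setup: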

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    profile_memory=True,
) as prof:
    for step, batch in enumerate(loader):
        train_step(batch)
        prof.step()
        if step >= 10:
            break

print(prof.key_averages().table(sort_by="cuda_time_total"))

Profiler Schedule and Trace Export

In real training you need not profile from step 0; you can skip the warmup and record only the steady-state phase:

schedule = torch.profiler.schedule(wait=2, warmup=2, active=4, repeat=1)

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    schedule=schedule,
    record_shapes=True,
    profile_memory=True,
    with_stack=False,
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./tb_trace"),
) as prof:
    for step, batch in enumerate(loader):
        train_step(batch)
        prof.step()
        if step >= 10:
            break

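Profiler output should be read from three complementary views:

View	Question
operator table	Which ops have the highest total time?
timeline trace	Is the CPU keeping the GPU fed?
memory profile	Which ops cause the peak allocations?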

Optimization Loop

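Performance optimization also needs experimental discipline:

fix the input shape, batch, and seed;
save a baseline: step time, tokens/s, samples/s, peak memory, loss;
state one bottleneck hypothesis;
change only one variable;
re-measure and compare confidence intervals or means over multiple runs;
if throughput improves but the loss gets worse, it does not count as a success;
if only first-step time improves, do not treat it as a steady-state improvement.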
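Comparing repeated measurements can be as simple as checking whether two intervals overlap. The sketch below uses a normal-approximation 95% interval, which is crude for five samples, and hypothetical timings; it is meant as a guard against reading noise as a speedup, not as a rigorous test.

```python
import statistics


def summarize(times_ms):
    """Mean and a rough 95% interval (normal approximation) for repeated timings."""
    mean = statistics.mean(times_ms)
    sem = statistics.stdev(times_ms) / len(times_ms) ** 0.5
    return mean, (mean - 1.96 * sem, mean + 1.96 * sem)


def clearly_faster(baseline_ms, candidate_ms):
    """True only if the candidate's interval lies entirely below the baseline's."""
    _, (base_lo, _) = summarize(baseline_ms)
    _, (_, cand_hi) = summarize(candidate_ms)
    return cand_hi < base_lo


# Hypothetical steady-state step times in milliseconds from repeated runs.
baseline = [101.0, 99.5, 100.2, 100.8, 99.9]
candidate = [92.1, 91.7, 93.0, 92.4, 91.9]
noisy = [100.5, 97.0, 103.2, 99.1, 100.9]

print(clearly_faster(baseline, candidate))
print(clearly_faster(baseline, noisy))
```

The second comparison fails even though one of the noisy runs is faster than every baseline run, which is the situation a single-measurement comparison would misreport.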

Performance Smoke Tests

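Performance code needs smoke tests too. They do not prove optimality, but they guard against "benchmarking the wrong thing" and "an optimization that changes semantics."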

Test 1: CUDA Timing Harness

import time

import torch


def measure_cuda_seconds(fn, warmup=5, repeat=20):
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(repeat):
        fn()
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / repeat

    if elapsed <= 0:
        raise AssertionError("invalid timing result")
    return elapsed

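The point of this test is not to compare specific numbers but to fix warmup, repeats, and synchronization behind a single entry point. In a real project it is better to wrap a shared benchmark_step and forbid hand-rolled timing elsewhere.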

Test 2: Optimization Preserves Numerics

Whether the change is torch.compile, AMP, fusion, or a different layout, first compare outputs or losses on a small batch:

def assert_close_loss(step_a, step_b, batch, atol=1e-4, rtol=1e-4):
    with torch.no_grad():
        loss_a = step_a(batch).detach().float().cpu()
        loss_b = step_b(batch).detach().float().cpu()
    if not torch.allclose(loss_a, loss_b, atol=atol, rtol=rtol):
        raise AssertionError((loss_a.item(), loss_b.item()))

For stochastic layers, call model.eval() or fix the dropout seed first. Higher throughput with changed loss semantics is not a performance optimization.

Test 3: No Graph-Tensor Logging Leak

def assert_logged_tensors_detached(log_items):
    for item in log_items:
        if torch.is_tensor(item) and item.grad_fn is not None:
            raise AssertionError("logged tensor still has autograd history")

If you need to keep sample outputs for analysis:

saved.append(logits.detach().float().cpu())

Do not save raw logits; otherwise memory growth may come from a Python list holding the entire computation graph.

Test 4: Dataloader Is Not the Bottleneck

def measure_data_fraction(loader, train_step, steps):
    end = time.perf_counter()
    data_total = 0.0
    step_total = 0.0
    for i, batch in enumerate(loader):
        data_done = time.perf_counter()
        data_total += data_done - end
        train_step(batch)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        step_end = time.perf_counter()
        step_total += step_end - data_done
        end = time.perf_counter()
        if i + 1 >= steps:
            break
    return data_total / max(data_total + step_total, 1e-12)

If the data fraction is high, optimize the dataset/collate/IO first; if it is low, move on to GPU kernels and memory. This test gives "low GPU utilization" a reproducible starting point.

Test 5: Peak Memory Budget

def assert_peak_memory_under(train_step, batch, budget_bytes):
    torch.cuda.reset_peak_memory_stats()
    train_step(batch)
    torch.cuda.synchronize()
    peak = torch.cuda.max_memory_allocated()
    if peak > budget_bytes:
        raise AssertionError(f"peak {peak} > budget {budget_bytes}")

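This test is well suited to protecting teaching experiments and assignment scripts: after changing the batch size, sequence length, or model width, you at least know whether the local memory budget has been exceeded.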

Implementation Pattern: Benchmark the Boundary

A useful performance test measures one boundary at a time: dataloader wait, H2D copy, forward/backward/step, peak memory, and logging overhead.

Implementation Checklist

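When optimizing PyTorch performance, check at least:

whether CUDA benchmarks use warmup + synchronize;
whether steady-state step time is recorded rather than first-step time;
whether loss.item(), print, or CPU copies appear in the hot path;
whether the DataLoader makes the GPU wait for batches;
whether H2D copies use pinned memory + non-blocking transfers;
whether batches made of many small tensors cause copy/launch overhead;
whether the tensor layout forces frequent contiguous() calls;
whether small ops can be batched or fused;
whether matmul/conv shapes can use the tensor cores effectively;
whether AMP/BF16 actually reduces step time and memory;
whether GradScaler overflows frequently;
whether peak allocated memory is inflated by activations, temporary copies, or references held by logging;
whether the DDP global batch, accumulation, and scheduler steps are consistent;
whether profiler evidence supports the current optimization direction;
whether torch.compile measurements separate first-step latency from steady-state time;
whether dynamic shapes cause recompilation or padding waste;
whether throughput is reported in effective tokens/s or the correct task unit;
whether loss/metrics still match the baseline after the optimization.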
