1.3 Computation Performance
Optimizing deep learning performance is not as simple as "moving the code to the GPU". A training step is a pipeline of several stages:
storage -> CPU decode/augment -> host batch -> H2D copy
-> GPU forward -> GPU backward -> optimizer -> logging/checkpoint
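If any stage is slow, the rest of the hardware waits. This section breaks PyTorch performance problems into five layers: memory, asynchrony, kernels, the data pipeline, and multi-GPU communication.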
Hardware Mental Model
A modern training machine has several tiers of resources:
| Component | Strength | Bottleneck symptom |
|---|---|---|
| CPU cores | parsing, augmentation, dataloader workers | GPU utilization low |
| host RAM | dataset cache, dataloader queue | swapping, worker killed |
| PCIe/NVLink | CPU-GPU or GPU-GPU transfer | copy time dominates |
| GPU SMs | matrix/conv compute | low occupancy or tiny kernels |
| GPU HBM | high-bandwidth tensor memory | memory-bound kernels |
| disk/network | dataset/checkpoint IO | dataloader stalls |
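The most common misjudgment is to look only at GPU memory usage. Full GPU memory does not mean full compute, and high GPU utilization does not rule out a data pipeline bottleneck.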
Roofline: Compute-Bound or Memory-Bound
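The speed ceiling of a kernel is usually set by two resources: peak compute and memory bandwidth. Define the arithmetic intensity:
\[ I = \frac{\text{FLOPs}}{\text{bytes moved}}. \]
If the hardware's peak compute is \(P_{\max}\) FLOP/s and its memory bandwidth is \(B_{\max}\) byte/s, the roofline bound is approximately
\[ \operatorname{throughput} \le \min(P_{\max}, I B_{\max}). \]
When \(I\) is small, performance is limited by memory bandwidth; only when \(I\) is large enough can performance approach the peak compute of the tensor cores / CUDA cores.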
Arithmetic intensity is the ratio between floating-point work and memory traffic: \(I=\text{FLOPs}/\text{bytes moved}\).
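The roofline bound can be written in a few lines of Python. This is a minimal sketch: the function names and the hardware numbers (100 TFLOP/s peak, 1 TB/s bandwidth) are illustrative assumptions, not measurements of any specific GPU.

```python
def roofline_bound(intensity, peak_flops, bandwidth):
    """Attainable FLOP/s under the roofline model: min(P_max, I * B_max)."""
    return min(peak_flops, intensity * bandwidth)


def ridge_point(peak_flops, bandwidth):
    """Intensity (FLOP/byte) above which a kernel can become compute-bound."""
    return peak_flops / bandwidth


# Hypothetical hardware, chosen only to make the arithmetic easy to follow.
P_MAX = 100e12  # 100 TFLOP/s
B_MAX = 1e12    # 1 TB/s

low = roofline_bound(0.1, P_MAX, B_MAX)      # memory-bound: limited by I * B_max
high = roofline_bound(1000.0, P_MAX, B_MAX)  # compute-bound: capped at P_max
print(low, high, ridge_point(P_MAX, B_MAX))
```

Under these assumed numbers, a kernel needs roughly 100 FLOP/byte before peak compute becomes the limiting resource.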
Examples:
| Operation | Typical bottleneck |
|---|---|
| large matmul | compute-bound if shape is tensor-core friendly |
| elementwise add/relu | memory-bound |
| layernorm/softmax | often memory/reduction-bound |
| small matrix ops in Python loop | launch overhead + poor occupancy |
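This explains a common observation: switching an elementwise op to FP16 does not necessarily make it twice as fast, because it may already be limited by memory bandwidth, whereas a large matmul that hits the tensor cores gains much more from lower precision.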
Concrete Intensity Estimates
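The value of the roofline model is that it tells you in advance whether an optimization direction makes sense. Start with elementwise add:
\[ y_i=x_i+z_i. \]
Each element costs about 1 FLOP. In FP32, reading \(x_i\) and \(z_i\) takes 4 bytes each and writing \(y_i\) takes 4 bytes, for about 12 bytes of memory traffic, so
\[ I_{\text{add}} \approx \frac{1}{12} \text{ FLOP/byte}. \]
This is very low, so the kernel is almost certainly memory-bound. Rewriting it as more elaborate Python code will not help; what helps is reducing the number of reads and writes, fusion, and, where in-place semantics are safe, eliminating intermediate tensors.
Now consider matrix multiplication:
\[ C_{M\times N}=A_{M\times K}B_{K\times N}. \]
The FLOP count is about
\[ 2MKN. \]
Counting, roughly, one read of \(A\) and \(B\) and one write of \(C\), the bytes moved in BF16 (2 bytes per element) are about
\[ 2(MK+KN+MN). \]
So the arithmetic intensity is approximately
\[ I_{\text{matmul}} \approx \frac{2MKN}{2(MK+KN+MN)} = \frac{MKN}{MK+KN+MN}. \]
For \(M=N=K=4096\),
\[ I_{\text{matmul}} \approx \frac{4096}{3} \approx 1365 \text{ FLOP/byte}. \]
This is very likely compute-bound, and tensor cores, shape alignment, and kernel selection matter more than raw memory bandwidth.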
A memory-bound kernel is limited primarily by memory traffic rather than arithmetic throughput, so reducing bytes moved often matters more than reducing FLOPs.
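The two estimates in this section can be reproduced numerically. The helper names below are hypothetical, and the byte counts follow the same simplification as the text: each operand is read once and the result is written once.

```python
def add_intensity(bytes_per_elem=4):
    """Elementwise add: 1 FLOP per element, two reads and one write."""
    return 1 / (3 * bytes_per_elem)


def matmul_intensity(m, k, n, bytes_per_elem=2):
    """C[m,n] = A[m,k] @ B[k,n], counting one read of A and B and one write of C."""
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


print(add_intensity())                     # FP32 elementwise add
print(matmul_intensity(4096, 4096, 4096))  # BF16 square matmul
print(matmul_intensity(1, 4096, 4096))     # matrix-vector case
```

The matrix-vector case comes out just under 1 FLOP/byte under this model, which is why a matmul with a very small \(M\) can behave like a memory-bound kernel even though it is a matmul.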
Throughput Units
Do not report training performance only as seconds/step. Different tasks need different units:
| Unit | Best for | Caveat |
|---|---|---|
| samples/s | image/classification | sample cost must be similar |
| tokens/s | language modeling | depends on padding/packing |
| optimizer steps/s | training loop overhead | hides batch-size changes |
| TFLOP/s | compute utilization | requires correct FLOP estimate |
| MFU | LLM pretraining | model-specific FLOP convention |
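For language models, tokens/s should count effective tokens, not the padded sequence length. With \(m_{bt}\) as the mask that is 1 for a real token and 0 for padding:
\[ \text{tokens/s} = \frac{\sum_{b,t} m_{bt}}{\text{step time}}. \]
If an optimization raises padded tokens/s while effective tokens/s stays the same, it may only have changed the padding distribution or batch packing rather than the real model throughput.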
For sequence models, report effective tokens/s together with padding ratio. Padded-token throughput can make inefficient batching look faster than it is.
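A minimal sketch of this reporting rule, using plain Python lists as the mask. The function name `token_throughput` and the toy numbers are assumptions for illustration; in a real training loop the mask would come from the batch and the step time from a synchronized timer.

```python
def token_throughput(mask, step_time_s):
    """mask: list of rows, 1 for a real token and 0 for padding."""
    effective = sum(sum(row) for row in mask)
    padded = sum(len(row) for row in mask)
    return {
        "effective_tokens_per_s": effective / step_time_s,
        "padded_tokens_per_s": padded / step_time_s,
        "padding_ratio": 1 - effective / padded,
    }


# Toy batch: 2 sequences padded to length 4, measured step time 0.5 s.
mask = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
]
stats = token_throughput(mask, step_time_s=0.5)
print(stats)
```

Reporting all three numbers together makes it visible when a speedup comes only from processing more padding.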
Tensor Layout and Memory Locality
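The core structure of a tensor is still:
\[ \text{Tensor} = (\text{storage},\text{shape},\text{stride},\text{dtype},\text{device}). \]
The addressing formula, where \(s_k\) is the stride of dimension \(k\):
\[ \operatorname{idx}(i_0,\ldots,i_{n-1}) = \operatorname{storage\_offset} + \sum_{k=0}^{n-1}i_ks_k. \]
For performance, the stride determines whether accesses are contiguous. If adjacent threads in a GPU warp access adjacent addresses, the accesses coalesce into more efficient memory transactions; with a large stride between accesses, bandwidth utilization drops.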
Coalesced access means neighboring GPU threads access neighboring memory addresses, allowing hardware to combine memory transactions efficiently.
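This is why a permute or transpose is often followed by contiguous():
x = torch.randn(32, 128, 768, device="cuda")
y = x.transpose(1, 2)  # non-contiguous view: same storage, swapped strides
z = y.contiguous()     # physical reorder into a new buffer
contiguous() is a real copy, so do not call it casually inside an inner loop. It is better to fix the data layout before entering the hot path.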
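The addressing formula can be checked without a GPU. The sketch below uses hypothetical helpers in plain Python (not PyTorch internals) to compute row-major strides for the shape used above and show what swapping two dimensions does to the step size along the last axis.

```python
def contiguous_strides(shape):
    """Row-major strides, in elements, for a freshly allocated tensor."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def flat_index(index, strides, storage_offset=0):
    """Addressing formula: storage_offset + sum_k i_k * s_k."""
    return storage_offset + sum(i * s for i, s in zip(index, strides))


shape = (32, 128, 768)
strides = contiguous_strides(shape)  # (98304, 768, 1)

# transpose(1, 2) swaps shape and stride entries without moving data.
t_shape = (shape[0], shape[2], shape[1])
t_strides = (strides[0], strides[2], strides[1])  # (98304, 1, 768)

# Stepping along the last dimension now jumps 768 elements in storage.
jump = flat_index((0, 0, 1), t_strides) - flat_index((0, 0, 0), t_strides)
print(strides, t_strides, jump)
```

A kernel that walks the last dimension of the transposed view touches addresses 768 elements apart, which is the strided access pattern that hurts coalescing.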
Memory Allocation and Peak Tracking
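PyTorch's CUDA allocator caches GPU memory, so the usage shown by nvidia-smi, which includes blocks reserved by the caching allocator, is not the true size of the currently live tensors. During training, two quantities are commonly tracked:
torch.cuda.reset_peak_memory_stats()
loss = train_step(batch)
peak = torch.cuda.max_memory_allocated()     # peak bytes of live tensors since reset
reserved = torch.cuda.max_memory_reserved()  # peak bytes held by the caching allocator
| Metric | Meaning |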
|---|---|
| allocated | live tensors currently held by PyTorch |
| max allocated | peak live tensor memory since reset |
| reserved | memory reserved by caching allocator |
| max reserved | peak reserved memory |
CUDA reserved memory can stay high because PyTorch caches blocks for reuse. A real leak usually shows max_memory_allocated() or live tensor references growing over steps.
Common sources of peak GPU memory:
- activations saved for backward;
- gradients;
- optimizer states;
- temporary tensors from non-fused ops;
- hidden copies from contiguous(), reshape, advanced indexing, and dtype casts;
- logging code holding graph-attached tensors in a Python list.
Training Memory Ledger
Training memory splits into a static part and a dynamic part. The static part depends mainly on the parameter count \(P\):
| Item | BF16/FP16 bytes per param | FP32 bytes per param |
|---|---|---|
| parameter | 2 | 4 |
| gradient | 2 | 4 |
| Adam first moment | 4 | 4 |
| Adam second moment | 4 | 4 |
| master weight if used | 4 | 0 or 4 |
In typical AdamW mixed-precision training, each parameter may need:
\[ 2\text{ bytes param} +2\text{ bytes grad} +4\text{ bytes }m +4\text{ bytes }v +4\text{ bytes master} =16\text{ bytes}. \]
So for a model with \(P=1\)B parameters, these states alone can take about
\[ 16P\approx16\text{ GB}. \]
This does not yet include activations, temporary tensors, CUDA workspace, or fragmentation. Inference has no gradients or optimizer states, so its memory ledger looks entirely different.
A memory ledger decomposes peak memory into parameters, gradients, optimizer states, activations, temporary tensors, workspaces, and allocator fragmentation.
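The static part of the ledger is simple enough to compute directly. This is a sketch under the byte counts from the table above; the dictionary keys and function names are hypothetical, and the inference figure assumes weights stored at 2 bytes per parameter.

```python
BYTES_PER_PARAM = {
    "param_bf16": 2,
    "grad_bf16": 2,
    "adam_m_fp32": 4,
    "adam_v_fp32": 4,
    "master_fp32": 4,
}


def static_training_bytes(num_params, ledger=BYTES_PER_PARAM):
    """Static memory only: no activations, temporaries, workspace, or fragmentation."""
    return num_params * sum(ledger.values())


def inference_bytes(num_params, bytes_per_param=2):
    """Weights only, assuming low-precision inference."""
    return num_params * bytes_per_param


P = 1_000_000_000
print(static_training_bytes(P) / 1e9, "GB for training states")
print(inference_bytes(P) / 1e9, "GB for inference weights")
```

Removing or changing an entry in the ledger (for example, dropping the master weights) shows immediately how much of the budget each state accounts for.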
Activation Scaling
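Activation memory usually grows with batch size, sequence length, hidden size, and number of layers. Roughly:
\[ M_{\text{act}} \propto B\cdot T\cdot d\cdot L\cdot \text{bytes}. \]
A real Transformer also saves attention probabilities, MLP intermediate activations, norm inputs, dropout masks, and so on. During training, checkpointing saves fewer activations and recomputes them in backward; the table lists this and related trade-offs: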
| Technique | Saves | Pays |
|---|---|---|
| activation checkpointing | activation memory | extra forward compute |
| gradient accumulation | activation per micro-batch | more steps per update |
| sequence packing | padding activation waste | packing complexity |
| FlashAttention | attention matrix memory | kernel constraints |
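If OOM scales linearly with \(B\) and \(T\), the cause is usually activations; if it correlates strongly with the choice of optimizer, it is usually optimizer states; if it appears only for certain shapes, it may be workspace or temporary tensors.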
Leak Versus Legitimate Growth
A common misjudgment is to treat allocator warmup as a leak. A real leak usually shows up as live tensor memory that keeps growing every step:
torch.cuda.reset_peak_memory_stats()
for step, batch in enumerate(loader):
    loss = train_step(batch)
    if step % 10 == 0:
        torch.cuda.synchronize()  # let queued work finish before reading stats
        print(
            step,
            torch.cuda.memory_allocated(),
            torch.cuda.max_memory_allocated(),
        )
If memory_allocated() keeps growing, check:
- whether loss, logits, or activations are appended directly to a Python list;
- whether a hook stores an output without .detach();
- whether validation forgot torch.no_grad();
- whether gradient accumulation forgot to call zero_grad on schedule;
- whether an exception path skips the cleanup logic.
Appending graph-attached tensors for later logging keeps their whole autograd history. Log Python scalars or detached CPU tensors.
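One way to separate warmup from a leak is to apply a simple rule to the logged memory_allocated() readings. The heuristic below is an assumption made for illustration (the warmup window and the 1 MiB growth threshold are arbitrary), not a PyTorch facility.

```python
def looks_like_leak(allocated_bytes, warmup_samples=2, min_growth_bytes=1 << 20):
    """Flag a leak if post-warmup samples never decrease and grow by a clear margin.

    allocated_bytes: readings of torch.cuda.memory_allocated(), one per logging point.
    """
    steady = allocated_bytes[warmup_samples:]
    if len(steady) < 3:
        return False  # not enough evidence yet
    never_drops = all(b >= a for a, b in zip(steady, steady[1:]))
    total_growth = steady[-1] - steady[0]
    return never_drops and total_growth >= min_growth_bytes


MB = 1 << 20
warmup_then_flat = [100 * MB, 400 * MB, 410 * MB, 410 * MB, 410 * MB, 410 * MB]
steady_growth = [100 * MB, 400 * MB, 420 * MB, 440 * MB, 460 * MB, 480 * MB]
print(looks_like_leak(warmup_then_flat), looks_like_leak(steady_growth))
```

The first series jumps during warmup and then stays flat, so it is not flagged; the second keeps climbing after warmup and is.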
CUDA Asynchrony
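PyTorch CUDA ops are usually enqueued asynchronously:
y = x @ w
z = y.relu()
The Python thread does not necessarily wait for the GPU to finish before moving on. Synchronization happens at these points:
- torch.cuda.synchronize();
- reading a Python scalar from a GPU tensor: loss.item();
- printing a GPU tensor;
- GPU -> CPU copies.
Because execution is asynchronous, some errors are deferred and only surface at a later synchronization point.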
Wall-clock timing around CUDA ops without synchronization measures enqueue time, not actual GPU execution time.
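A correct benchmark:
import time
import torch
def time_cuda(fn, warmup=10, repeat=50):
    # Warm up so one-time costs are excluded from the measurement.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()  # drain queued work before starting the clock
    start = time.perf_counter()
    for _ in range(repeat):
        fn()
    torch.cuda.synchronize()  # wait for the GPU before stopping the clock
    return (time.perf_counter() - start) / repeat
Or use CUDA events:
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
fn()
end.record()
torch.cuda.synchronize()  # both events must have completed before reading the time
ms = start.elapsed_time(end)  # milliseconds
Benchmark Hygiene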
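A stable benchmark controls at least:
- warmup: let CUDA context creation, kernel caching, and compile/fusion overhead happen first;
- synchronization: measure GPU execution time, not enqueue time;
- fixed shapes: avoid extra code paths introduced by dynamic shapes;
- fixed dtype/layout: avoid accidentally timing casts or copies;
- enough repeats: small kernels are very noisy;
- no logging in the hot path: item() and print synchronize.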
def benchmark_step(train_step, batch, warmup=5, repeat=20):
    for _ in range(warmup):
        train_step(batch)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(repeat):
        train_step(batch)
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end) / repeat
    peak = torch.cuda.max_memory_allocated()
    return {"ms": ms, "peak_bytes": peak}
The first few iterations often include CUDA context creation, kernel selection, memory-pool growth, and graph compilation. Report steady-state time after warmup.
CPU-GPU Transfer
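In GPU training, the host-to-device copy can become a hidden bottleneck. A typical setup:
from torch.utils.data import DataLoader
loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,
    pin_memory=True,
    persistent_workers=True,
)
for batch in loader:
    batch = {
        k: v.to("cuda", non_blocking=True)
        for k, v in batch.items()
    }
pin_memory=True makes the DataLoader place batches in page-locked host memory, which makes it easier for the GPU's DMA engine to copy them asynchronously. non_blocking=True only has a real effect when the conditions are met, for example a pinned source and a CUDA destination.
If a batch contains many small tensors, the per-copy launch overhead adds up. In collate, merge fields into a few large tensors where possible.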
Overlapping Copy and Compute
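Ideally, the H2D copy of the next batch overlaps with GPU compute on the current batch. A simple prefetcher issues the copy on a separate CUDA stream:
import torch
class CudaPrefetcher:
    def __init__(self, loader, device):
        self.loader = iter(loader)
        self.device = device
        self.stream = torch.cuda.Stream()  # side stream used only for H2D copies
        self.next_batch = None
        self.preload()
    def preload(self):
        try:
            batch = next(self.loader)
        except StopIteration:
            self.next_batch = None
            return
        with torch.cuda.stream(self.stream):
            self.next_batch = {
                k: v.to(self.device, non_blocking=True)
                for k, v in batch.items()
            }
    def next(self):
        # Make the compute stream wait until the pending copy has finished.
        torch.cuda.current_stream().wait_stream(self.stream)
        batch = self.next_batch
        if batch is not None:
            # Tell the allocator these tensors are now used on the compute stream.
            for v in batch.values():
                v.record_stream(torch.cuda.current_stream())
        self.preload()
        return batch  # None once the loader is exhausted
This code only pays off when the DataLoader keeps up, host memory is pinned, the batch does not consist of too many tensors, and GPU compute is long enough to hide the copy. If the copy itself is short, the prefetcher may only add complexity.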
When using a separate CUDA stream, the source batch must stay alive until the copy finishes. Prematurely freeing or mutating host tensors can cause subtle bugs.
Kernel Launch and Fusion
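GPUs are good at large matrix multiplications and convolutions, not at many tiny Python-level ops:
# slow pattern: many tiny kernels
for i in range(x.shape[0]):
    y[i] = torch.relu(x[i] @ w)
A better version batches the work:
y = torch.relu(x @ w)
Each PyTorch op may map to one or more kernel launches. A kernel launch has a fixed overhead, so many elementwise ops on small tensors end up dominated by launch overhead and memory bandwidth.
torch.compile, TorchScript, nvFuser, Triton kernels, and fused optimizers all aim to reduce Python dispatch and the number of kernels:
compiled_model = torch.compile(model)
But compile is no free lunch: dynamic shapes, Python control flow, data-dependent branches, and non-standard ops can all cause graph breaks. When using it, compare:
- compile time;
- steady-state step time;
- number of graph breaks;
- whether numerics still match.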
torch.compile Cost Model
The gains from torch.compile come from turning a sequence of Python + eager ops into a larger graph, so the compiler can do fusion, layout planning, and kernel selection. It has three kinds of cost:
| Cost | When it appears | Symptom |
|---|---|---|
| compile latency | first time seeing a graph/shape | first step very slow |
| graph break | unsupported Python/op boundary | many small compiled regions |
| recompilation | new dynamic shape/control path | periodic long steps |
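So a compile benchmark must report separately:
\[ t_{\text{first}}, \qquad t_{\text{steady}}, \qquad N_{\text{graphs}}. \]
If a training run lasts only a few dozen steps, compile latency may never be amortized; for long training runs, steady-state time is what matters.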
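The amortization argument can be made concrete with a break-even estimate. This sketch introduces one quantity that the text does not define, the eager-mode step time `t_eager`, and assumes a single compilation with no later recompiles; the timings are hypothetical.

```python
import math


def compile_break_even_steps(t_first, t_steady, t_eager):
    """Smallest step count n with t_first + (n - 1) * t_steady <= n * t_eager.

    Returns None if the compiled steady state is not faster than eager.
    """
    if t_steady >= t_eager:
        return None
    return max(1, math.ceil((t_first - t_steady) / (t_eager - t_steady)))


# Hypothetical timings in seconds, for illustration only.
n = compile_break_even_steps(t_first=60.0, t_steady=0.5, t_eager=0.75)
print(n)
```

If the planned run is shorter than this step count, the compiled version costs more wall-clock time overall even though its steady-state step is faster.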
A common diagnostic flow:
model = torch.compile(model, fullgraph=False)
for _ in range(warmup_compile_steps):
    train_step(batch)
result = benchmark_step(train_step, batch)
Then turn on graph-break logging or an explanation tool to find which Python statements break the graph. Typical sources of graph breaks:
- a tensor-dependent Python if;
- dynamically creating tensors of different shapes;
- .item() pulling a GPU value back into Python;
- list/dict control flow that depends on tensor values;
- a custom op without a decomposition.
Do not report a compile optimization without separating compile latency from steady-state step time and checking for recompilation under real input shapes.
Dynamic Shapes
NLP training often has variable-length sequences. If every batch has a different \(T\), the compiler may see many shapes. Three common strategies:
| Strategy | Benefit | Cost |
|---|---|---|
| pad to fixed length | fewer graphs, stable kernels | padding waste |
| bucket by length | fewer shapes, less waste | sampler/collate complexity |
| dynamic-shape compile | fewer recompiles | may reduce optimization opportunities |
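So padding/packing is not only a data pipeline concern; it also affects kernel selection and the compile cache. Performance experiments should record the sequence length distribution as well, otherwise step times from different runs are hard to compare.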
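The padding cost of a batching strategy can be estimated from the length distribution alone. In the sketch below, sorting by length stands in for bucketing (a real sampler would usually shuffle within buckets); the function names and toy lengths are assumptions.

```python
def padded_tokens(lengths, batch_size):
    """Pad each batch to its longest sequence and count total padded tokens."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        chunk = lengths[i:i + batch_size]
        total += max(chunk) * len(chunk)
    return total


def padding_waste(lengths, batch_size):
    """Fraction of padded tokens that are padding rather than real tokens."""
    padded = padded_tokens(lengths, batch_size)
    return 1 - sum(lengths) / padded


# Toy sequence lengths in arrival order.
lengths = [5, 120, 8, 110, 6, 128, 7, 115]
arrival_order = padding_waste(lengths, batch_size=4)
bucketed = padding_waste(sorted(lengths), batch_size=4)
print(round(arrival_order, 3), round(bucketed, 3))
```

With these toy lengths, grouping similar lengths cuts the padding share from about half of all tokens to under a tenth, at the cost of more distinct batch shapes.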
Shape Matters for Tensor Cores
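Low-precision matmul is fast only if the shapes use tensor cores effectively. Large, regular matrices usually saturate them more easily. One large GEMM such as
[B*T, d] @ [d, 4d]
is better than a loop of many small ones:
for token in T:
    [B, d] @ [d, 4d]
In practice, aligning hidden size, head dim, and the batch-token product to hardware-friendly multiples often matters more than "the theoretical FLOPs are the same". For example, merging multiple tokens or samples into a single GEMM raises occupancy and lowers launch overhead.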
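Aligning a dimension usually means rounding it up to the next multiple. A minimal helper is sketched below; the multiple of 64 is an assumption for illustration, since the best value depends on the GPU, dtype, and library version.

```python
def round_up(n, multiple):
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


# Hypothetical model dimension that is not aligned.
hidden = 1000
aligned = round_up(hidden, 64)
print(aligned, aligned - hidden)
```

Whether the extra padded width pays for itself should be checked with a steady-state benchmark rather than assumed.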
Fusion: What It Actually Saves
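Fusion usually saves three kinds of cost:
- Python dispatch;
- kernel launches;
- HBM reads and writes of intermediate tensors.
For example, for
\[ y=\operatorname{dropout}(\operatorname{gelu}(xW+b)) \]
if every step materializes an intermediate tensor, the data is written back to HBM again and again. A fused kernel can complete more of the steps in registers / shared memory. When the gain is largest, it is usually because memory traffic went down, not FLOPs.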
Kernel fusion combines multiple tensor operations into fewer kernels, reducing launch overhead and intermediate memory traffic.
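A rough traffic count shows where the saving comes from. This sketch covers only the elementwise tail of the example (bias add, GELU, dropout) and ignores the matmul itself as well as the small bias and mask reads; the function name and tensor size are assumptions.

```python
def elementwise_chain_bytes(num_elems, num_ops, bytes_per_elem=2, fused=False):
    """HBM traffic for a chain of unary elementwise ops on one tensor.

    Unfused: every op reads its input and writes its output.
    Fused: one read of the input and one write of the final output.
    """
    passes = 1 if fused else num_ops
    return passes * 2 * num_elems * bytes_per_elem


# Bias add -> GELU -> dropout applied to a [8192, 4096] BF16 activation.
n = 8192 * 4096
unfused = elementwise_chain_bytes(n, num_ops=3)
fused = elementwise_chain_bytes(n, num_ops=3, fused=True)
print(unfused / 1e6, fused / 1e6, unfused / fused)
```

The FLOP count is identical in both cases; only the number of round trips to HBM changes.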
Mixed Precision
The performance gains of FP16/BF16 come from two things:
- tensor cores run low-precision matmul faster;
- activations and gradients take less memory and bandwidth.
Typical AMP:
scaler = torch.cuda.amp.GradScaler()  # newer PyTorch releases prefer torch.amp.GradScaler("cuda")
for batch in loader:
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast("cuda", dtype=torch.float16):
        loss = compute_loss(model, batch)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
BF16 often does not need a GradScaler, because its exponent range is larger than FP16's:
with torch.autocast("cuda", dtype=torch.bfloat16):
    loss = compute_loss(model, batch)
loss.backward()
optimizer.step()
Mixed precision is not simply turning everything into half. Common conventions:
| State | Common dtype |
|---|---|
| model weights | FP16/BF16 or FP32 master |
| optimizer moments | FP32 |
| loss reductions | FP32 |
| logits softmax | often FP32 internally |
| labels/indices | int64 |
Gradient Scaling and Overflow
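FP16 has a small exponent range, so gradients can underflow. Loss scaling multiplies the loss by a scale \(s\) before backward:
\[ \tilde{L}=sL, \qquad \nabla\tilde{L}=s\nabla L. \]
Before the optimizer step, the gradients are divided by \(s\) again. If the gradients contain inf/NaN, GradScaler skips the step and lowers the scale.
scale_before = scaler.get_scale()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
scale_after = scaler.get_scale()
overflow = scale_after < scale_before  # the scale only drops after an inf/NaN step
If the scale drops frequently, the numerics are unstable. Common causes include a learning rate that is too large, an unsuitable loss reduction, overflow in softmax/logits, and unstable dtypes in norm layers.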
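The underflow that loss scaling protects against can be reproduced with the standard library alone, since the `struct` module's `"e"` format rounds to IEEE half precision. The gradient value and the scale of 65536 are illustrative assumptions.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


grad = 1e-8      # true gradient value, too small for FP16
scale = 65536.0  # loss scale s (illustrative value)

unscaled = to_fp16(grad)        # underflows to zero
scaled = to_fp16(grad * scale)  # representable after scaling
recovered = scaled / scale      # unscale in higher precision

print(unscaled, scaled, recovered)
```

Without scaling the gradient is lost entirely; with scaling it survives the FP16 round trip and is recovered to within FP16 rounding error once divided by \(s\) in higher precision.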
Mixed Precision Audit
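After enabling mixed precision, check at least:
- whether the loss curve stays close to the FP32 baseline;
- whether throughput actually improves;
- whether peak memory drops;
- whether GradScaler overflows frequently;
- whether norm/softmax/loss keep a stable dtype;
- whether labels, indices, and masks have not been cast by mistake.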
DataLoader Bottlenecks
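DataLoader performance depends on the dataset, collate, workers, and IO. A simple measurement:
import time
import torch
end = time.perf_counter()
for step, batch in enumerate(loader):
    data_time = time.perf_counter() - end  # time spent waiting for the batch
    loss = train_step(batch)
    torch.cuda.synchronize()  # otherwise step_time only measures enqueue time
    step_time = time.perf_counter() - end  # whole iteration, including data_time
    end = time.perf_counter()
If data_time takes a large share, check first:
- whether num_workers is too low or too high;
- whether image decoding / text tokenization runs online;
- whether collate uses Python loops and frequent small-tensor allocations;
- whether data is read repeatedly from slow disk or network;
- whether the batch contains many Python objects that cannot be pinned or copied.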
If tokenization, resizing, or feature extraction is deterministic and reused across epochs, cache it offline or in an indexed binary format instead of recomputing inside every worker.
Worker Tuning and Queueing
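A DataLoader can be viewed as a producer-consumer queue: CPU workers produce batches and the GPU consumes them. If the average production time \(t_{\text{data}}\) exceeds the GPU step time \(t_{\text{gpu}}\), the GPU waits for data:
\[ t_{\text{step}} \approx \max(t_{\text{data}},t_{\text{gpu}}) \]
With a deep enough prefetch queue and enough workers, \(t_{\text{data}}\) is ideally hidden behind the previous iteration's GPU compute. Commonly tuned knobs: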
| Knob | Effect | Failure mode |
|---|---|---|
| num_workers | parallel CPU loading | too many workers cause context overhead/RAM pressure |
| prefetch_factor | queue depth per worker | too high increases RAM |
| persistent_workers | avoids worker restart each epoch | stale state if dataset mutates |
| pin_memory | faster async H2D path | only helps tensor batches |
| custom collate_fn | controls padding/stacking | Python loops can dominate |
Increasing num_workers can slow training when CPU cores, RAM bandwidth, disk IO, or Python serialization become bottlenecks.
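The queueing picture gives a quick back-of-the-envelope estimate. The sketch below assumes ideal linear scaling with the number of workers, which is exactly the assumption that breaks down once CPU, RAM, or IO saturates; the function names and timings are hypothetical.

```python
import math


def step_time(t_data, t_gpu):
    """With enough prefetch, the slower side sets the pace."""
    return max(t_data, t_gpu)


def gpu_idle_fraction(t_data, t_gpu):
    """Share of each step the GPU spends waiting for data."""
    return max(0.0, t_data - t_gpu) / step_time(t_data, t_gpu)


def workers_needed(t_batch_one_worker, t_gpu):
    """Workers needed to hide loading, assuming ideal linear scaling."""
    return math.ceil(t_batch_one_worker / t_gpu)


# Hypothetical timings in seconds.
print(gpu_idle_fraction(t_data=0.30, t_gpu=0.20))
print(workers_needed(t_batch_one_worker=0.90, t_gpu=0.20))
```

Treat the worker count as a starting point for measurement, not a target: the measured data time decides whether adding workers still helps.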
Collate Cost
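If collate does a lot of small Python operations, such as per-sample tokenization, per-element appends, or repeatedly creating small tensors, it slows down the whole pipeline. Better strategies:
- preprocess deterministic work offline;
- return NumPy arrays or already-shaped tensors from the dataset;
- pad/stack once inside collate;
- avoid putting complex Python objects into the batch;
- for NLP, use a batched tokenizer or a pre-tokenized cache where possible.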
Parallel Training
The basic steps of data parallelism:
split batch across GPUs
-> each GPU forward/backward
-> all-reduce gradients
-> each GPU optimizer step
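With \(N\) GPUs, a per-GPU micro-batch of \(B\), and \(K\) gradient accumulation steps, the global batch is:
\[ B_{\text{global}} = N\times B\times K. \]
In DDP, parameter gradients are grouped into buckets and each bucket is all-reduced. The communication volume is on the order of the parameter count, not the activation size. Common optimizations in large-model training: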
| Method | Saves | Cost |
|---|---|---|
| gradient accumulation | communication frequency | longer optimizer interval |
| bucket tuning | overlap comm/compute | tuning complexity |
| FSDP/ZeRO | optimizer/grad/param memory | more communication |
| tensor parallelism | per-device matmul size | collective inside layers |
| pipeline parallelism | layer memory | bubbles and schedule complexity |
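The local machines for this course are not suited to large-scale distributed training, but understanding DDP's synchronization semantics helps in reading modern LLM training systems.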
Gradient Accumulation Timing
Gradient accumulation changes not only the batch size but also the timing of communication and optimizer steps:
optimizer.zero_grad(set_to_none=True)
for micro in range(accum_steps):
    loss = compute_loss(model, batch[micro]) / accum_steps
    loss.backward()
optimizer.step()
In DDP, if every micro-step all-reduces, communication becomes very frequent. Real training commonly uses no_sync() so that the earlier micro-steps skip synchronization and only the last micro-step synchronizes:
from contextlib import nullcontext
for micro in range(accum_steps):
    # Skip the gradient all-reduce on all but the last micro-step.
    ctx = model.no_sync() if micro < accum_steps - 1 else nullcontext()
    with ctx:
        loss = compute_loss(model, batch[micro]) / accum_steps
        loss.backward()
optimizer.step()
When using gradient accumulation, LR scheduler, logging, and global-step counters should usually advance on optimizer steps, not micro-steps.
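The bookkeeping can be checked with a small simulation that counts events instead of running a model. The function name and the configuration values are assumptions; the schedule mirrors a no_sync()-style loop in which only the last micro-step of each group synchronizes.

```python
def simulate_accumulation(num_micro_steps, accum_steps, world_size, micro_batch):
    """Count optimizer steps and gradient syncs under a no_sync()-style schedule."""
    optimizer_steps = 0
    grad_syncs = 0
    for micro in range(num_micro_steps):
        is_last_micro = (micro + 1) % accum_steps == 0
        if is_last_micro:
            grad_syncs += 1       # all-reduce only on the last micro-step
            optimizer_steps += 1  # scheduler and global step advance here
    return {
        "optimizer_steps": optimizer_steps,
        "grad_syncs": grad_syncs,
        "global_batch": world_size * micro_batch * accum_steps,
    }


stats = simulate_accumulation(
    num_micro_steps=32, accum_steps=4, world_size=8, micro_batch=16
)
print(stats)
```

A scheduler driven by micro-steps would advance four times too fast in this configuration, which is the kind of mismatch the rule above is meant to prevent.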
Communication/Computation Overlap
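DDP packs gradients into buckets. Once all gradients in a bucket are ready, its all-reduce can start while the remaining layers are still running backward. This is called overlapping communication with computation. If buckets are too small, launch/communication overhead grows; if they are too large, overlap gets worse.
When reading a profile, a long all-reduce tail after backward means communication is not well hidden; many small communications interleaved with GPU compute may also indicate a poor bucket size or model partitioning.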
Profiling Workflow
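Do not start by guessing the bottleneck. Recommended order:
- run a correctness smoke test first;
- fix the batch and the seed;
- measure step time, data time, GPU utilization, and peak memory;
- use torch.profiler to look at CPU/GPU time;
- change only one performance variable at a time;
- compare throughput and numerical results.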
A minimal profiler:
import torch
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    profile_memory=True,
) as prof:
    for step, batch in enumerate(loader):
        train_step(batch)
        prof.step()
        if step >= 10:
            break
print(prof.key_averages().table(sort_by="cuda_time_total"))
Profiler Schedule and Trace Export
Real training does not have to be profiled from step 0; you can skip warmup and record only the steady phase:
schedule = torch.profiler.schedule(wait=2, warmup=2, active=4, repeat=1)
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    schedule=schedule,
    record_shapes=True,
    profile_memory=True,
    with_stack=False,
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./tb_trace"),
) as prof:
    for step, batch in enumerate(loader):
        train_step(batch)
        prof.step()  # advances the wait/warmup/active schedule
        if step >= 10:
            break
Profiler output should be read from three views together:
| View | Question |
|---|---|
| operator table | which ops have the highest total time |
| timeline trace | whether the CPU keeps the GPU fed |
| memory profile | which ops cause the peak allocations |
Optimization Loop
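Performance optimization also needs experimental discipline:
- fix the input shape, batch, and seed;
- save a baseline: step time, tokens/s, samples/s, peak memory, loss;
- state one bottleneck hypothesis;
- change only one variable;
- re-measure and compare confidence intervals or means over multiple runs;
- if throughput improves but loss gets worse, it does not count as a success;
- if only first-step time improves, do not treat it as a steady-state improvement.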
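The "compare confidence intervals" step can be sketched with the standard library. The interval below uses a normal approximation, which is only rough for a handful of runs, and the step times are hypothetical; both the helper names and the decision rule are assumptions for illustration.

```python
import statistics


def summarize(samples):
    """Mean and a rough 95% interval half-width (normal approximation)."""
    mean = statistics.mean(samples)
    half_width = 1.96 * statistics.stdev(samples) / len(samples) ** 0.5
    return mean, half_width


def is_clear_speedup(baseline_ms, candidate_ms):
    """True only if the candidate interval lies entirely below the baseline interval."""
    base_mean, base_hw = summarize(baseline_ms)
    cand_mean, cand_hw = summarize(candidate_ms)
    return cand_mean + cand_hw < base_mean - base_hw


# Hypothetical step times in milliseconds from repeated runs.
baseline = [101.0, 99.0, 100.0, 102.0, 98.0]
faster = [90.0, 91.0, 89.0, 92.0, 88.0]
noisy = [99.0, 103.0, 97.0, 101.0, 100.0]
print(is_clear_speedup(baseline, faster), is_clear_speedup(baseline, noisy))
```

A change whose interval overlaps the baseline, like the second candidate here, should be re-measured with more repeats before it is reported as an improvement.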
Performance Smoke Tests
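Performance code needs smoke tests too. They do not prove optimality, but they guard against "the benchmark measured the wrong thing" and "the optimization changed the semantics".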
Test 1: CUDA Timing Harness
import time
import torch
def measure_cuda_seconds(fn, warmup=5, repeat=20):
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeat):
        fn()
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / repeat
    if elapsed <= 0:
        raise AssertionError("invalid timing result")
    return elapsed
The point of this test is not to compare specific numbers but to pin warmup, repeat, and synchronization behind one entry point. In a real project, prefer a single shared benchmark_step and forbid hand-written timing elsewhere.
Test 2: Optimization Preserves Numerics
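Whether the change is torch.compile, AMP, fusion, or a new layout, first compare outputs or loss on a small batch:
def assert_close_loss(step_a, step_b, batch, atol=1e-4, rtol=1e-4):
    # step_a and step_b are expected to return a loss without calling backward.
    with torch.no_grad():
        loss_a = step_a(batch).detach().float().cpu()
        loss_b = step_b(batch).detach().float().cpu()
    if not torch.allclose(loss_a, loss_b, atol=atol, rtol=rtol):
        raise AssertionError((loss_a.item(), loss_b.item()))
For stochastic layers, call model.eval() first or fix the dropout seed. Higher throughput with changed loss semantics is not a performance optimization.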
Test 3: No Graph-Tensor Logging Leak
def assert_logged_tensors_detached(log_items):
    for item in log_items:
        if torch.is_tensor(item) and item.grad_fn is not None:
            raise AssertionError("logged tensor still has autograd history")
If you need to save sample outputs for analysis:
saved.append(logits.detach().float().cpu())
Do not save raw logits. Otherwise GPU memory may grow because the Python list keeps the whole computation graph alive.
Test 4: Dataloader Is Not the Bottleneck
import time
import torch
def measure_data_fraction(loader, train_step, steps):
    end = time.perf_counter()
    data_total = 0.0
    step_total = 0.0
    for i, batch in enumerate(loader):
        data_done = time.perf_counter()
        data_total += data_done - end  # time spent waiting for the batch
        train_step(batch)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # include GPU execution, not just enqueue
        step_end = time.perf_counter()
        step_total += step_end - data_done
        end = time.perf_counter()
        if i + 1 >= steps:
            break
    return data_total / max(data_total + step_total, 1e-12)
If the data fraction is high, optimize dataset/collate/IO first; if it is low, move on to GPU kernels and memory. This test gives "low GPU utilization" a reproducible entry point.
Test 5: Peak Memory Budget
def assert_peak_memory_under(train_step, batch, budget_bytes):
    torch.cuda.reset_peak_memory_stats()
    train_step(batch)
    torch.cuda.synchronize()
    peak = torch.cuda.max_memory_allocated()
    if peak > budget_bytes:
        raise AssertionError(f"peak {peak} > budget {budget_bytes}")
This test is suited to protecting teaching experiments and homework scripts: after changing the batch, sequence length, or model width, you at least know whether the local GPU memory budget has been exceeded.
A useful performance test measures one boundary at a time: dataloader wait, H2D copy, forward/backward/step, peak memory, and logging overhead.
Implementation Checklist
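When optimizing PyTorch performance, check at least:
- whether CUDA benchmarks use warmup + synchronize;
- whether steady-state step time is recorded rather than first-step time;
- whether loss.item(), print, or CPU copies appear in the hot path;
- whether the DataLoader makes the GPU wait for batches;
- whether H2D copies use pinned memory + non-blocking;
- whether batches made of many small tensors cause copy/launch overhead;
- whether the tensor layout forces frequent contiguous() calls;
- whether small ops can be batched or fused;
- whether matmul/conv shapes use tensor cores effectively;
- whether AMP/BF16 really reduces step time and memory;
- whether GradScaler overflows frequently;
- whether peak allocated memory is inflated by activations, temporary copies, or logging references;
- whether DDP global batch, accumulation, and scheduler steps are consistent;
- whether profiler evidence supports the current optimization direction;
- whether torch.compile results separate first-step latency from steady state;
- whether dynamic shapes cause recompilation or padding waste;
- whether throughput is reported as effective tokens/s or the correct task unit;
- whether loss/metrics still match the baseline after optimization.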