1.1 Tensor Foundations

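The core object in PyTorch is the Tensor. A beginner can treat it as a multidimensional array, but writing real training code requires keeping four layers of meaning in view at once:

mathematical object: scalar/vector/matrix/high-order array;
storage object: a one-dimensional block of memory plus shape/stride/dtype/device;
autograd object: may carry a computation graph, a gradient, and leaf/non-leaf status;
kernel object: every operation is ultimately dispatched to a CPU/GPU kernel.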

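If all you remember is that "a tensor is a multidimensional array", many bugs look like black magic: why view() raises an error, why writing to a from_numpy() tensor changes the original array, why the result of expand() cannot be written in place, why .to("cuda") does not change the original variable, and why detach() still shares memory. This section folds all of these questions into a single model.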

Definition: Tensor in PyTorch

A PyTorch tensor is a typed, device-resident, strided view over a storage buffer, optionally participating in an autograd computation graph.

Mathematical View

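Mathematically, a tensor can be understood as a multilinear object; in deep-learning implementations we mostly work with its coordinate representation. Common ranks: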

Object	Shape example	Deep-learning meaning
scalar	`()`	loss, learning rate, logit for one class
vector	`(d,)`	embedding, feature vector
matrix	`(m, n)`	linear layer weight, token-by-channel table
3D tensor	`(B, T, C)`	batch of token hidden states
4D tensor	`(B, C, H, W)`	image mini-batch

Note that rank here means the number of axes, not matrix rank. torch.Tensor.dim() returns the number of dimensions in this sense.
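To make the distinction concrete, here is a minimal sketch that contrasts the number of axes with the linear-algebra rank of the same matrix. The identity matrix is only an illustrative choice.

```python
import torch

m = torch.eye(3)                                  # shape (3, 3)

num_axes = m.dim()                                # "rank" in the tensor sense
matrix_rank = int(torch.linalg.matrix_rank(m))    # rank in the linear-algebra sense

print(num_axes, matrix_rank)
```

The two numbers answer different questions: the first depends only on the shape, the second on the values.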

Construction Semantics

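Common constructors differ in more than spelling: they differ in whether they copy memory, whether they share an external object, and whether the result participates in autograd.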

Constructor	Copies data?	Shares memory?	Typical use
`torch.tensor(data)`	always	no	create an independent tensor from a Python list/array
`torch.as_tensor(data)`	only when needed	yes if possible	avoid a copy when dtype and device already match
`torch.from_numpy(arr)`	no	yes	zero-copy bridge from NumPy to Torch
`torch.empty(shape)`	alloc only	no	allocate without initializing
`torch.zeros/ones/full`	alloc + fill	no	explicit initialization
`torch.arange/linspace`	alloc + fill	no	build coordinates or indices

import numpy as np
import torch

arr = np.array([1, 2, 3], dtype=np.float32)
x = torch.from_numpy(arr)
arr[0] = 99
assert x[0].item() == 99

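This example shows that from_numpy shares memory. Sharing saves a copy, but it also means that any change to the external array shows up in the tensor, and vice versa.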

Pitfall: empty Means Uninitialized

torch.empty only allocates memory. Its values are whatever happened to be in that memory region. Use it only when you will overwrite every element before reading.

Copy, Alias, and Gradient Intent

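When constructing a tensor, three questions need to be answered together:

does it copy the data buffer;
does it share storage with an external object;
does it need autograd tracking.

For example: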

arr = np.arange(6, dtype=np.float32).reshape(2, 3)
x = torch.from_numpy(arr)        # shares memory
y = torch.tensor(arr)            # copies
z = torch.as_tensor(arr)         # shares when possible

If the source object is another tensor, torch.tensor(x) copies the data and drops the autograd history; the clearer spellings are:

y = x.clone()                    # copy, keep grad relation
z = x.detach().clone()           # copy, detach from graph

Pitfall: torch.tensor(existing_tensor)

torch.tensor(existing_tensor) makes a copy and detaches from the original graph. Prefer clone(), detach(), or detach().clone() to state the intended semantics explicitly.
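The difference between the two explicit spellings can be checked directly. The sketch below uses a small tensor chosen purely for illustration and inspects grad_fn and the resulting gradient.

```python
import torch

x = torch.ones(3, requires_grad=True)

y = x.clone()            # new buffer, still connected to x in the graph
z = x.detach().clone()   # new buffer, no autograd history

y.sum().backward()       # the gradient flows back through clone() to x

print(y.grad_fn is not None, z.requires_grad, x.grad)
```

Both results own fresh memory; only y can still send gradients back to x.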

Tensor Metadata

A tensor carries at least the following metadata:

x = torch.randn(2, 3, 4, device="cpu", dtype=torch.float32)

x.shape          # torch.Size([2, 3, 4])
x.stride()       # e.g. (12, 4, 1)
x.dtype          # torch.float32
x.device         # cpu
x.requires_grad  # False by default
x.numel()        # 24
x.element_size() # 4 bytes for float32

The memory footprint is approximately:

\[ \operatorname{bytes}(x) = \operatorname{numel}(x)\times \operatorname{element\_size}(x). \]

This is only the size of the data buffer. It does not include the autograd graph, temporary tensors, optimizer state, or allocator fragmentation.
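As a rough sketch of the formula above, the helper below (tensor_bytes is a hypothetical name, not a PyTorch API) computes the logical size of a tensor. It also shows that a view can be backed by a storage larger than the view itself.

```python
import torch

def tensor_bytes(x: torch.Tensor) -> int:
    """Logical size of the tensor's elements in bytes."""
    return x.numel() * x.element_size()

full = torch.zeros(2, 3, 4, dtype=torch.float32)
half = full.to(torch.float16)
row = full[0]   # a view: 12 elements, backed by the 24-element storage of `full`

full_bytes = tensor_bytes(full)
half_bytes = tensor_bytes(half)
row_bytes = tensor_bytes(row)
row_storage_bytes = row.untyped_storage().nbytes()   # size of the whole buffer

print(full_bytes, half_bytes, row_bytes, row_storage_bytes)
```

Keeping a small view alive therefore keeps the entire underlying buffer alive.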

Storage, Shape, Stride

The addressing formula for a dense PyTorch tensor is:

\[ \operatorname{addr}(i_0,\ldots,i_{n-1}) = \operatorname{base} + \left( \operatorname{storage\_offset} + \sum_{k=0}^{n-1}i_k s_k \right) \cdot \operatorname{sizeof(dtype)}, \]

where \(s_k\) is the stride of axis \(k\), measured in elements, not bytes.

For a contiguous row-major tensor of shape \((d_0,\ldots,d_{n-1})\):

\[ s_{n-1}=1, \qquad s_k=d_{k+1}s_{k+1}. \]

For example, the contiguous stride of (B, T, C) is:

\[ (TC,\ C,\ 1). \]

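In other words, adjacent elements along the last dimension are adjacent in memory. This is also why many kernels prefer the channel/hidden dimension to be last: adjacent threads can more easily read consecutive addresses.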
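The stride recurrence can be written out in a few lines. This is a sketch for illustration (contiguous_strides is a hypothetical helper), compared against what PyTorch reports for a freshly allocated tensor.

```python
import torch

def contiguous_strides(shape):
    """Row-major strides in elements: s[n-1] = 1, s[k] = d[k+1] * s[k+1]."""
    strides = []
    running = 1
    for dim in reversed(shape):
        strides.append(running)
        running *= dim
    return tuple(reversed(strides))

B, T, C = 2, 3, 4
expected = contiguous_strides((B, T, C))
actual = torch.empty(B, T, C).stride()

print(expected, actual)
```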

Definition: Contiguous Tensor

A tensor is contiguous when its strides match the default dense row-major layout for its shape.

Worked Address Example

Let:

x = torch.arange(12).reshape(3, 4)
y = x[1:, 1:3]

x has shape (3,4) and stride (4,1). The slice y has shape (2,2) and still has stride (4,1), but its storage offset becomes \(5\), because x[1,1] is the element at index \(1\cdot4+1=5\) of the original storage.

So:

\[ y_{i,j} \leftrightarrow \operatorname{storage}[5+4i+j]. \]

assert y.storage_offset() == 5
assert y.stride() == (4, 1)

This is why slicing is usually cheap: it only creates new metadata. The cost is that later kernels may face non-contiguous access.
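The mapping from y's indices to storage positions can be replayed by hand. The sketch below rebuilds y from the flat storage order using only the offset and strides; because x is contiguous, x.flatten() lists the storage in order.

```python
import torch

x = torch.arange(12).reshape(3, 4)
y = x[1:, 1:3]

storage = x.flatten().tolist()     # storage order of the shared buffer
offset = y.storage_offset()
s0, s1 = y.stride()

# Apply storage[offset + s0*i + s1*j] for every logical index (i, j) of y.
recovered = [
    [storage[offset + s0 * i + s1 * j] for j in range(y.shape[1])]
    for i in range(y.shape[0])
]

print(recovered)
```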

`as_strided` and Overlapping Views

as_strided lets you specify shape, stride, and offset by hand; it is the low-level form underlying every view trick:

x = torch.arange(6)
windows = x.as_strided(size=(4, 3), stride=(1, 1))

Logically, windows is:

[[0, 1, 2],
 [1, 2, 3],
 [2, 3, 4],
 [3, 4, 5]]

Here different logical positions point to the same storage element. For example, windows[0,1] and windows[1,0] are both x[1].

Definition: Overlapping View

An overlapping view is a tensor view where two or more logical indices map to the same storage location.

Pitfall: as_strided Can Express Invalid Semantics

as_strided can create overlapping layouts, and layouts that reach storage elements outside the logical extent of the tensor it was called on. It is powerful for implementing unfold/window views, but in-place writes on overlapping views are usually wrong.
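For sliding windows specifically, Tensor.unfold expresses the same view without hand-written strides. The sketch below compares the two on the six-element example and clones before writing, as one way to avoid writing into an overlapping view.

```python
import torch

x = torch.arange(6)

manual = x.as_strided(size=(4, 3), stride=(1, 1))
windows = x.unfold(0, 3, 1)        # dimension=0, window size=3, step=1

# Materialize before modifying, so that each logical element owns its memory.
independent = windows.clone()
independent[0, 1] = 100

print(torch.equal(manual, windows), x.tolist())
```

The clone costs memory, but a write to one window no longer leaks into its neighbours or into x.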

Views and Copies

View operations change only metadata; they do not copy the data buffer. Typical view operations:

x = torch.arange(12).reshape(3, 4)
y = x[:, 1:3]
z = x.transpose(0, 1)

y and z still point to x's storage; only shape, stride, and offset differ. You can check this with the storage pointer:

same_storage = (
    x.untyped_storage().data_ptr()
    == y.untyped_storage().data_ptr()
)

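view() only reinterprets the shape, so it requires the new shape to be expressible with the existing strides; a contiguous tensor always qualifies. reshape() is more permissive and copies when necessary:

x = torch.arange(12).reshape(3, 4)
y = x[:, 1:3]                # non-contiguous: shape (3, 2), stride (4, 1)

# y.view(-1)                 # raises RuntimeError: rows are not adjacent in storage
z = y.reshape(-1)            # copies here, because no view is possible
w = y.contiguous().view(-1)  # explicit copy, then a cheap view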

Pitfall: reshape May Hide a Copy

reshape returns a view when possible and a copy when necessary. If memory aliasing or performance matters, check .is_contiguous() and storage pointers explicitly.
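Contiguity is sufficient for view() but not strictly necessary, which is one reason to check storage pointers rather than guess. The sketch below shows a non-contiguous tensor that can still be flattened as a view, next to one that cannot; shares_storage is a small helper defined here for the comparison.

```python
import torch

def shares_storage(a, b):
    return a.untyped_storage().data_ptr() == b.untyped_storage().data_ptr()

x = torch.arange(12).reshape(3, 4)

strided = x[:, ::2]          # shape (3, 2), stride (4, 2): not contiguous
flat = strided.view(-1)      # works: a single stride of 2 walks all six elements

transposed = x.t()           # shape (4, 3), stride (1, 4)
try:
    transposed.view(-1)
    view_failed = False
except RuntimeError:
    view_failed = True

copied = transposed.reshape(-1)   # reshape falls back to a copy

print(shares_storage(x, flat), view_failed, shares_storage(x, copied))
```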

Slicing vs. Advanced Indexing

Basic slicing usually returns a view:

x = torch.arange(12).reshape(3, 4)
a = x[:, 1:3]        # view

Advanced indexing usually returns a copy:

idx = torch.tensor([0, 2])
b = x[idx]           # copy
mask = x > 5
c = x[mask]          # copy, flattened selected values

You can check with the storage pointer:

def shares_storage(a, b):
    return a.untyped_storage().data_ptr() == b.untyped_storage().data_ptr()

assert shares_storage(x, a)
assert not shares_storage(x, b)

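Why does this matter? If you assume b = x[idx] is a view and modify b expecting the write to reach x, it will not; if you assume it is a cheap operation, it may silently produce a huge copy on a large batch.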
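The write-back behaviour can be seen in a small experiment. This sketch contrasts writing through a slice, writing into an indexed copy, and assigning through advanced indexing on the left-hand side, which does modify x in place.

```python
import torch

x = torch.arange(12).reshape(3, 4)
idx = torch.tensor([0, 2])

a = x[:, 1:3]       # basic slicing: a view
a[0, 0] = -1        # writes through to x[0, 1]

b = x[idx]          # advanced indexing on the right-hand side: a copy
b[0, 0] = 100       # x is unaffected
x00_after_copy_write = x[0, 0].item()

x[idx, 0] = 7       # advanced indexing on the left-hand side: writes into x

print(x)
```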

Permute, Contiguous, and Memory Format

permute / transpose change only the strides:

x = torch.randn(2, 3, 4)
y = x.permute(0, 2, 1)
assert y.shape == (2, 4, 3)
assert not y.is_contiguous()

If a kernel or .view() needs a contiguous tensor, materialize one explicitly:

z = y.contiguous()

For image tensors, besides the default contiguous NCHW layout, there is the channels_last memory format:

img = torch.randn(8, 3, 224, 224)
img_cl = img.to(memory_format=torch.channels_last)
assert img_cl.is_contiguous(memory_format=torch.channels_last)

With channels_last the shape is still [B,C,H,W]; only the strides look like NHWC. It may make GPU convolutions faster, but a custom kernel or a careless .view() can easily assume the wrong layout.

Pitfall: Same Shape, Different Stride

Two tensors can have the same shape and dtype but very different strides. Shape checks alone do not prove a tensor is compatible with a low-level kernel.
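A small illustration of this pitfall, using a tiny image batch whose sizes are arbitrary: both tensors compare equal element by element and report the same shape, yet their strides differ.

```python
import torch

nchw = torch.randn(2, 3, 4, 5)
nhwc = nchw.to(memory_format=torch.channels_last)

same_shape = nchw.shape == nhwc.shape
same_values = torch.equal(nchw, nhwc)

print(same_shape, same_values)
print(nchw.stride())   # default layout: W is the fastest-moving axis
print(nhwc.stride())   # channels_last: C is the fastest-moving axis
```

A shape assertion passes for both, so a kernel with layout requirements should also check stride() or is_contiguous(memory_format=...).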

Broadcasting as Stride Logic

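Broadcasting compares shapes from right to left. Two dimensions are broadcast-compatible if and only if they are equal, or one of them is 1, or one of them is missing.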

a = torch.randn(3, 1)
b = torch.randn(1, 4)
c = a + b       # shape (3, 4)
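The right-to-left rule is short enough to write out. The function below is a sketch for illustration (broadcast_shape is a hypothetical helper); torch.broadcast_shapes is the library routine it is compared against.

```python
import torch

def broadcast_shape(shape_a, shape_b):
    """Apply the right-to-left rule; a missing axis behaves like size 1."""
    result = []
    for i in range(1, max(len(shape_a), len(shape_b)) + 1):
        da = shape_a[-i] if i <= len(shape_a) else 1
        db = shape_b[-i] if i <= len(shape_b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
        result.append(db if da == 1 else da)
    return tuple(reversed(result))

simple = broadcast_shape((3, 1), (1, 4))
ragged = broadcast_shape((8, 1, 6), (7, 1))

print(simple, ragged)
```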

Broadcasting usually does not copy the smaller tensor; it uses a stride trick instead. expand() makes this semantics visible:

x = torch.arange(3).reshape(3, 1)
y = x.expand(3, 4)
assert y.stride()[1] == 0

A stride of 0 means the address does not change when moving along that dimension, so multiple logical positions share one physical element.

Pitfall: Expanded Views Are Not Writable Like Copies

An expanded dimension with stride 0 aliases the same storage location many times. In-place writes on expanded tensors are often invalid or semantically ambiguous.
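In practice PyTorch rejects such a write outright for common element-wise ops. The sketch below catches the error and then materializes the expanded tensor with clone(), which is one way to obtain a writable result.

```python
import torch

x = torch.arange(3).reshape(3, 1)
y = x.expand(3, 4)            # stride (1, 0): each row aliases a single element

try:
    y.add_(1)                 # would write the same memory location four times
    write_rejected = False
except RuntimeError:
    write_rejected = True

z = y.clone()                 # every logical element now has its own slot
z.add_(1)

print(write_rejected, z.stride(), x.flatten().tolist())
```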

Dtype and Device

dtype determines numeric range, precision, and kernel selection:

dtype	Bytes	Typical role
`float32`	4	default training weights, stable reductions
`float16`	2	mixed precision on GPU
`bfloat16`	2	mixed precision with wider exponent
`int64`	8	token ids, class labels, indices
`bool`	1	masks

A device conversion returns a new tensor unless the tensor is already on the target device with a matching dtype:

x = torch.randn(4)
y = x.to("cuda")

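Here x is still on the CPU and y is on the GPU. When writing training code, avoid this kind of implicit fork by rebinding the moved tensors to the same names: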

batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}

non_blocking=True can only make the copy asynchronous when conditions such as pinned host memory are met; it is not a magic switch.
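The rule that .to() returns a new tensor only when something actually changes can be checked without a GPU. This is a CPU-only sketch; the same identity behaviour is assumed to hold for a tensor that is already on the requested CUDA device.

```python
import torch

x = torch.randn(4)

same = x.to("cpu")                 # already on CPU with the same dtype
moved = x.to(torch.float64)        # dtype changes, so a new tensor is created
forced = x.to("cpu", copy=True)    # copy=True allocates even when nothing changes

print(same is x, moved is x, forced is x)
```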

Dtype Promotion

When two tensors have different dtypes, PyTorch chooses the result dtype according to its promotion rules:

a = torch.ones(3, dtype=torch.float16)
b = torch.ones(3, dtype=torch.float32)
c = a + b
assert c.dtype == torch.float32

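Promotion aims to avoid obvious precision loss, but it can also leave you believing a computation runs in FP16 when some intermediate results have actually been promoted to FP32. Rules of thumb: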

Operation	Typical result
`float16 + float32`	`float32`
`int64 + float32`	`float32`
`bool & bool`	`bool`
reductions like `sum`	may accumulate in promoted dtype depending on op/device

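For losses, softmax, normalization, and large reductions, the choice of dtype matters especially. FP16 has a small exponent range (its largest finite value is 65504) and overflows easily; BF16 has fewer mantissa bits but the same 8 exponent bits as FP32, which makes it better suited to large-model training.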
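These claims can be queried rather than memorized. The sketch below uses torch.result_type, torch.promote_types, and torch.finfo to inspect promotion and numeric limits; the value 70000 is an arbitrary number chosen to lie beyond the FP16 range.

```python
import torch

half = torch.ones(3, dtype=torch.float16)
single = torch.ones(3, dtype=torch.float32)

mixed = torch.result_type(half, single)
int_float = torch.promote_types(torch.int64, torch.float32)

fp16_max = torch.finfo(torch.float16).max
bf16_max = torch.finfo(torch.bfloat16).max
fp16_eps = torch.finfo(torch.float16).eps
bf16_eps = torch.finfo(torch.bfloat16).eps

big = torch.tensor(70000.0)
as_fp16 = big.to(torch.float16)    # beyond the FP16 range: overflows
as_bf16 = big.to(torch.bfloat16)   # within range, but rounded coarsely

print(mixed, int_float, fp16_max, bf16_max, as_fp16, as_bf16)
```

BF16 trades precision (a larger eps) for range (a much larger max), which is the trade-off described above.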

Definition: Dtype Promotion

Dtype promotion is the rule system that chooses the output dtype of an operation from the input dtypes.

Autocast and Explicit Dtype Boundaries

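Mixed-precision training usually uses autocast:

# Recent PyTorch releases spell this torch.amp.GradScaler("cuda").
scaler = torch.cuda.amp.GradScaler()

with torch.autocast(device_type="cuda", dtype=torch.float16):
    logits = model(x)
    loss = criterion(logits, y)

scaler.scale(loss).backward()   # scale the loss to limit FP16 gradient underflow
scaler.step(optimizer)          # unscales gradients; skips the step on inf/NaN
scaler.update()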

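autocast does not turn everything into FP16. It picks a dtype per op according to a policy: matrix multiplications and convolutions lean toward low precision, while softmax, norms, and loss reductions often need more stable precision.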

Pitfall: Casting Labels Breaks Losses

Class labels and token ids should usually stay integer, often torch.long. Do not cast an entire batch to FP16 if it contains labels, indices, or masks with semantic dtypes.
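One way to respect this boundary is to cast only floating-point entries of a batch. The helper below is a sketch (cast_floating and the batch keys are hypothetical names) that relies on Tensor.is_floating_point().

```python
import torch

def cast_floating(batch, dtype):
    """Cast only floating-point tensors; leave ids, labels, and masks alone."""
    return {
        name: value.to(dtype) if value.is_floating_point() else value
        for name, value in batch.items()
    }

batch = {
    "pixels": torch.randn(2, 3),
    "labels": torch.tensor([1, 0]),
    "mask": torch.tensor([True, False]),
}
half_batch = cast_floating(batch, torch.float16)

print({name: value.dtype for name, value in half_batch.items()})
```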

Device Transfer and Synchronization

CUDA kernel launches are usually asynchronous. The code below only enqueues kernels; it does not necessarily wait for the computation to finish:

y = model(x_cuda)

But these operations synchronize the CPU with the GPU:

loss_value = loss.item()
print(tensor_cuda)
arr = tensor_cuda.cpu().numpy()
torch.cuda.synchronize()

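Synchronization points distort timing results: without one, host-side timing mostly measures launch overhead, and a hidden one charges all previously queued work to whichever line triggered it. To time GPU work correctly, either synchronize explicitly or use CUDA events: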

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
y = model(x_cuda)
end.record()
torch.cuda.synchronize()
print(start.elapsed_time(end), "ms")

Pitfall: item() in the Training Loop

Calling .item() every step forces a CPU-GPU synchronization. For logging, aggregate on device or log less frequently when performance matters.
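A possible pattern for aggregating on device is sketched below. The function name, the fake loss values, and log_every are illustrative assumptions; the code runs on CPU, where .item() is cheap, but the structure is the same on GPU.

```python
import torch

def mean_losses(losses, log_every):
    """Accumulate on the losses' device and call .item() once per window."""
    running = torch.zeros((), device=losses[0].device)
    logged = []
    for step, loss in enumerate(losses, start=1):
        running += loss.detach()               # stays on device, no sync
        if step % log_every == 0:
            logged.append((running / log_every).item())  # the only sync point
            running.zero_()
    return logged

fake_losses = [torch.tensor(float(v)) for v in (4, 2, 3, 1)]
logged = mean_losses(fake_losses, log_every=2)

print(logged)
```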

Autograd Metadata

A tensor that participates in autograd also carries:

x.requires_grad
x.grad
x.grad_fn
x.is_leaf

A leaf tensor is usually one that the user created directly with requires_grad=True; the parameters an optimizer updates are leaf parameters too.

w = torch.randn(3, requires_grad=True)
y = (w * 2).sum()
y.backward()
assert w.grad is not None

detach() returns a view that shares storage with the original tensor but is no longer connected to the autograd graph:

x = torch.randn(3, requires_grad=True)
y = x.detach()

If you intend to modify it independently afterwards, the usual spelling is:

y = x.detach().clone()

Pitfall: detach Is Not a Copy

detach() removes autograd history but can still share storage. Use detach().clone() when you need both graph separation and memory independence.
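The aliasing is easy to observe: a write to the detached tensor shows up in the original. A minimal sketch with a zero-initialized tensor:

```python
import torch

x = torch.zeros(3, requires_grad=True)

alias = x.detach()
alias[0] = 5.0                     # writes through to x's storage

independent = x.detach().clone()
independent[1] = 7.0               # x is unaffected

print(x.tolist(), alias.requires_grad)
```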

Leaf, Non-Leaf, and Retained Gradients

By default autograd populates .grad only on leaf tensors. Intermediate tensors take part in the backward pass, but their .grad is normally None:

w = torch.randn(3, requires_grad=True)
h = w * 2
loss = h.square().sum()
loss.backward()

assert w.grad is not None
assert h.grad is None

If you do need to inspect an intermediate gradient while debugging:

w = torch.randn(3, requires_grad=True)
h = w * 2
h.retain_grad()
loss = h.square().sum()
loss.backward()
assert h.grad is not None

Definition: Leaf Tensor

A leaf tensor is a tensor that is created by the user and is not the result of an autograd-tracked operation. Optimizers usually update leaf parameters.
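The is_leaf attribute reports this status directly. The sketch below checks a few representative cases; note that, by convention, a tensor that does not require grad is also reported as a leaf.

```python
import torch

w = torch.randn(3, requires_grad=True)     # created by the user
h = w * 2                                  # result of a tracked operation
c = torch.randn(3) * 2                     # no input requires grad
p = torch.nn.Parameter(torch.randn(3))     # what an optimizer would update

print(w.is_leaf, h.is_leaf, c.is_leaf, p.is_leaf)
```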

`no_grad`, `inference_mode`, and `.data`

torch.no_grad() disables recording of new autograd graph nodes; it is commonly used for validation and inference:

model.eval()
with torch.no_grad():
    logits = model(x)

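torch.inference_mode() is stronger: it also disables some autograd metadata and version tracking, so inference is usually faster, but it is not suitable for creating tensors inside the context that must later take part in training.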

with torch.inference_mode():
    logits = model(x)
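The restriction on inference-mode tensors can be seen in a small experiment. This sketch creates a tensor under inference_mode, tries to use it in an operation whose backward must save it, and then works around the error with clone().

```python
import torch

w = torch.ones(3, requires_grad=True)

with torch.inference_mode():
    t = torch.ones(3) * 2          # an "inference tensor"

try:
    (w * t).sum().backward()       # autograd would need to save t for backward
    rejected = False
except RuntimeError:
    rejected = True

w.grad = None
(w * t.clone()).sum().backward()   # a clone made outside the context is a normal tensor

print(t.is_inference(), rejected, w.grad)
```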

Do not use .data to bypass autograd:

# Bad: may silently corrupt autograd assumptions
param.data.add_(noise)

# Better:
with torch.no_grad():
    param.add_(noise)

Pitfall: .data Can Silently Corrupt Gradients

.data bypasses autograd’s version counter and graph safety checks. Prefer with torch.no_grad(): for intentional parameter updates outside optimizer logic.
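The difference can be demonstrated on an op whose backward reuses its own output. In the sketch below (helper names are illustrative), exp saves its result; mutating that result through .data yields a wrong gradient without any error, whereas the same mutation under no_grad() is caught by the version counter. The mathematically correct gradient of sum(exp(a)) at a = 0 is 1.

```python
import torch

def grad_after(mutate):
    a = torch.zeros(3, requires_grad=True)
    out = a.exp()            # exp's backward reuses the saved output
    mutate(out)
    out.sum().backward()
    return a.grad

def via_data(t):
    t.data.mul_(2)           # bypasses the version counter

def via_no_grad(t):
    with torch.no_grad():
        t.mul_(2)            # still bumps the version counter

silent = grad_after(via_data)          # no error, but the gradient is doubled
try:
    grad_after(via_no_grad)
    caught = False
except RuntimeError:
    caught = True

print(silent, caught)
```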

Multiple Backward Calls

By default, the autograd graph is freed after backward():

loss = model(x).sum()
loss.backward()
# loss.backward()  # would fail without retain_graph=True

If you really need to backpropagate through the same graph more than once:

loss.backward(retain_graph=True)

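But this keeps the intermediate activations alive and uses more GPU memory. Most training loops should not need it; if you find yourself writing retain_graph=True often, it usually means the loss is organized badly or a detach boundary is misplaced.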
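When two losses share part of a graph, one common reorganization is to sum them and call backward() once. A minimal sketch with made-up losses:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
h = w.square()                   # shared intermediate

loss_a = h.sum()
loss_b = (3 * h).sum()

# One backward pass over the combined loss; no retain_graph needed.
(loss_a + loss_b).backward()

print(w.grad)
```

Here the total is 4 * sum(w**2), so the gradient is 8 * w.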

In-Place Operations

Functions with a trailing underscore in PyTorch are usually in-place:

x.add_(1)
x.relu_()

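In-place operations can save memory, but they affect the intermediate values autograd has saved. If a backward function needs the original value and you overwrote it beforehand, PyTorch raises a version counter error.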

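Safer rules:

inside an optimizer step, parameters can be updated in place;
for an in-place op on an activation, confirm that the corresponding backward supports it;
an in-place op on a leaf tensor that requires grad raises an error outside torch.no_grad(), so be especially careful there;
when debugging, prefer the out-of-place version.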

Version Counter Intuition

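When autograd saves a tensor needed for backward, it records the tensor's version counter. Every in-place modification increments the version. If backward finds a version mismatch, a value it needs has been changed: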

x = torch.randn(4, requires_grad=True)
y = x.sigmoid()
y.add_(1.0)
# y.sum().backward() raises RuntimeError: sigmoid's backward needs the original y

This kind of error is not PyTorch being "too strict"; it protects the mathematical semantics. For example, the sigmoid backward needs

\[ \frac{d\sigma}{dx} = \sigma(x)(1-\sigma(x)), \]

If you overwrite \(\sigma(x)\) in place, the gradient can no longer be computed correctly.

Pitfall: In-Place Activation Is a Contract

relu_() may be safe in some networks, but in-place ops are only safe when backward does not need the overwritten value or PyTorch has an implementation that handles it.
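The contract can be seen side by side. In this sketch, relu_() overwrites the output of a scalar multiply, whose backward does not need that output, while add_() overwrites the output of sigmoid, whose backward does.

```python
import torch

x = torch.tensor([-1.0, 2.0], requires_grad=True)
y = x * 2            # backward of a scalar multiply does not need y
y.relu_()            # so overwriting y in place is fine here
y.sum().backward()
relu_grad = x.grad.clone()

x2 = torch.tensor([-1.0, 2.0], requires_grad=True)
s = x2.sigmoid()     # backward needs the value of s itself
s.add_(1.0)          # overwrites it and bumps the version counter
try:
    s.sum().backward()
    failed = False
except RuntimeError:
    failed = True

print(relu_grad, failed)
```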

Minimal Checks

When writing tensor-heavy code, consider adding checks like these at key boundaries:

def check_batch(x, *, device, dtype, ndim):
    assert x.device.type == device
    assert x.dtype == dtype
    assert x.dim() == ndim
    assert x.isfinite().all()

For shape/stride:

def describe(x):
    return {
        "shape": tuple(x.shape),
        "stride": x.stride(),
        "dtype": str(x.dtype),
        "device": str(x.device),
        "contiguous": x.is_contiguous(),
        "offset": x.storage_offset(),
        "data_ptr": x.untyped_storage().data_ptr(),
    }

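These checks look trivial, but they catch the most common errors before the model saturates the GPU: wrong dtype, wrong device, wrong mask shape, an accidental copy, or a non-contiguous tensor entering a kernel that only supports contiguous input.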

Implementation Checklist

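When writing tensor-heavy code, check:

whether copy / alias / detach semantics are explicit when a tensor is constructed;
whether shape, stride, and storage offset match expectations;
whether slicing / indexing triggers an unexpected copy;
whether the tensor is contiguous before .view();
whether .contiguous() or .reshape() is needed after permute;
whether broadcasting produced a stride-0 expanded view;
whether dtype promotion matches mixed-precision expectations;
whether labels, indices, and masks keep their semantic dtypes;
whether each CPU-GPU transfer is necessary and whether it can be asynchronous;
whether the training loop has frequent .item() / .cpu() synchronizations;
whether .grad on leaf parameters is populated as expected;
whether intermediate-gradient debugging uses retain_grad();
whether inference uses no_grad() or inference_mode();
whether .data and dangerous in-place modifications are avoided;
whether custom kernels explicitly declare their layout/dtype/device contract.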
