4.5 Decoder-Only Transformers and GPT-2

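GPT-2 is a good anchor for understanding modern LLMs. It is neither the largest nor the strongest model today, but it fixed many of the core conventions that later models inherited: decoder-only architecture, causal self-attention, byte-level BPE, next-token prediction, KV cache, prompt-as-task, and zero-shot transfer. The point of studying GPT-2 is not to memorize an old model, but to learn to decompose an LLM into three layers:

architecture: token embedding, position embedding, causal Transformer blocks, LM head;
objective: use the chain rule to turn any text sequence into next-token prediction;
training/inference mechanics: padding, label shifting, loss mask, packing, KV cache, sampling.

Saying only that "GPT-2 is a decoder-only Transformer" has almost no teaching value. What matters is that the decoder-only mask defines an autoregressive probability factorization, and every training-engineering detail must be consistent with that factorization.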

From Language Modeling to Decoder-Only

Definition: Autoregressive Language Model

An autoregressive language model represents a token sequence \(x_{1:T}\) by the chain-rule factorization \[ p_\theta(x_{1:T}) = \prod_{t=1}^{T}p_\theta(x_t\mid x_{<t}). \] The model is trained to predict each next token from its prefix.

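This definition does not require the model to be a Transformer; an RNN can also be an autoregressive LM. The Transformer's advantage is that training can compute the hidden states of all positions in parallel, while a causal mask guarantees that position \(t\) cannot peek at future tokens.

Take the sentence

The cat sat

as an example. If the tokenizer produces

\[ x=[\texttt{The},\texttt{ cat},\texttt{ sat},\texttt{<eos>}], \]

then the training objective is not to make the model "understand the whole sentence", but to split it into a chain of conditional distributions: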

\[ p(x) = p(\texttt{The}) p(\texttt{ cat}\mid\texttt{The}) p(\texttt{ sat}\mid\texttt{The},\texttt{ cat}) p(\texttt{<eos>}\mid\texttt{The},\texttt{ cat},\texttt{ sat}). \]

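In implementations, the hidden state \(h_t\) at position \(t\) usually predicts \(x_{t+1}\). The training data therefore looks like identical input_ids and labels, but the loss shifts them internally:

\[ \operatorname{logits}_{1:T-1} \rightarrow x_{2:T}. \]

This is where many beginners get confused when training a GPT for the first time: labels=input_ids does not ask the model to copy the current token; the model class offsets logits and labels by one position internally. Note that under this shift the first factor \(p(x_1)\) has no logit of its own unless some start-of-text token precedes the sequence.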

Theorem: Causal LM Objective Is Maximum Likelihood

For a dataset \(\mathcal{D}=\{x^{(n)}_{1:T_n}\}_{n=1}^N\), minimizing next-token cross entropy is equivalent to maximizing the autoregressive likelihood \[ \prod_{n=1}^{N}\prod_{t=1}^{T_n}p_\theta(x_t^{(n)}\mid x_{<t}^{(n)}). \]

Proof

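For a single sequence, the negative log-likelihood is

\[ -\log p_\theta(x_{1:T}) = -\sum_{t=1}^{T}\log p_\theta(x_t\mid x_{<t}). \]

If the model outputs logits \(z_t\in\mathbb{R}^{|\mathcal{V}|}\) at each position, softmax gives

\[ p_\theta(v\mid x_{<t}) = \frac{\exp z_{t,v}}{\sum_{u\in\mathcal{V}}\exp z_{t,u}}. \]

The cross entropy on the true token \(x_t\) is

\[ \operatorname{CE}_t = -\log p_\theta(x_t\mid x_{<t}). \]

So summing the cross entropy over all tokens gives the negative log-likelihood of the whole sequence, and summing over all sequences gives the negative log of the dataset likelihood. Minimizing cross entropy is therefore equivalent to maximizing likelihood; averaging instead of summing only rescales the objective by a constant.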
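As a small numeric check of this identity, the sketch below uses made-up logits over a hypothetical 4-token vocabulary (the numbers and the helper name log_softmax are illustrative assumptions, not values from GPT-2). It sums the per-token cross entropies and compares the result with the negative log of the product of conditional probabilities.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

# Hypothetical logits z_t for a 3-token sequence over a 4-token vocabulary.
logits_per_step = [
    [2.0, 0.5, -1.0, 0.0],
    [0.1, 1.5, 0.3, -0.7],
    [-0.2, 0.0, 2.2, 0.4],
]
targets = [0, 1, 2]  # the true token id x_t at each step

# Sum of per-token cross entropies.
ce_sum = sum(-log_softmax(z)[y] for z, y in zip(logits_per_step, targets))

# Product of the conditional probabilities, i.e. the sequence likelihood.
likelihood = 1.0
for z, y in zip(logits_per_step, targets):
    likelihood *= math.exp(log_softmax(z)[y])

print(ce_sum, -math.log(likelihood))  # the two numbers agree up to rounding
```

Working in log space, as ce_sum does, is also what keeps the computation stable for long sequences, where the raw product of probabilities would underflow.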

GPT-2 as a Concrete Architecture

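OpenAI's GPT-2 can be seen as a direct scale-up of GPT-1. It keeps the decoder-only causal Transformer backbone, but uses more data, a longer context (1024 instead of 512 tokens), byte-level BPE, and moves LayerNorm to the input side of each sub-block, which is what is now usually called the Pre-LN style.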

Definition: Decoder-Only Transformer

A decoder-only Transformer is a stack of causal self-attention blocks. At position \(t\), each block can only read positions \(\leq t\), and the final hidden state is projected to vocabulary logits for next-token prediction.

The four public GPT-2 sizes can be summarized as:

Model	Parameters	Layers	Hidden size \(d\)	Heads	Context
GPT-2 small	117M	12	768	12	1024
GPT-2 medium	345M	24	1024	16	1024
GPT-2 large	762M	36	1280	20	1024
GPT-2 XL	1542M	48	1600	25	1024

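The important structural relation behind these numbers is:

\[ d_h=\frac{d}{H}, \]

where \(H\) is the number of attention heads and \(d_h\) is the per-head dimension. For example, GPT-2 small has \(d=768,H=12\), so each head is \(64\)-dimensional.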
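The same relation can be checked for every row of the size table. The following sketch only re-uses the \(d\) and \(H\) values listed above; the dictionary name configs is an illustrative choice.

```python
# (d, H) pairs taken from the size table above.
configs = {
    "small": (768, 12),
    "medium": (1024, 16),
    "large": (1280, 20),
    "xl": (1600, 25),
}

head_dims = {}
for name, (d, n_head) in configs.items():
    # The reshape into heads only works if d is divisible by H.
    assert d % n_head == 0, f"{name}: d={d} is not divisible by H={n_head}"
    head_dims[name] = d // n_head

print(head_dims)  # every size ends up with 64-dimensional heads
```

Wider GPT-2 models therefore add heads rather than widening each head.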

GPT-2 Config as Tensor Contracts

A GPT-2 config is not a list of numbers for display; it is a contract that directly determines tensor shapes. Taking GPT-2 small as an example:

Field	Value	Tensor implication
`vocab_size`	50,257	token embedding rows and LM-head classes
`n_positions`	1,024	maximum learned absolute position ids
`n_layer`	12	number of repeated Transformer blocks
`n_embd`	768	residual stream width
`n_head`	12	attention heads
`head_dim`	64	`n_embd / n_head`
`n_inner`	3,072	MLP hidden width, usually `4 * n_embd`

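So whether a checkpoint loads depends not only on the total parameter count, but on whether these shapes match item by item. If n_head changes but n_embd does not, the parameter matrix shapes can still match, but the semantics of the head reshape change. If vocab_size changes, the vocabulary dimension of the token embedding and the LM head changes. If n_positions changes, the absolute position embedding table must change too.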

Pitfall: Same Hidden Size, Different Head Semantics

For GPT-2-style attention, the projection matrix shape can remain [d, 3d] even if n_head changes, but the reshape into heads changes. Tensor loading may succeed while attention geometry becomes incompatible.

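GPT-2 code also contains a name that is easy to misread: Conv1D. It is not a convolution, but a linear-layer wrapper that stores its weight as [in_dim, out_dim]. Many migration scripts therefore transpose between PyTorch nn.Linear and GPT-2 Conv1D.

GPT-2 Conv1D weight: [in_dim, out_dim]
PyTorch Linear weight: [out_dim, in_dim]

This explains why some checkpoint conversions contain operations such as c_attn.weight.T and c_proj.weight.T. A missing transpose on a non-square matrix such as c_attn shows up as a shape mismatch, but on a square matrix such as the attention c_proj the shapes still line up and the projection is silently wrong.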
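The two storage layouts can be compared on a tiny example. This is a minimal sketch in plain Python lists rather than real framework tensors; matmul and transpose are hypothetical helpers written here only for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Swap rows and columns of a list-of-rows matrix."""
    return [list(col) for col in zip(*a)]

x = [[1.0, 2.0]]                      # one input row, in_dim = 2
w_conv1d = [[1.0, 0.0, 2.0],          # Conv1D layout: [in_dim=2, out_dim=3]
            [0.0, 1.0, 3.0]]

# Conv1D-style forward: y = x @ W
y_conv1d = matmul(x, w_conv1d)

# Linear-style forward: weight stored as [out_dim, in_dim], y = x @ W_lin^T
w_linear = transpose(w_conv1d)        # the transpose done at conversion time
y_linear = matmul(x, transpose(w_linear))

print(y_conv1d, y_linear)  # [[1.0, 2.0, 8.0]] [[1.0, 2.0, 8.0]]
```

Both layouts describe the same linear map; only the stored orientation differs, which is why a conversion script needs exactly one transpose per weight.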

Parameter Accounting

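Estimate the parameter count of one block using GPT-2 small. Let the hidden width be \(d\) and the MLP width be \(m=4d\).

The attention part uses a fused QKV projection:

\[ W_{\text{qkv}}\in\mathbb{R}^{d\times 3d}, \qquad b_{\text{qkv}}\in\mathbb{R}^{3d}. \]

Output projection:

\[ W_o\in\mathbb{R}^{d\times d}, \qquad b_o\in\mathbb{R}^{d}. \]

The two MLP layers, each with a bias:

\[ W_1\in\mathbb{R}^{d\times m}, \quad b_1\in\mathbb{R}^{m}, \qquad W_2\in\mathbb{R}^{m\times d}, \quad b_2\in\mathbb{R}^{d}. \]

The two LayerNorms each have a scale and a bias, which gives \(4d\) parameters in total. The parameter count of a single block is: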

\[ \begin{aligned} N_{\text{block}} &= (3d^2+3d) + (d^2+d) + (dm+m) + (md+d) + 4d\\ &= 12d^2+13d \qquad (m=4d). \end{aligned} \]

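For GPT-2 small, \(d=768\):

\[ N_{\text{block}} = 12\cdot768^2+13\cdot768 \approx 7.09\text{M}. \]

Twelve layers give about \(85\)M. Embedding parameters:

\[ N_{\text{tok}}=50257\cdot768\approx38.6\text{M}, \qquad N_{\text{pos}}=1024\cdot768\approx0.79\text{M}. \]

If the LM head is tied to the token embedding, no extra \(50257\cdot768\) output matrix is added. This shows that the bulk of GPT-2 small comes from two parts: the stacked blocks and the vocabulary embedding.

Under this state-dict accounting, and including the \(2d\) parameters of the final LayerNorm, the parameter total of GPT-2 small comes out closer to \(124\)M; the "117M" common in the paper and in model names is a historical reporting figure. When reading a model, do not just memorize the name; sum directly over the config and the state_dict.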
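The accounting above can be turned into a few lines of arithmetic. The sketch below assumes a tied LM head and a final LayerNorm with scale and bias; the function name gpt2_param_count is hypothetical, and the config values are the GPT-2 small ones from the table.

```python
def gpt2_param_count(vocab_size, n_positions, n_layer, d):
    """Count parameters from tensor shapes, assuming a tied LM head."""
    m = 4 * d
    attn = (d * 3 * d + 3 * d) + (d * d + d)   # fused QKV + output projection
    mlp = (d * m + m) + (m * d + d)            # two linear layers with biases
    norms = 2 * (2 * d)                        # ln_1 and ln_2: scale + bias
    block = attn + mlp + norms                 # equals 12*d^2 + 13*d
    embeddings = vocab_size * d + n_positions * d
    final_norm = 2 * d                         # LayerNorm after the last block
    return n_layer * block + embeddings + final_norm

small = gpt2_param_count(vocab_size=50257, n_positions=1024, n_layer=12, d=768)
print(f"{small:,}")  # 124,439,808
```

Comparing such a shape-derived total with the sum over a loaded state_dict is a quick way to notice an untied head, a resized vocabulary, or a different n_positions.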

Definition: Parameter Accounting

Parameter accounting means deriving parameter counts from tensor shapes rather than quoting a model size. It is the fastest way to detect config/checkpoint mismatches.

One Forward Pass, With Shapes

Assume batch size \(B=2\), sequence length \(T=8\), hidden size \(d=768\), and \(H=12\) heads. The input token ids are

\[ X\in\{0,\ldots,|\mathcal{V}|-1\}^{B\times T}. \]

Embedding

GPT-2 uses a token embedding and an absolute position embedding:

\[ h_t^{(0)} = E[x_t]+P[t], \qquad E\in\mathbb{R}^{|\mathcal{V}|\times d}, \qquad P\in\mathbb{R}^{1024\times d}. \]

So the initial hidden states have shape

\[ H^{(0)}\in\mathbb{R}^{2\times 8\times 768}. \]

The absolute position embedding directly affects the padding strategy. Right padding and left padding look like a mere choice of where <pad> goes, but they change the position index that real tokens receive.

One GPT-2 Block

The Pre-LN form of a GPT-2 block can be written as:

\[ \tilde{H}^{(\ell)} = H^{(\ell)} + \operatorname{MHA}\left(\operatorname{LN}(H^{(\ell)})\right), \]

\[ H^{(\ell+1)} = \tilde{H}^{(\ell)} + \operatorname{MLP}\left(\operatorname{LN}(\tilde{H}^{(\ell)})\right). \]

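The GPT-2 MLP uses GELU. With the row-vector convention used for the other projections in this section:

\[ \operatorname{MLP}(x) = \operatorname{GELU}(xW_1+b_1)W_2+b_2. \]

The usual setting is \(d_{\text{ff}}=4d\) (the MLP width called \(m\) in the parameter accounting), so the MLP hidden width of GPT-2 small is \(3072\).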

Attention Projection

For each layer, first generate \(Q,K,V\) from the normalized hidden states:

\[ Q=HW^Q,\qquad K=HW^K,\qquad V=HW^V, \]

where

\[ Q,K,V\in\mathbb{R}^{B\times T\times d}. \]

After reshaping into multiple heads:

\[ Q,K,V\in\mathbb{R}^{B\times H\times T\times d_h} = \mathbb{R}^{2\times12\times8\times64}. \]

Each head performs scaled dot-product attention:

\[ A = \frac{QK^\top}{\sqrt{d_h}}+M_{\text{causal}}+M_{\text{pad}}, \]

\[ \operatorname{head} = \operatorname{softmax}(A)V. \]

The causal mask keeps only the lower triangle, including the diagonal:

\[ M_{\text{causal},ij} = \begin{cases} 0, & j\leq i,\\ -\infty, & j>i. \end{cases} \]

So position \(i\) can never read positions \(i+1,i+2,\ldots\).

GPT-2 implementations usually do not store Wq/Wk/Wv separately; a single fused projection produces all three at once:

qkv = c_attn(x)              # [B, T, 3 * d]
q, k, v = qkv.split(d, dim=-1)
q = q.view(B, T, H, Dh).transpose(1, 2)
k = k.view(B, T, H, Dh).transpose(1, 2)
v = v.view(B, T, H, Dh).transpose(1, 2)

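After transpose(1, 2) the layout is [B, H, T, Dh]. If a later kernel requires contiguous memory, you may need .contiguous() or an attention kernel that supports strides. In the teaching formulas these are all the same tensor; in engineering, stride and layout directly affect speed and kernel choice.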

Contract: GPT-2 Fused QKV Split

The fused c_attn output must be split as [q, k, v] along the last dimension before reshaping into heads. Splitting after head reshape or using the wrong order silently changes the attention computation.
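To see why the order of split and reshape matters, the following sketch traces where each element of a fused vector ends up. It uses a toy size (\(d=4\), two heads) and plain Python lists; the helper to_heads and the variable names are illustrative assumptions, not GPT-2 source code.

```python
d, n_head = 4, 2
head_dim = d // n_head

# Fused projection output for one token, length 3 * d. The values are just
# indices so that the origin of every element stays visible.
fused = list(range(3 * d))

def to_heads(vec):
    """Reshape a length-d vector into n_head chunks of head_dim."""
    return [vec[h * head_dim:(h + 1) * head_dim] for h in range(n_head)]

# GPT-2 order: split into q, k, v first, then reshape each into heads.
q, k, v = fused[:d], fused[d:2 * d], fused[2 * d:]
q_heads = to_heads(q)

# Wrong order: reshape into heads first, then split each head into q, k, v.
per_head = [fused[h * 3 * head_dim:(h + 1) * 3 * head_dim] for h in range(n_head)]
q_heads_wrong = [chunk[:head_dim] for chunk in per_head]

print(q_heads, q_heads_wrong)  # [[0, 1], [2, 3]] [[0, 1], [6, 7]]
```

Both results have the same shape, so nothing fails at runtime; only the second head's query is built from elements that the checkpoint trained as keys and values.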

Proof: Why Adding \(-\infty\) Masks Future Tokens

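The \(j\)-th softmax term is

\[ \alpha_j = \frac{\exp(a_j)}{\sum_k\exp(a_k)}. \]

If the logit of a future position \(j>i\) is replaced with \(a_j-\infty\), then

\[ \exp(a_j-\infty)=0. \]

So the attention probability of a masked position is exactly \(0\). This is not "a very small weight"; mathematically, the future tokens are removed from the normalized distribution. The denominator stays positive because the diagonal entry \(j=i\) is never masked.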
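A short numeric illustration of this argument, written with the standard library only. The score matrix is made up, and causal_softmax_rows is a hypothetical helper; it follows the formula literally by adding \(-\infty\) above the diagonal before the softmax.

```python
import math

NEG_INF = float("-inf")

def causal_softmax_rows(scores):
    """Apply a causal mask and a row-wise softmax to a square score matrix."""
    n = len(scores)
    out = []
    for i in range(n):
        # Keep the score for j <= i and use -inf for j > i.
        masked = [scores[i][j] if j <= i else NEG_INF for j in range(n)]
        m = max(masked)                       # finite: the diagonal is kept
        exps = [math.exp(a - m) for a in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

scores = [[0.3, 2.0, -1.0],
          [1.2, 0.1, 4.0],
          [0.5, 0.5, 0.5]]
weights = causal_softmax_rows(scores)
for row in weights:
    print(row)
```

The large score 4.0 in the second row sits above the diagonal, so it receives weight exactly 0.0 and has no influence on the remaining weights of that row.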

LM Head and Weight Tying

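The hidden states of the last layer pass through the final LayerNorm and are then projected to the vocabulary. Writing \(h_t^{(L)}\) for the normalized final state and \(E\) for the token embedding matrix:

\[ z_t = h_t^{(L)}E^\top, \qquad z_t\in\mathbb{R}^{|\mathcal{V}|}. \]

A common practice here is weight tying: the output layer shares its weight with the token embedding. Intuitively, the embedding matrix defines both "how to read a token" and "how to write a token distribution".

At position \(t\), the model outputs the distribution of the next token:

\[ p_\theta(x_{t+1}=v\mid x_{\leq t}) = \operatorname{softmax}(z_t)_v. \]

So a training example of length \(T\) usually computes the loss on only the first \(T-1\) logits:

\[ \mathcal{L} = -\sum_{t=1}^{T-1} \log p_\theta(x_{t+1}\mid x_{\leq t}). \]

If <eos> is explicitly appended to the sequence, the model also learns when to stop.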

Derivation: Softmax Cross-Entropy Gradient

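Let

\[ p_i=\frac{e^{z_i}}{\sum_j e^{z_j}}, \qquad \mathcal{L}=-\log p_y. \]

Then

\[ \mathcal{L} = -z_y+\log\sum_j e^{z_j}. \]

Differentiating with respect to logit \(z_i\):

\[ \frac{\partial\mathcal{L}}{\partial z_i} = -\mathbf{1}[i=y] + \frac{e^{z_i}}{\sum_j e^{z_j}} = p_i-\mathbf{1}[i=y]. \]

So gradient descent on the cross entropy pushes the logit of the true token up and pushes every other logit down in proportion to the probability the model currently assigns to it. This derivation matters because it explains why next-token prediction can be trained with an ordinary classification loss.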
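The closed form \(p_i-\mathbf{1}[i=y]\) can be verified against finite differences. The logits below are arbitrary illustrative values and the helper names are assumptions; only the standard library is used.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(z, y):
    """Cross entropy of logits z against the true class index y."""
    return -math.log(softmax(z)[y])

z = [1.0, -0.5, 2.0, 0.3]   # hypothetical logits
y = 2                        # index of the true token

# Analytic gradient from the derivation: p_i - 1[i == y].
p = softmax(z)
analytic = [p[i] - (1.0 if i == y else 0.0) for i in range(len(z))]

# Central finite differences as an independent check.
eps = 1e-6
numeric = []
for i in range(len(z)):
    hi = z[:i] + [z[i] + eps] + z[i + 1:]
    lo = z[:i] + [z[i] - eps] + z[i + 1:]
    numeric.append((cross_entropy(hi, y) - cross_entropy(lo, y)) / (2 * eps))

print(analytic)
print(numeric)
```

The entries of the gradient sum to zero, and only the true class has a negative entry, so a descent step raises that logit and lowers all the others.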

Weight Tying Gradient Flow

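If the LM head uses a tied embedding, the same matrix \(E\) appears on both the input and the output side:

\[ h_t^{(0)}=E[x_t]+P[t], \qquad z_t=h_t^{(L)}E^\top. \]

So the gradient of \(E\) has two sources:

input-side gradient: the gradient on the corresponding embedding row when a token is read in as context;
output-side gradient: the gradient of the softmax CE at every predicted position, which reaches all vocabulary rows.

This is why token frequency affects embedding learning. High-frequency tokens often appear as inputs and also often appear as targets; low-frequency tokens receive few input-side updates, but on the output side they still receive a gradient that scales with the model probability in every softmax. In practice, optimizer state, weight decay, embedding tying, and vocab size all affect the learning dynamics of this table.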

Pitfall: Resizing Tied Embeddings Must Preserve Tying

After adding tokens or pad ids, resizing embeddings must keep the LM head and input embedding tied if the checkpoint expects tying. Otherwise input and output token spaces drift apart.

GPT-2 Input Representation

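GPT-2 uses byte-level BPE with a vocabulary size of \(50{,}257\). This matters more than "it uses BPE", because byte-level BPE solves two problems:

it can represent any Unicode string without needing <unk>;
its sequences are shorter than those of a pure byte-level LM, and common word pieces are easier to model.

For example:

unbelievable

might be split into something like

un | believable

while rare characters, emoji, and mixed-language text can still fall back to a byte-level representation. For an open web corpus such as WebText, this tokenizer is more robust than a fixed word vocabulary.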
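The "no <unk> needed" property comes from the base alphabet being bytes. The sketch below is not the GPT-2 tokenizer and performs no BPE merges; it only shows, with the standard library, that an arbitrary (here made-up) mixed-script string reduces losslessly to values in 0..255, which is the fallback level the text refers to.

```python
text = "unbelievable café 🐱 你好"

# Every Unicode string has a UTF-8 encoding, so it can always be written
# with a base alphabet of only 256 byte values.
byte_ids = list(text.encode("utf-8"))

print(len(text), len(byte_ids))  # characters vs. bytes
print(byte_ids[:12])             # the ASCII word maps to one byte per letter

# The mapping is lossless: decoding the bytes recovers the original string.
assert bytes(byte_ids).decode("utf-8") == text
```

Non-ASCII characters take several bytes each, which is why pure byte-level sequences are long and why learned merges on top of bytes are worth their cost.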

Pitfall: Token Is Not Word

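One prediction step of an LLM predicts a token, not a natural-language word. An English word may be a single token or several tokens; the segmentation of Chinese, code, and rare strings is even less intuitive.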

Training Dataset and Objective

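GPT-2's training corpus, WebText, comes from web pages filtered through human-curated links (in the paper, outbound Reddit links that received at least 3 karma). The goal is for the language model to pick up task demonstrations from large-scale natural text. The key idea is that many NLP tasks can be written as a natural-language sequence.

For example, translation can be written as: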

Translate English to French:
English: I love cats.
French: J'aime les chats.

Question answering can be written as:

Passage: ...
Question: ...
Answer: ...

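If the model does next-token prediction on such text, it is never explicitly told "this is a translation task" or "this is a question answering task", but to predict the continuation it must learn to infer the task format from the context.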

Theorem: Supervised Completion Is a Subset of Causal LM

Suppose a supervised example is serialized as a sequence \[ s=(\text{instruction},\text{input},\text{output}). \] The supervised conditional objective on output tokens is a subset of the causal language modeling objective on the whole serialized sequence.

Proof

Let the output tokens occupy positions \(a,\ldots,b\). The supervised completion objective is

\[ \mathcal{L}_{\text{sup}} = -\sum_{t=a}^{b} \log p_\theta(x_t\mid x_{<t}). \]

The full causal LM objective is

\[ \mathcal{L}_{\text{lm}} = -\sum_{t=1}^{T} \log p_\theta(x_t\mid x_{<t}). \]

Clearly \(\mathcal{L}_{\text{sup}}\) is the partial sum of \(\mathcal{L}_{\text{lm}}\) over the output positions. The difference is that the LM also learns the distribution of the instruction and the input; supervised fine-tuning often uses a loss mask to keep only the loss on output tokens.

This is the core of GPT-2's "unsupervised multitask learner" narrative: the task is not written into the architecture, but into the text format.
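The relation between the two objectives in the proof can be illustrated with a loss mask over per-token losses. The per-token values below are invented for the example, and the 3/2/3 split into instruction, input, and output tokens is an assumption.

```python
# Hypothetical per-token negative log-likelihoods -log p(x_t | x_<t) for a
# serialized example: 3 instruction tokens, 2 input tokens, 3 output tokens.
token_nll = [2.1, 1.7, 0.9, 1.3, 0.8, 0.6, 0.4, 0.2]
is_output = [False, False, False, False, False, True, True, True]

# Full causal LM loss: sum over every position.
loss_lm = sum(token_nll)

# Supervised completion loss: the same terms, kept only on the output span.
loss_sup = sum(nll for nll, keep in zip(token_nll, is_output) if keep)

# What the loss mask removes is the loss on instruction and input tokens.
loss_prompt = sum(nll for nll, keep in zip(token_nll, is_output) if not keep)

print(loss_sup, loss_prompt, loss_lm)
```

The mask changes which terms are summed, not how each term is computed: every output token is still conditioned on the full instruction and input prefix.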

WebText Blocks and Sample Construction

GPT-2 pretraining is closer to "cut one large token stream of documents into blocks of at most 1024 tokens" than to padding each web page separately into a batch. Conceptually:

doc_a tokens <eos> doc_b tokens <eos> doc_c tokens ...
-> blocks of length 1024

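The benefit is almost no padding and high GPU token utilization; the cost is that a block may span document boundaries. <eos> plays two roles here:

as a sequence end token, it teaches the model when to stop;
as a document boundary hint, it tells the model that a new context may start afterwards.

If the training pipeline switches to instruction examples, this approach cannot simply be reused. In instruction data the prompt and the answer have an explicit supervision boundary, and the loss is often computed only on the assistant span; when multiple conversations are packed into one block, you also have to decide whether cross-conversation attention is allowed.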

Definition: Training Block

A training block is the fixed-length token window passed to the model during pretraining. It may be a slice of a long token stream, a padded example, or a packed group of shorter examples.

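A common pretraining block builder can be written as:

def build_blocks(docs, eos_id, block_size):
    # Concatenate all documents into one token stream, separated by <eos>.
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)

    # Cut the stream into full blocks. The "+ 1" keeps a final block that
    # ends exactly at the end of the stream; a shorter remainder is dropped.
    blocks = []
    for start in range(0, len(stream) - block_size + 1, block_size):
        ids = stream[start : start + block_size]
        blocks.append({"input_ids": ids, "labels": ids.copy()})
    return blocks

This code omits shuffle, document sampling, remainder handling, and distributed sharding, but it expresses the core of GPT-style pretraining: the target comes from the next position in the token stream, not from a hand-written label.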

Label Shift, Attention Mask, and Loss Mask

Three kinds of mask/shift are easily confused when training a decoder-only LM:

Mechanism	Operates on	Purpose
causal mask	attention logits	prevent attending to future tokens
attention mask / padding mask	attention logits	prevent reading padding tokens
label mask	loss	exclude padding or prompt tokens from the loss

Label Shift

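Given

input_ids = [A, B, C, D]
labels    = [A, B, C, D]

the model internally usually computes:

shift_logits = logits[:, :-1, :]
shift_labels = labels[:, 1:]

That is:

Hidden state	Predicts
\(h_A\)	\(B\)
\(h_B\)	\(C\)
\(h_C\)	\(D\)

The last hidden state has no next label, so it contributes nothing to the next-token loss of this sequence. If you append <eos>, \(h_D\) gains the target <eos>, and it is then the hidden state of <eos> that has no label.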

Label Mask

In Hugging Face/PyTorch training, -100 is commonly used as the ignore index:

input_ids = [A, B, C, <pad>, <pad>]
labels    = [A, B, C, -100,  -100]

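After the shift, any position whose label is -100 is excluded from the loss. This differs from the attention mask: the attention mask controls "what the model can see", while the label mask controls "which positions backpropagate a loss".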
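Shift and ignore index can be combined in one small function. This is a dependency-free sketch of the idea, not the Hugging Face implementation; shifted_lm_loss and the uniform toy logits are assumptions chosen so that the expected value is easy to reason about.

```python
import math

IGNORE_INDEX = -100

def shifted_lm_loss(logits, labels):
    """Mean next-token cross entropy with a one-position shift."""
    total, count = 0.0, 0
    # logits[t] is scored against labels[t + 1]; the last logit is unused.
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue                      # masked position: no loss term
        m = max(step_logits)
        lse = m + math.log(sum(math.exp(z - m) for z in step_logits))
        total += lse - step_logits[target]
        count += 1
    return total / max(count, 1), count

# Vocabulary of size 3; uniform logits make every term equal to log(3).
logits = [[0.0, 0.0, 0.0] for _ in range(5)]
labels = [0, 1, 2, IGNORE_INDEX, IGNORE_INDEX]   # A, B, C, <pad>, <pad>

loss, n_terms = shifted_lm_loss(logits, labels)
print(loss, n_terms)
```

Five input positions yield only two loss terms here: the shift removes one, and the two masked labels remove two more.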

Pitfall: Masking Attention Is Not Masking Loss

If a padding token is hidden from attention but its label is still a real class id, the model will still be trained to predict padding. Correct causal LM batching usually needs both an attention mask and a label mask.

Padding: The Two Sides That Matter

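The original GPT-2 tokenizer has no dedicated pad token. A common practice is to set

tokenizer.pad_token = tokenizer.eos_token

and then use attention_mask and labels=-100 to distinguish "a real eos" from "a pad added for batching". It looks awkward, but it is very common with the GPT-2 family.

More importantly, training usually pads on the right, while batched generation often pads on the left. This is not a matter of taste; it follows from the absolute position embedding together with the way generation APIs read the next-token logits.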

Right Padding for Training

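During training the samples in a batch have different lengths, and they are usually right-padded:

row 1: [A, B, C, D, <eos>]
row 2: [E, F, <eos>, <pad>, <pad>]

The attention mask is:

row 1: [1, 1, 1, 1, 1]
row 2: [1, 1, 1, 0, 0]

The labels are:

row 1: [A, B, C, D, <eos>]
row 2: [E, F, <eos>, -100, -100]

The benefit of right padding is that the position ids of real tokens match those of the same sequence fed alone:

row 2 real positions: E -> 0, F -> 1, <eos> -> 2

This matters for GPT-2 because it uses absolute position embeddings. If you switch to left padding without recomputing position ids, E ends up at position 2 and F at position 3. The model then sees not the original short sentence but "a short sentence preceded by two empty positions".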

Left Padding for Batched Generation

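At generation time, especially with batched prompts:

prompt 1: [A, B, C, D]
prompt 2: [E, F]

With right padding:

row 1: [A, B, C, D]
row 2: [E, F, <pad>, <pad>]

many generation loops take the last column of logits:

next_logits = logits[:, -1, :]

For row 2 the last column corresponds to a <pad> position, not to the position after F. The generation condition of the short prompt is then misaligned.

So batched generation usually pads on the left:

row 1: [A, B, C, D]
row 2: [<pad>, <pad>, E, F]

Now the last column of every row is the end of the real prompt, and logits[:, -1, :] really is the distribution of the "next token".

But GPT-2's absolute position embedding adds one more requirement: with left padding the position ids must be corrected so that real tokens are numbered from \(0\):

attention_mask row 2: [0, 0, 1, 1]
naive positions:      [0, 1, 2, 3]
correct positions:    [0, 0, 0, 1]

They can be built with the following logic:

position_ids = attention_mask.long().cumsum(dim=-1) - 1
position_ids = position_ids.masked_fill(attention_mask == 0, 0)

This keeps E,F of row 2 at positions \(0,1\). Without this correction, GPT-2 places the real tokens of the short prompt at larger absolute positions than it would see for the same prompt alone, which typically degrades quality.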
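The cumulative-sum rule can be traced by hand on the two-prompt batch. The sketch below re-implements it with plain lists so that it runs without any tensor library; position_ids_from_mask is a hypothetical name.

```python
def position_ids_from_mask(attention_mask):
    """Per-row position ids: cumulative count of real tokens minus one."""
    batch = []
    for row in attention_mask:
        ids, seen = [], 0
        for m in row:
            seen += m
            # Pad slots get a dummy position 0; they are masked out anyway.
            ids.append(seen - 1 if m == 1 else 0)
        batch.append(ids)
    return batch

attention_mask = [
    [1, 1, 1, 1],   # prompt 1: A B C D
    [0, 0, 1, 1],   # prompt 2: <pad> <pad> E F
]
print(position_ids_from_mask(attention_mask))  # [[0, 1, 2, 3], [0, 0, 0, 1]]
```

The dummy value assigned to pad slots is harmless only because the attention mask keeps real tokens from reading those slots.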

Definition: Padding Side

Right padding appends pad tokens after the real sequence. Left padding prepends pad tokens before the real sequence. For decoder-only models, right padding is convenient for training, while left padding is often convenient for batched generation.

Padding Choice Summary

Scenario	Preferred padding	Extra requirements
GPT-2 causal LM training	right padding	set pad labels to `-100`; pass attention mask
GPT-2 batched generation	left padding	recompute position ids from attention mask
right-padded generation	possible but fragile	gather logits at last non-pad index, not `-1`
packed pretraining blocks	no pad inside block	use `<eos>` or document-aware block mask

Safe Right-Padded Generation

Right padding can also be used for generation, as long as you do not take logits[:, -1, :] blindly. Take the last non-pad position of each row instead:

out = model(input_ids=input_ids, attention_mask=attention_mask)
last_idx = attention_mask.long().sum(dim=-1) - 1
batch_idx = torch.arange(input_ids.shape[0], device=input_ids.device)
next_logits = out.logits[batch_idx, last_idx, :]

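This code lets the short samples of a right-padded batch also continue from their last real token. The problem is that once the decode loop appends new tokens, position management across the samples of a batch becomes complicated; many inference frameworks therefore still choose left padding, which makes "the last column is the newest real token" a batch invariant.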

Generation Logit Contract

For decoder-only generation, the logits used for sampling must correspond to the last real prompt token, not merely the last tensor column.

Padding, Packing, and Document Boundaries

There are two common ways to make efficient training batches.

Pad-to-Max-Length

Suppose the longest sample in the batch has length \(T_{\max}\) and all samples are padded to \(T_{\max}\). The advantage is simplicity; the drawback is wasted compute:

\[ \text{wasted fraction} = 1-\frac{\sum_i T_i}{B T_{\max}}. \]

If the lengths are \([1024,100,90,80]\), about 68% of the token slots are padding, so most of the computation is spent on pads.
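The wasted fraction for this batch follows directly from the formula; padding_waste is a hypothetical helper name.

```python
def padding_waste(lengths):
    """Fraction of token slots that are padding under pad-to-max-length."""
    t_max = max(lengths)
    return 1.0 - sum(lengths) / (len(lengths) * t_max)

waste = padding_waste([1024, 100, 90, 80])
print(f"{waste:.3f}")  # 0.684
```

A single long sample sets \(T_{\max}\) for the whole batch, which is why length-bucketed batching or packing pays off on skewed length distributions.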

Sequence Packing

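The other approach is to concatenate several short documents into a fixed-length block:

doc1 <eos> doc2 <eos> doc3 <eos>

There is almost no padding and GPU utilization is high. The problem is that with only the ordinary causal mask, tokens of doc3 can attend to doc1/doc2. There are two ways to handle this:

accept it: use <eos> as the boundary and let the model learn that it should start over after a document boundary;
block it: add a document-aware block-diagonal mask that forbids cross-document attention.

Many pretraining pipelines take the first option because it is simple and efficient; for instruction tuning, or for training that requires strictly independent samples, the second is cleaner.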
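The "block it" option amounts to intersecting the causal mask with a same-document condition. The following is a minimal sketch on plain lists; the token ids, the <eos> id 0, and both helper names are assumptions made for illustration.

```python
def document_ids(tokens, eos_id):
    """Assign each token the index of the document it belongs to."""
    ids, doc = [], 0
    for tok in tokens:
        ids.append(doc)
        if tok == eos_id:
            doc += 1          # the <eos> itself still belongs to its document
    return ids

def packed_causal_mask(tokens, eos_id):
    """allowed[i][j] is True if position i may attend to position j."""
    doc = document_ids(tokens, eos_id)
    n = len(tokens)
    return [[j <= i and doc[i] == doc[j] for j in range(n)] for i in range(n)]

EOS = 0                          # hypothetical <eos> id
tokens = [5, 6, EOS, 7, EOS, 8]  # doc1 = [5, 6], doc2 = [7], doc3 = [8]
mask = packed_causal_mask(tokens, EOS)
for row in mask:
    print("".join("x" if ok else "." for ok in row))
```

The printed pattern is a block-diagonal set of lower triangles, one per document. With absolute position embeddings, a pipeline that blocks cross-document attention may also want to restart position ids at each document, which this sketch does not do.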

Pitfall: Packing Changes the Effective Task

Packing is not just an implementation trick. If examples can attend across boundaries, the model is trained on a slightly different distribution than independent examples.

Minimal Training Step

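Below is conceptual PyTorch pseudocode showing the key points of GPT-2 style causal LM training:

# Assumes texts, tokenizer, model, and optimizer are already defined.
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no dedicated pad token
tokenizer.padding_side = "right"           # right padding for training

batch = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=1024,
    return_tensors="pt",
)

input_ids = batch["input_ids"]
attention_mask = batch["attention_mask"]

# Labels are a copy of the inputs; pad positions are excluded from the loss.
labels = input_ids.clone()
labels[attention_mask == 0] = -100

optimizer.zero_grad()  # clear gradients left over from the previous step
out = model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    labels=labels,
)

loss = out.loss  # the model shifts logits and labels internally
loss.backward()
optimizer.step()

What really matters in this code:

attention_mask keeps padding from being read;
labels=-100 keeps padding from contributing to the loss;
the shift happens inside the model;
the tokenizer's padding side must match the training/generation scenario;
if pad_token=eos_token, the mask is what distinguishes a real eos from a pad.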

Minimal GPT-2 Block Skeleton

Writing the structure above as a minimal block makes the order of residual, LayerNorm, fused QKV, and MLP clearer:

import torch.nn as nn

# CausalSelfAttention and NewGELU are assumed to be defined elsewhere.
class GPT2Block(nn.Module):
    def __init__(self, d, n_head, dropout_p):
        super().__init__()
        self.ln_1 = nn.LayerNorm(d)
        self.attn = CausalSelfAttention(d, n_head, dropout_p)
        self.ln_2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(
            nn.Linear(d, 4 * d),
            NewGELU(),
            nn.Linear(4 * d, d),
            nn.Dropout(dropout_p),
        )

    # Position ids are consumed by the embedding layer, not by the block.
    def forward(self, x, attention_mask, cache=None):
        # Pre-LN: normalize, apply the sublayer, then add the residual.
        x = x + self.attn(self.ln_1(x), attention_mask, cache=cache)
        x = x + self.mlp(self.ln_2(x))
        return x

A real implementation also includes attention dropout, residual dropout, the causal bias buffer, the past key/value cache, head mask, and other details. But this skeleton fixes the core order of GPT-2: norm -> sublayer -> residual, repeated twice.

Pitfall: Post-LN and Pre-LN Are Not Drop-In Equivalent

Moving LayerNorm after residual changes gradient flow and checkpoint semantics. GPT-2-style blocks use pre-sublayer normalization plus a final ln_f after the last block.

KV Cache and Why Decoder-Only Generates Efficiently

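During training, all positions of a sequence \(x_{1:T}\) can be computed in parallel. During generation, tokens come out one at a time:

\[ x_{T+1}\sim p_\theta(\cdot\mid x_{\leq T}), \qquad x_{T+2}\sim p_\theta(\cdot\mid x_{\leq T+1}). \]

Recomputing the whole prefix at every step is expensive. The KV cache stores the \(K,V\) of historical tokens for every layer:

\[ K_{\leq T}^{(\ell)},V_{\leq T}^{(\ell)}. \]

The next step computes \(q_{T+1},k_{T+1},v_{T+1}\) only for the new token and lets it attend to the cached keys/values:

\[ \operatorname{Attn}(q_{T+1},[K_{\leq T};k_{T+1}],[V_{\leq T};v_{T+1}]). \]

This turns the repeated per-step computation from "recompute all historical layer representations" into "append the KV of one token". The memory cost is that the cache grows linearly with batch size, number of layers, number of heads, and context length:

\[ \text{KV elements} = 2\cdot B\cdot L\cdot H\cdot T\cdot d_h. \]

This is exactly why the later discussion of inference infrastructure covers PagedAttention, KV cache management, and memory fragmentation in GPU serving.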
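Plugging the GPT-2 small config into the element count gives a feel for the scale. The 2 bytes per element (a 16-bit cache) is an assumption of this sketch, as is the function name kv_cache_bytes.

```python
def kv_cache_bytes(batch, n_layer, n_head, seq_len, head_dim, bytes_per_elem=2):
    """Bytes for keys and values of all layers (the leading 2 is K plus V)."""
    elements = 2 * batch * n_layer * n_head * seq_len * head_dim
    return elements * bytes_per_elem

# GPT-2 small at full context, one sequence, 16-bit cache entries (assumed).
size = kv_cache_bytes(batch=1, n_layer=12, n_head=12, seq_len=1024, head_dim=64)
print(size / 2**20, "MiB")  # 36.0 MiB
```

The number is small for GPT-2 small, but every factor in the product grows in larger models and larger batches, which is what makes cache memory a first-order serving constraint.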

Position IDs with Cache

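GPT-2 uses learned absolute position embeddings, so a decode step must know the absolute position of the new token. In the prefill phase:

input ids:    [A, B, C]
position ids: [0, 1, 2]

When decoding the first new token, its position should be 3:

new token:    [D]
position id:  [3]

If the batch contains padding, positions must increase with the true length of each sample, not with the column index of the padded tensor. A minimal way to write this is:

past_len = attention_mask.long().sum(dim=-1, keepdim=True)
next_position_ids = past_len

After several generation steps, past_len must be updated as each sample appends tokens; after EOS you also have to decide whether the sample keeps occupying a slot, stops sampling, or releases its cache. These look like serving details, but for GPT-2's absolute position embedding they directly determine which position vector the model reads.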
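The difference between true length and tensor column index can be traced over a few decode steps. This is an illustrative sketch on plain lists with a made-up left-padded batch; it ignores EOS handling and cache release entirely.

```python
# Left-padded prompts: row 0 has 4 real tokens, row 1 has 2.
attention_mask = [[1, 1, 1, 1], [0, 0, 1, 1]]
true_len = [sum(row) for row in attention_mask]
num_columns = len(attention_mask[0])

correct_history, naive_history = [], []
for step in range(3):
    # Correct: the new token of each row sits right after its real tokens.
    correct_history.append(list(true_len))
    # Naive: use the column index of the padded tensor for every row.
    naive_history.append([num_columns + step] * len(attention_mask))
    # Each row appends one generated token per step.
    true_len = [n + 1 for n in true_len]

print(correct_history)  # [[4, 2], [5, 3], [6, 4]]
print(naive_history)    # [[4, 4], [5, 5], [6, 6]]
```

For the row without padding the two schemes agree; for the short row the naive scheme stays two positions too high at every step.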

Pitfall: Cache Length and Position ID Can Diverge

In padded batched decoding, cache length, tensor column index, and true sequence length are not always the same. GPT-2-style absolute position ids must follow true generated length.

Why GPT-2 Matters for Modern LLMs

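What GPT-2 left behind is not a model of some fixed size but a template:

serialize every task as text;
unify the training objective as causal LM;
use a decoder-only Transformer as a scalable parameterization;
use a tokenizer to map open-ended strings to a finite vocabulary;
use the prompt to specify the task at inference time;
use sampling and a KV cache to turn the probability model into an interactive generator.

Later models such as LLaMA, Qwen, Mistral, and GPT-NeoX changed many details: RoPE replaces the absolute position embedding, SwiGLU and RMSNorm replace some GPT-2 components in many of them, and data and post-training became larger and more complex. But the "decoder-only causal LM" skeleton is still mainstream.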

Reading GPT-2 Correctly

GPT-2 should be read as a working bridge from classical language modeling to modern LLMs: its architecture is simple enough to derive on paper, but its training recipe already contains the key engineering details that make later LLMs possible.

References

Language Models are Unsupervised Multitask Learners, Radford et al., OpenAI.
OpenAI GPT-2 source code, especially the causal attention mask and KV cache implementation.
Hugging Face GPT-2 documentation, especially padding, past_key_values, and label shifting conventions.