4.7 Qwen3 Technical Report

Qwen3 是现代开源 LLM 的一个很好的讲义案例。它不是只给出一个更大的 decoder-only Transformer，而是把几条现代路线放在同一个系统里：dense scaling、MoE scaling、GQA/RoPE/RMSNorm/SwiGLU/QK-Norm、long-context adaptation、thinking/non-thinking 双模式、reasoning RL、strong-to-weak distillation 和 serving-time reasoning parser。

读这种 technical report，最重要的不是背 benchmark，而是拆成四层：

architecture: 一个 token forward 到底经过哪些模块；
pretraining: 数据分布、阶段和 context length 如何变化；
post-training: 什么行为由 SFT/RL/distillation 写进模型；
inference protocol: chat template、thinking tags、budget、parser 和 serving runtime 如何把模型能力暴露出来。

Definition: Technical Report Reading

A model technical report should be read as a design document: architecture explains compute and memory, data explains capability pressure, post-training explains behavior, and inference protocol explains how the behavior is accessed.

Model Family: Dense and MoE

Qwen3 系列包含 6 个 dense models 和 2 个 MoE models：

Family	Models
Dense	0.6B, 1.7B, 4B, 8B, 14B, 32B
MoE	30B-A3B, 235B-A22B

A3B 和 A22B 表示每个 token forward 时激活的参数量。MoE 的核心是让 total capacity 很大，但 per-token compute 只经过少数 experts。

Definition: Activated Parameters

Activated parameters are the parameters used for one token’s forward pass. In MoE models, total parameters can be much larger than activated parameters because only routed experts are evaluated for each token.

这带来一个直接的系统含义：dense model 的权重访问和计算量大致随总参数量增长；MoE model 的存储仍要容纳全部 experts，但每个 token 的矩阵计算只走 activated experts。MoE 因此更像“用显存换计算效率和容量”。

Decoder-Only Skeleton

Qwen3 仍然是 causal decoder-only LM：

\[ x_{1:T} \rightarrow E[x_{1:T}] \rightarrow \text{causal Transformer blocks} \rightarrow z_{1:T} \rightarrow p_\theta(x_{t+1}\mid x_{\leq t}). \]

一层 dense Transformer block 可以抽象为：

\[ \tilde{h} = h+\operatorname{Attn}(\operatorname{RMSNorm}(h)), \]

\[ h' = \tilde{h} + \operatorname{SwiGLU}(\operatorname{RMSNorm}(\tilde{h})). \]

MoE 版本通常把 FFN/SwiGLU 子层替换为 routed expert mixture：

\[ h' = \tilde{h} + \operatorname{MoE}(\operatorname{RMSNorm}(\tilde{h})). \]

这说明 Qwen3 和 GPT-2 的关系是“同一个自回归骨架，不同的现代组件组合”。GPT-2 用 learned absolute position、MHA、LayerNorm、GELU FFN；Qwen3 使用 RoPE、GQA、RMSNorm、SwiGLU、QK-Norm，并在 MoE 版本中加入专家路由。

Architecture Components

Component	Role	Why it matters
RoPE	encode relative position through rotations	supports length extrapolation tricks better than learned absolute table
GQA	fewer KV heads than query heads	reduces KV cache memory and decode bandwidth
RMSNorm	scale-only normalization	cheaper and common in modern LLMs
Pre-Norm	normalize before sublayer	improves gradient flow in deep Transformers
SwiGLU	gated FFN nonlinearity	stronger FFN capacity than vanilla GELU MLP
QK-Norm	normalize attention query/key	stabilizes attention logits
no QKV bias	remove bias terms in Q/K/V projections	simplifies projections and follows Qwen3 design

Definition: Grouped Query Attention

Grouped Query Attention uses more query heads than key/value heads. Several query heads share one key/value head, reducing KV cache memory and bandwidth.

If query heads are \(H_q\), KV heads are \(H_{kv}\), head dimension is \(d_h\), and active context length is \(T\), one layer of KV cache has

\[ 2BT H_{kv}d_h \]

elements. Standard MHA has \(H_{kv}=H_q\); GQA has \(H_{kv}<H_q\)。因此 KV cache memory ratio is

\[ \frac{M_{\text{GQA}}}{M_{\text{MHA}}} = \frac{H_{kv}}{H_q}. \]

For Qwen3-32B, the model card lists \(H_q=64\) and \(H_{kv}=8\), so KV cache per layer is roughly \(1/8\) of the MHA version. This is not a cosmetic architecture choice; it directly determines serving capacity.

Dense Model Shapes

Qwen3 dense models:

Model	Layers	Heads Q/KV	Tie Embedding	Context
Qwen3-0.6B	28	16 / 8	Yes	32K
Qwen3-1.7B	28	16 / 8	Yes	32K
Qwen3-4B	36	32 / 8	Yes	32K
Qwen3-8B	36	32 / 8	No	128K
Qwen3-14B	40	40 / 8	No	128K
Qwen3-32B	64	64 / 8	No	128K

这里可以看到两个 scaling pattern：

层数随模型变大而上升，32B 到 64 layers；
KV heads 基本保持 8，query heads 随模型增大。

第二点很关键。增加 query heads 提高 attention 表达能力；保持较少 KV heads 控制 inference KV cost。现代 LLM 的架构不是只为训练 loss 设计，也为 serving memory 设计。

Pitfall: Context Length Numbers Need Interpretation

Reports, model cards, and serving configs may distinguish native training length, architecture table length, and YaRN-extended inference length. Always ask which one is being used in a benchmark or deployment.

Reading Shape as a Memory Bill

表里的 heads 不是装饰数字。以一层 KV cache 为例，如果 batch 为 \(B\)，当前缓存长度为 \(T\)，KV heads 为 \(H_{kv}\)，head dimension 为 \(d_h\)，dtype bytes 为 \(s\)，则一层 KV cache 近似为：

\[ M_{\text{KV, layer}} = 2BT H_{kv}d_hs. \]

总 KV cache 还要乘 layers：

\[ M_{\text{KV,total}} = 2BT H_{kv}d_hsL. \]

所以 Qwen3-32B 的 64 / 8 heads 设计表示：query 表达能力按 64 heads 做，但 KV cache 只按 8 heads 存。若它使用普通 MHA，\(H_{kv}=H_q=64\)，KV cache 会是 GQA 版本的 8 倍。

这也是为什么 long-context 模型常常先卡在 serving，而不是参数显存。权重是固定成本，KV cache 是随并发、上下文长度和 decode token 数线性增长的动态成本：

Quantity	Scales with	Serving meaning
weights	parameters	model placement and tensor parallelism
KV cache	\(B\times T\times L\times H_{kv}\times d_h\)	admission control and batching
attention FLOPs	roughly \(B\times T^2\) for prefill	long-prompt latency
decode FLOPs	roughly \(B\times T\) per token	streaming latency

读 technical report 时，看到 128K context 不应该只理解成“能读很长”；还要立刻问：这个长度下 prefill 多慢、KV cache 多大、并发降到多少、是否需要 paged KV 或 chunked prefill。

From Config Fields to Tensor Shapes

读 Qwen3 这类 model card 时，最应该训练的能力是把配置字段翻译成 forward 里的张量。假设 hidden size 为 \(d_{\text{model}}\)，query heads 为 \(H_q\)，KV heads 为 \(H_{kv}\)，head dimension 为

\[ d_h=\frac{d_{\text{model}}}{H_q}. \]

一个 decoder layer 的 attention projections 通常产生：

hidden_states: [B, T, d_model]
q:             [B, T, H_q,  d_h]
k:             [B, T, H_kv, d_h]
v:             [B, T, H_kv, d_h]

GQA 的关键是把 \(H_{kv}\) 个 key/value heads 复用给 \(H_q\) 个 query heads。如果

\[ G=\frac{H_q}{H_{kv}}, \]

则每个 KV head 服务 \(G\) 个 query heads。实现时可以显式 repeat：

def repeat_kv(x, groups):
    # x: [B, T, H_kv, d_h]
    bsz, seqlen, kv_heads, head_dim = x.shape
    x = x[:, :, :, None, :].expand(bsz, seqlen, kv_heads, groups, head_dim)
    return x.reshape(bsz, seqlen, kv_heads * groups, head_dim)

但高性能 kernel 往往不会真的 materialize repeated KV，而是在 attention kernel 内部根据 query head id 映射到 KV head id：

\[ h_{kv}=\left\lfloor\frac{h_q}{G}\right\rfloor. \]

这就是为什么 GQA 同时影响内存布局和 kernel indexing。若把 GQA 当成普通 MHA 写成 repeat_interleave，代码简单但可能多用 \(G\) 倍 KV bandwidth。

QK-Norm 可以放在 RoPE 之前或之后，取决于实现约定；核心检查是 attention logits 的尺度是否稳定：

q = q_proj(hidden).view(B, T, H_q, d_h)
k = k_proj(hidden).view(B, T, H_kv, d_h)
v = v_proj(hidden).view(B, T, H_kv, d_h)

q = q_norm(q)
k = k_norm(k)
q, k = apply_rope(q, k, position_ids)
attn_out = gqa_attention(q, k, v, causal_mask, kv_cache)

Pitfall: Shape Compatibility Is Not Semantic Compatibility

A checkpoint can load with matching tensor ranks but still be semantically wrong if RoPE scaling, QK-Norm placement, GQA head grouping, or tokenizer ids differ from the original implementation.

One Qwen3-Style Decoder Block in Tensor Form

把一层写成接近实现的形式，可以看清每个组件到底在哪里发挥作用。设输入 \(h\in\mathbb{R}^{B\times T\times d}\)，一层 dense block 可以写成：

def qwen3_dense_block(h, pos, cache, mask):
    x = rms_attn(h)

    q = q_proj(x).view(B, T, H_q, d_h)
    k = k_proj(x).view(B, T, H_kv, d_h)
    v = v_proj(x).view(B, T, H_kv, d_h)

    q = q_norm(q)
    k = k_norm(k)
    q, k = apply_rope(q, k, pos)

    attn = gqa_attention(q, k, v, cache, mask)
    h = h + o_proj(attn.reshape(B, T, d))

    x = rms_mlp(h)
    gate = gate_proj(x)
    up = up_proj(x)
    ff = down_proj(silu(gate) * up)
    h = h + ff
    return h

MoE block 只替换第二个 residual branch：

def qwen3_moe_block(h, pos, cache, mask):
    h = h + attention_branch(rms_attn(h), pos, cache, mask)
    x = rms_mlp(h)
    h = h + routed_experts(x)
    return h

这个伪代码有三层检查价值：

shape check: GQA 的 \(H_q\) 和 \(H_{kv}\) 必须在 attention 内正确映射；
state check: cache 只缓存 K/V，不缓存 Q，也不缓存 MLP 中间量；
scale check: RMSNorm、QK-Norm、RoPE、attention scale 的顺序必须和 checkpoint 训练时一致。

Definition: Decoder Block Contract

A decoder block contract specifies tensor shapes, normalization order, cache inputs/outputs, residual branches, and whether the FFN branch is dense or routed.

很多“自己实现 Qwen-like 模型”的错误不是线性层维度错，而是 contract 细节错：比如把 QK-Norm 放错位置、真的 materialize repeated KV、decode 时忘记用 cache position 做 RoPE、或把 MoE router 的 logits 当成普通分类 loss 监控。

MoE Model Shapes

Qwen3 MoE models:

Model	Layers	Heads Q/KV	Experts total/activated	Context
Qwen3-30B-A3B	48	32 / 4	128 / 8	128K
Qwen3-235B-A22B	94	64 / 4	128 / 8	128K

每个 token 从 128 个 experts 中激活 8 个。对 hidden state \(x\)，router 输出 logits：

\[ r(x)=W_rx. \]

取 top-\(k\) experts：

\[ S(x)=\operatorname{TopK}(r(x), k). \]

gate weights:

\[ g_e(x) = \frac{\exp(r_e(x))} {\sum_{j\in S(x)}\exp(r_j(x))}, \qquad e\in S(x). \]

MoE FFN 输出：

\[ \operatorname{MoE}(x) = \sum_{e\in S(x)} g_e(x)E_e(x). \]

Definition: Expert Load

Expert load is the fraction of tokens routed to each expert during a batch or global batch. Balanced load is needed for efficient MoE training and serving.

如果 router 把大部分 tokens 送到少数 experts，系统会出现两个问题：热门 expert 过载，冷门 expert 学不到东西。Qwen3 报告强调 global-batch load balancing loss，就是为了在更大 token 统计上约束 expert load。

一种常见的 load-balancing 形状是：

\[ \mathcal{L}_{\text{balance}} \propto N_E \sum_{e=1}^{N_E} f_e p_e, \]

其中 \(f_e\) 是 routed-to expert \(e\) 的 token fraction，\(p_e\) 是 router 对 expert \(e\) 的平均 probability。这个项鼓励 tokens 和 router probability 不要集中到少数 experts。

MoE Forward as a Dispatch Problem

MoE 的数学式

\[ \operatorname{MoE}(x) = \sum_{e\in S(x)}g_e(x)E_e(x) \]

在代码里不是一个普通 Linear。它要做 token dispatch：

tokens -> router logits -> top-k experts
       -> group tokens by expert
       -> run expert FFNs
       -> scatter weighted outputs back

一个 batch 有 \(N=BT\) 个 token，top-\(k\) 路由会产生 \(Nk\) 个 token-expert assignments。实现里常见张量形状是：

router_logits: [N, num_experts]
topk_ids:      [N, k]
topk_weight:   [N, k]
expert_input:  ragged groups by expert
expert_output: ragged groups by expert

为了上 GPU，ragged groups 通常会被 flatten 成连续 buffer，并记录 expert_offsets：

flat_expert_ids = topk_ids.reshape(-1)
flat_token_ids = repeat_arange(num_tokens, repeats=k)
order = flat_expert_ids.argsort()

sorted_experts = flat_expert_ids[order]
sorted_tokens = flat_token_ids[order]

接下来每个 expert 处理属于自己的 token slice。expert parallelism 下，这一步还可能跨 GPU 做 all-to-all。于是 MoE 的瓶颈不只是 FLOPs，而是 route balance、dispatch bandwidth、expert placement 和 all-to-all overlap。

Pitfall: Activated Parameters Do Not Equal Latency

Activated parameters estimate per-token expert compute, but MoE latency also depends on dispatch, all-to-all communication, load balance, kernel fusion, and how many experts are colocated on each device.

Capacity, Load, and Dispatch Diagnostics

MoE routing 在数学上是 top-\(k\)，在系统里还要面对 capacity。设一个 global batch 有 \(N=BT\) 个 tokens，expert 数为 \(E\)，每个 token 激活 \(k\) 个 experts。若负载完全均匀，每个 expert 收到的 assignment 数约为：

\[ \frac{Nk}{E}. \]

训练系统常用 capacity factor \(\gamma\) 预留余量：

\[ C_e = \left\lceil \gamma\frac{Nk}{E} \right\rceil. \]

若某个 expert 收到超过 \(C_e\) 个 assignments，就必须选择策略：drop overflow、reroute 到备选 expert、增大 capacity，或接受动态 ragged kernel 的开销。Qwen3 报告强调 global-batch load balancing；这类 loss 的目的，就是让真实 load 尽量接近上面的均匀预算。

Definition: MoE Capacity Factor

The MoE capacity factor is a multiplier that sets per-expert token capacity above the ideal balanced load. Larger capacity reduces overflow but increases memory and padding work.

一个可监控的 MoE training step 至少应该记录：

Metric	Formula / shape	What it catches
`load[e]`	assignments routed to expert \(e\)	hot/cold experts
`importance[e]`	mean router probability for expert \(e\)	router collapse before hard routing
`overflow_rate`	dropped assignments / total assignments	capacity too small or routing imbalanced
`router_entropy`	\(-\sum_e p_e\log p_e\)	overly sharp or overly flat router
`all_to_all_bytes`	dispatch + combine traffic	communication bottleneck

伪代码可以写成：

def moe_metrics(router_probs, topk_ids, num_experts):
    # router_probs: [N, E], topk_ids: [N, k]
    flat = topk_ids.reshape(-1)
    load = flat.bincount(minlength=num_experts).float()
    load = load / load.sum().clamp_min(1)

    importance = router_probs.mean(dim=0)
    entropy = -(importance * importance.clamp_min(1e-12).log()).sum()
    return {
        "load_max": load.max(),
        "load_min": load.min(),
        "load_std": load.std(),
        "router_entropy": entropy,
    }

这里的 load_std 经常比 loss 更早暴露问题：loss 可能还在下降，但某几个 experts 已经承担了大多数 tokens。对 MoE 来说，训练稳定性和系统吞吐不是两件事；router 分布就是二者交汇的地方。

Qwen3 MoE 去掉 shared experts，并采用 128 total / 8 activated experts。去掉 shared expert 的系统含义是：每个 token 的 FFN capacity 更依赖 router 选择，load-balancing 和专家专化更重要；serving 时也少了一个“所有 token 必经”的 dense FFN 分支。

Pitfall: MoE Saves Activated Compute, Not System Complexity

MoE reduces per-token activated compute, but introduces routing, load balancing, expert parallelism, communication, checkpointing, and serving-placement problems.

QK-Norm and Attention Stability

标准 scaled dot-product attention 使用

\[ a_{ij} = \frac{q_i^\top k_j}{\sqrt{d_h}}. \]

如果 \(\|q_i\|\) 或 \(\|k_j\|\) 变大，attention logits 会放大，softmax 可能饱和。Softmax 饱和后，attention distribution 接近 one-hot，梯度会集中甚至不稳定。

QK-Norm 先归一化 query/key：

\[ \hat{q}_i = \frac{q_i}{\|q_i\|+\epsilon}, \qquad \hat{k}_j = \frac{k_j}{\|k_j\|+\epsilon}, \]

然后计算

\[ a_{ij} = \frac{\hat{q}_i^\top \hat{k}_j}{\sqrt{d_h}}. \]

如果使用 RMS-style norm，也可以理解成把向量投到稳定尺度上再做 dot product。这样 attention score 更像方向相似度，而不是被向量范数主导。

Theorem: QK Normalization Bounds Raw Dot Products

If \(\hat{q}=q/\|q\|_2\) and \(\hat{k}=k/\|k\|_2\), then \[ -1\leq \hat{q}^\top \hat{k}\leq 1. \] Thus QK normalization bounds the unscaled attention similarity before the softmax temperature.

Proof

由 Cauchy-Schwarz inequality：

\[ |\hat{q}^\top\hat{k}| \leq \|\hat{q}\|_2\|\hat{k}\|_2 =1. \]

所以 dot product 在 \([-1,1]\) 内。实际实现可能带 learnable scale 或 RMS normalization，但核心思想相同：控制 Q/K 范数对 attention logits 的影响。

Tokenizer and Protocol Tokens

Qwen3 使用 Qwen tokenizer：byte-level BPE，词表大小 \(151{,}669\)。这比 GPT-2 的 \(50{,}257\) 大很多，原因不只是“词更多”，而是现代 LLM tokenizer 同时承担三类职责：

open-vocabulary text encoding；
multilingual/code/math compression；
chat protocol and special token encoding。

Definition: Protocol Token

A protocol token is a special token used to mark structure such as roles, turns, tool calls, thinking boundaries, or end-of-message delimiters.

Qwen3 的 thinking/non-thinking 不是外部 if-else；它通过 chat template 和 token 序列进入模型条件分布：

\[ p_\theta(y\mid x, m), \]

其中 \(m\) 是由 template、/think、/no_think、<think>...</think> 等 token 表达的 mode control。

这说明 tokenizer 不能随意替换。替换 tokenizer 会改变 special token ids、role boundary、thinking boundary 和训练时学到的条件格式。

Pretraining Data

Qwen3 预训练约 36T tokens，覆盖 119 种语言和方言。报告中强调了几类数据来源：

Data source	Purpose
broad web/text/books	general language and world knowledge
PDF-like documents extracted with Qwen2.5-VL	recover high-value document text
Qwen2.5-refined OCR text	improve extracted text quality
Qwen2.5-Math synthetic data	math reasoning pressure
Qwen2.5-Coder synthetic data	code and execution-like patterns
multilingual expansion	cross-lingual coverage

这里的重点是：现代 pretraining 已经不是简单地“抓更多网页”。它是数据工程系统：

collect -> extract -> annotate -> filter -> synthesize -> mix -> stage

报告还提到用多维标签做 instance-level data mixture，而不是只按数据源或域粗粒度调比例。直觉上，一个网页域里可能同时有高质量教程、低质量列表、代码片段和广告；instance-level label 比 domain-level label 更细。

Three-Stage Pretraining Curriculum

Qwen3 pretraining 分三阶段：

Stage	Tokens / length	Goal	Data pressure
S1 General	over 30T tokens, 4K length	broad language/world knowledge	multilingual broad corpus
S2 Reasoning	about 5T high-quality tokens, 4K length	STEM/code/reasoning	reasoning-heavy mixture
S3 Long Context	hundreds of billions tokens, 32K length	long-context adaptation	long documents and long sequences

用训练分布写，这不是一个固定 \(p_{\text{data}}\)，而是阶段化分布：

\[ p_{\text{train}}(x) = \begin{cases} p_1(x), & \text{general stage},\\ p_2(x), & \text{reasoning stage},\\ p_3(x), & \text{long-context stage}. \end{cases} \]

每个阶段仍然优化 next-token prediction：

\[ \mathcal{L}_{\text{NTP}} = - \sum_t \log p_\theta(x_t\mid x_{<t}), \]

但数据分布改变了梯度的语义。S1 的梯度教语言覆盖和世界知识；S2 的梯度增加 STEM/code/reasoning pressure；S3 的梯度让模型适应长距离依赖、position scaling 和 long document statistics。

Definition: Continued Pretraining

Continued pretraining keeps the same next-token objective but changes the data distribution, context length, or schedule to adapt a pretrained model toward new capabilities.

Long-context stage 还配合 RoPE frequency scaling、YaRN 和 Dual Chunk Attention 这类长上下文推理技术。这里要区分：

training length: 训练时实际看多长序列；
position extrapolation: RoPE/YaRN 怎样把位置扩展到更长；
serving memory: KV cache 是否容得下这些长度。

Stage Changes as Gradient Reweighting

三阶段 pretraining 仍然是同一个 next-token loss，但每阶段的 token 分布不同。设阶段 \(s\) 的数据分布为 \(p_s(x)\)，则梯度期望是：

\[ g_s(\theta) = \mathbb{E}_{x\sim p_s} \left[ \nabla_\theta \left( -\sum_t\log p_\theta(x_t\mid x_{<t}) \right) \right]. \]

从 S1 到 S2，不是“模型突然学会推理”，而是 STEM/code/reasoning token 的采样概率上升，使这些任务的梯度在参数更新中占更大权重：

\[ g_{\text{S2}} = \sum_d \alpha_d^{\text{S2}}g_d, \qquad \alpha_{\text{STEM/code}}^{\text{S2}} > \alpha_{\text{STEM/code}}^{\text{S1}}. \]

从 S2 到 S3，长度分布改变。长文档 token 让模型在训练时反复看到：

远距离引用；
长 RoPE position；
多段落 discourse pattern；
packed/long sequence 的 attention statistics。

所以 curriculum 的数学含义是梯度重加权，工程含义是 data loader、sequence packing、position scaling、batch size 和 LR schedule 都随阶段改变。把这些只写成“继续预训练”会漏掉真正的训练系统设计。

Pretraining Stage as a Loader Contract

如果把三阶段 pretraining 落到训练配置，它不是一段注释，而是一组会改变 batch 语义的字段：

stage: s3_long_context
sequence_length: 32768
data_mixture:
  long_documents: 0.70
  code_repositories: 0.10
  math_reasoning: 0.10
  multilingual: 0.10
packing:
  mode: document_boundary_aware
  insert_eos: true
  block_attention_across_docs: optional
position:
  rope_scaling: yarn
optimizer:
  lr: lower_than_s1
  batch_tokens: fixed_global_token_budget

这里每一项都对应一个真实风险：

Field	Why it matters
`sequence_length`	changes activation memory, attention cost, and positional distribution
`data_mixture`	changes gradient weights across domains
`packing.mode`	decides whether documents can attend across boundaries
`insert_eos`	teaches stopping and boundary statistics
`rope_scaling`	changes position embedding semantics
`batch_tokens`	keeps optimizer noise scale comparable across lengths

长上下文阶段尤其要小心 token budget。若 GPU memory 固定，长度从 4K 到 32K，micro-batch sequences 往往下降。为了让 optimizer 看到类似数量的监督 token，需要按 global token count 而不是 sequence count 设 batch：

\[ B_{\text{seq}}\times T \times \text{grad\_accum} \approx \text{target global tokens}. \]

如果只保持 batch_size 不变，S3 会突然把每步 token 数放大 8 倍；如果只保持 num_sequences 能放进显存，S3 又可能让每步 token 数和 gradient noise scale 大幅变化。报告里的 curriculum 背后，其实是 dataloader、packing、LR schedule 和 memory planner 的共同切换。

Pitfall: Stage Length Changes the Optimizer Regime

Changing context length changes tokens per step, activation memory, gradient accumulation, and effective noise scale. A long-context stage should be audited as a training-system change, not merely a data change.

Mixture Sampling as an Optimization Objective

技术报告里的 data mixture 可以写成一个采样器，而不是口号。设数据域 \(d\in\{1,\ldots,D\}\)，每个域有采样权重 \(\alpha_d\)，每步训练先采样域，再采样 document，再 packing 成固定 token budget：

\[ d\sim\operatorname{Categorical}(\alpha), \qquad x\sim p_d(x). \]

于是总体训练目标是：

\[ \mathcal{L}(\theta) = \sum_{d=1}^{D}\alpha_d \mathbb{E}_{x\sim p_d} \left[ -\sum_t \log p_\theta(x_t\mid x_{<t}) \right]. \]

调整 \(\alpha_d\) 就是在调整梯度混合：

\[ \nabla_\theta\mathcal{L} = \sum_d\alpha_d g_d(\theta). \]

这能解释为什么 S2 增加 STEM/code/reasoning 数据会改变模型能力：不是因为 loss 变了，而是因为 \(g_{\text{math}}\)、\(g_{\text{code}}\)、\(g_{\text{reasoning}}\) 在总更新里的权重变大了。

一个生产 dataloader 还要处理温度采样。若原始域大小为 \(n_d\)，可以用：

\[ \alpha_d = \frac{n_d^\tau}{\sum_j n_j^\tau}. \]

当 \(\tau=1\)，按 token 数比例采样；当 \(\tau<1\)，小语种、小域数据被上采样。多语言模型常需要这种 sampling temperature，否则高资源语言会吞掉大部分 batch。

def mixture_weights(sizes, temperature):
    raw = sizes.float().pow(temperature)
    return raw / raw.sum()

Pitfall: Data Mixture Is a Hidden Loss Weight

Changing corpus sampling weights changes the empirical objective. Reported model behavior cannot be reproduced from architecture alone without mixture and stage information.

Post-Training as Behavior Construction

Qwen3 post-training 的核心目标不是简单“对齐”，而是让同一个模型能在 thinking 和 non-thinking 之间切换。

四阶段可以写成：

Stage	Name	Main objective	Behavior learned
1	Long-CoT cold start	SFT on filtered long reasoning traces	reasoning format and trace style
2	Reasoning RL	GRPO on verifiable math/code/STEM tasks	stronger exploration and correctness
3	Thinking mode fusion	SFT on thinking + non-thinking data	controllable mode switching
4	General RL	broad preference/reward optimization	instruction following and general quality

Stage 1 的数据不是随便收长 CoT。报告描述了 query filtering 和 response filtering：去掉不可验证、多子问题、太容易、不需要 CoT 的 query；生成候选 reasoning，再过滤错误答案、重复、猜测、不一致和疑似污染样本。

Stage 2 用 verifiable query-verifier pairs 做 reasoning RL。对数学/代码这类任务，reward 可以来自答案匹配、单元测试或 verifier。GRPO 这类方法不需要显式 value model，而是对同一 query 的多个 rollouts 做 group-relative 更新。

Definition: Verifiable Reward

A verifiable reward is computed by an external checker, such as exact-answer matching, symbolic verification, unit tests, or a task-specific verifier, rather than by subjective preference alone.

Thinking Mode as a Control Variable

Qwen3 的 thinking mode 可以写成带控制变量的条件分布：

\[ p_\theta(y, r\mid x,m), \]

其中 \(r\) 是 reasoning trace，\(y\) 是 final answer，\(m\in\{\text{think},\text{no-think}\}\) 是 mode variable。

Non-thinking mode 希望直接生成：

\[ p_\theta(y\mid x,m=\text{no-think}), \]

thinking mode 则生成 reasoning + answer：

\[ p_\theta(r,y\mid x,m=\text{think}) = p_\theta(r\mid x,m) p_\theta(y\mid x,r,m). \]

这不是换模型，而是改变同一个模型的条件输入和输出协议。Chat template 把 mode control 转成 token 序列；post-training 让模型学会这些 token 的语义。

Pitfall: Thinking Mode Is Not a Hidden Module

Thinking mode is exposed through tokens and generation protocol. It changes the sampled trajectory and compute budget, not the model weights at inference time.

Thinking Budget

Thinking budget 是 inference-time compute control。理想上，我们希望任务难时多想，任务简单时少想：

\[ \text{quality} \approx Q(x, B_{\text{think}}), \qquad \text{latency} \approx C_{\text{prefill}}+C_{\text{decode}}(B_{\text{think}}+|y|). \]

预算越大，模型可以生成更多 reasoning tokens；但 decode 是串行的，延迟和 KV cache 都随 token 数增长。

Qwen3 报告中的机制可以理解为：当 reasoning length 达到用户预算时，runtime 手动结束 thinking segment，插入 stop-thinking instruction，然后让模型基于已有 reasoning 生成 final answer。

简化伪代码：

tokens = []
while len_thinking(tokens) < thinking_budget:
    next_token = sample(model, tokens)
    tokens.append(next_token)
    if next_token == THINK_END:
        break

if not ended_thinking(tokens):
    tokens.extend(tokenize(STOP_THINKING_INSTRUCTION))

answer = decode_until_stop(model, tokens)

这解释了为什么 thinking budget 是系统功能，不只是模型能力。Serving runtime 必须知道 thinking boundary，才能截断 reasoning、插入控制文本、继续生成 answer。

Budget Control as a Finite-State Machine

一个 thinking-aware decoder 至少有三个状态：

ANSWER_DIRECT
THINKING
FINAL_ANSWER

状态转移由 chat template、特殊 token 和 budget 共同决定：

start
  -> THINKING      if enable_thinking and no /no_think override
  -> ANSWER_DIRECT if enable_thinking=False or last flag is /no_think

THINKING
  -> FINAL_ANSWER if </think> generated
  -> FINAL_ANSWER if thinking_budget exhausted and runtime inserts stop instruction

FINAL_ANSWER
  -> stop if EOS / end-of-message generated

这可以写成服务端伪代码：

state = "THINKING" if should_think(messages, template_args) else "ANSWER_DIRECT"
thinking_tokens = 0

while not done:
    token = decode_one(model, kv_cache)
    emit_or_buffer(token, state)

    if state == "THINKING":
        thinking_tokens += 1
        if token == think_end_id:
            state = "FINAL_ANSWER"
        elif thinking_tokens >= thinking_budget:
            append_tokens(tokenizer.encode(STOP_THINKING_INSTRUCTION))
            state = "FINAL_ANSWER"

    elif token in stop_ids:
        done = True

这里最容易出错的是 streaming：reasoning tokens 是否展示给用户、是否写入 conversation history、budget 截断插入的 stop instruction 是否也进入 KV cache，都必须和训练时协议一致。否则下一轮多轮对话会以错误的 hidden context 继续。

Budget, KV Cache, and Latency Accounting

thinking budget 不是 UI 参数，它直接进入 serving 成本。设 prompt 长度为 \(T_p\)，thinking token 数为 \(T_r\)，final answer token 数为 \(T_y\)。prefill 成本大致随 \(T_p^2\) 或 attention kernel 的 prefill 工作量增长；decode 阶段每个新 token 都要 attend 到已有上下文，所以总 decode attention 近似为：

\[ \sum_{i=1}^{T_r+T_y}(T_p+i) = (T_r+T_y)T_p +\frac{(T_r+T_y)(T_r+T_y+1)}{2}. \]

KV cache 长度最终变成：

\[ T_{\text{cache}} = T_p+T_r+T_y. \]

所以把 thinking budget 从 \(512\) 提到 \(8192\)，不仅增加输出 token 费用，还会让后续 answer token 在更长 cache 上 decode。对 MoE 模型，还要叠加 expert dispatch 和负载波动。

Budget decision	System effect
allow long thinking	higher quality ceiling, longer decode latency
force early `</think>`	lower latency, risk incomplete reasoning
hide reasoning from user	parser/buffer complexity
store reasoning in history	better continuity, larger future prompts
drop reasoning from history	cheaper future turns, possible context mismatch

一个严谨服务端应该把这些写入 trace：

{
  "prompt_tokens": 1024,
  "thinking_tokens": 768,
  "answer_tokens": 128,
  "kv_cache_tokens": 1920,
  "budget_exhausted": false,
  "parser_state": "final_answer"
}

Definition: Thinking Budget Trace

A thinking budget trace records prompt, reasoning, answer, cache, parser, and budget-exhaustion fields so that reasoning-mode latency and quality can be audited.

Strong-to-Weak Distillation

Qwen3 对小模型强调 strong-to-weak distillation。形式上，teacher 给 student 提供更高质量的条件分布或输出：

\[ \mathcal{L}_{\text{KD}} = \operatorname{KL} \left( p_T(\cdot\mid x) \Vert p_S(\cdot\mid x) \right), \]

或者提供 sampled responses / reasoning traces / preference labels。报告指出，直接从强 teacher 蒸馏到轻量 student 可以比为每个小模型完整跑四阶段 post-training 更高效。

从训练范式看，这是把 expensive exploration 从小模型身上移到大模型：

large teacher explores / reasons / filters
  -> student imitates selected behavior
  -> optional on-policy refinement

Definition: Strong-to-Weak Distillation

Strong-to-weak distillation transfers behavior from a stronger teacher model to a smaller student through logits, generated traces, filtered responses, preferences, or verifier-selected samples.

这也是为什么小模型能力不只由参数量决定。若 teacher、filter、verifier 和 curriculum 设计得好，小模型可以获得远超“自己从零 RL 探索”的行为先验。

Inference and Chat Template

官方用 apply_chat_template 控制 mode：

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)

关闭 thinking：

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

在 soft switch 用法中，用户也可以在 prompt 或 system message 中使用 /think 和 /no_think。多轮对话里，模型遵循最近的 mode flag。官方 model card 还提示：当 enable_thinking=True 时，即使用 /no_think，输出协议也可能保留 <think>...</think> block，只是内容为空；当 enable_thinking=False 时，不产生 think block。

推理服务必须解析：

prompt template；
<think> 和 </think>；
final answer；
reasoning parser；
streaming 时是否展示 reasoning；
budget 截断后的 continuation。

Pitfall: Output Parsing Is Part of the Model Interface

If the serving layer ignores thinking tags or reasoning parser behavior, it can leak intermediate reasoning, drop final answers, or corrupt multi-turn state.

Minimal Inference Contract

一个 Qwen3-style endpoint 不应该只暴露 prompt -> text。更稳的接口至少返回：

{
  "reasoning": "...",
  "answer": "...",
  "finish_reason": "stop",
  "thinking_tokens": 128,
  "answer_tokens": 64,
  "mode": "think"
}

这些字段让调用方能区分三件事：

模型真的没有 thinking；
模型 thinking 为空但协议上有空 block；
模型 thinking 被 budget 截断后生成 final answer。

对评测来说，这也很重要。若 benchmark 只截取最后字符串，却没有记录 thinking budget 和 parser 规则，同一个 checkpoint 的结果可能不可复现。

Reasoning Parser Edge Cases

官方示例用 </think> token 的最后一次出现来分离 reasoning 和 final answer。真正的服务端 parser 还要处理更多边界：

模型没有生成 </think>；
模型生成多个 </think>；
enable_thinking=True 但用户用 /no_think 让 think block 为空；
streaming 时 </think> 被拆在不同 chunk；
budget 截断后 runtime 插入 stop-thinking instruction；
多轮对话中历史是否保存 reasoning。

一个更稳的 parser 应该在 token 层工作，而不是只在字符串层正则匹配：

def split_qwen3_output(output_ids, think_end_id):
    # Return token spans; decoding happens after the split.
    try:
        split = len(output_ids) - 1 - output_ids[::-1].index(think_end_id)
    except ValueError:
        return {
            "reasoning_ids": [],
            "answer_ids": output_ids,
            "has_think_end": False,
        }

    return {
        "reasoning_ids": output_ids[: split + 1],
        "answer_ids": output_ids[split + 1 :],
        "has_think_end": True,
    }

流式输出时，服务端可以维护一个小状态机：

BUFFER_REASONING -> EMIT_ANSWER -> DONE

在 BUFFER_REASONING 中，tokens 可以被缓存或发送到单独的 reasoning_content 字段；看到 </think> 后才切换到 answer stream。若没有这个状态机，就容易把 reasoning 泄漏到 content，或把 final answer 错放进 reasoning 字段。

Pitfall: Parser Policy Affects Benchmark Scores

If one evaluator strips reasoning by token id and another strips by text regex, malformed or truncated outputs can produce different final answers. Report parser policy together with thinking budget.

Conversation History Policy

Qwen3 的 soft switch 还带来一个容易忽略的问题：多轮对话历史到底保存什么。假设一轮输出是：

<think>r_1 ... r_m</think> y_1 ... y_n

下一轮 prompt 可以保存三种不同历史：

History policy	Next-turn context	Trade-off
full	reasoning + answer	maximum continuity, highest token cost, may expose hidden reasoning
answer-only	final answer only	cheaper and safer, may lose reasoning state
structured	reasoning stored out-of-band	auditable but requires model/template support

如果训练时多轮样本保留 reasoning，而 serving 时删掉 reasoning，模型下一轮看到的 context distribution 就变了；反之，如果训练时只保留 final answer，serving 却把 reasoning 塞回 history，模型可能把旧 reasoning 当成用户可见事实继续引用。

这可以形式化为 history transform \(H\)：

\[ p_\theta(y_{t+1}\mid H(m_{\leq t})). \]

不同 \(H\) 定义的是不同条件分布。评测 Qwen3-style reasoning 模型时，应同时报告 mode flag、thinking budget、parser policy 和 history policy。

Serving Implications

Qwen3 官方 model cards mention vLLM and SGLang support. For MoE and thinking models, serving has extra concerns:

Concern	Why it matters
GQA KV layout	KV cache shape uses \(H_{kv}\), not \(H_q\)
long context	32K/128K contexts can dominate memory
thinking tokens	reasoning increases decode length and latency
reasoning parser	endpoint must separate reasoning and answer
MoE routing	expert placement and load affect throughput
`/think` state	multi-turn conversations need mode control

For Qwen3-235B-A22B, the model card lists 235B total parameters and 22B activated parameters, with 94 layers, 64 Q heads, 4 KV heads, 128 experts, and 8 activated experts. This is not a model you “just load” without thinking about device mapping, tensor/expert parallelism, KV memory and serving framework support.

What Qwen3 Teaches Beyond GPT-2

GPT-2 的 mental model:

\[ \text{BBPE tokens} \rightarrow \text{decoder-only Transformer} \rightarrow \text{next-token prediction} \rightarrow \text{prompted generation}. \]

Qwen3 的 mental model:

\[ \text{BBPE/chat/protocol tokens} \rightarrow \text{GQA/RoPE/RMSNorm/SwiGLU/QK-Norm decoder} \rightarrow \text{dense or MoE capacity scaling} \rightarrow \text{multi-stage data curriculum} \rightarrow \text{thinking-aware post-training} \rightarrow \text{budgeted reasoning serving}. \]

核心 objective 仍然是 next-token prediction，但现代 LLM 的行为来自四件事的组合：

architecture controls compute, memory and inductive bias；
data curriculum controls capability pressure；
post-training controls response style and reasoning protocol；
serving runtime controls how much inference-time compute is actually spent。

How to Read Future Model Reports

When reading a new LLM report, build four tables: architecture shapes, data stages, post-training stages, and serving protocol. If one table is missing, you probably do not yet understand the model as a system.