LLM Serving Infrastructure

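LLM serving and LLM training are two different systems. A training system cares about turning large numbers of tokens into stable parameter updates; a serving system cares about how to keep evaluating, under limited GPU memory, limited latency, and a dynamic request stream,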

\[ p_\theta(x_{t+1}\mid x_{\leq t}) \]

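and to deliver the resulting token stream back to the user reliably. The core issue is not just whether the model runs, but whether you can answer these engineering questions:

How many requests' KV cache fits in GPU memory at the same time?
How should prefill and decode be scheduled so that neither drags the other down?
Is the batch fixed, or is it continuous batching?
Which strategy suits a single GPU, two GPUs, fragmented GPUs, and a full multi-GPU node?
Which metrics should be monitored in production to detect system degradation?

This section treats LLM inference as a runtime system, not as a model.generate() call.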

Request Lifecycle

Definition: Serving Request State

A serving request state contains tokenized input, prompt length, generated tokens, KV-cache block handles, position state, sampling parameters, stream buffer, and stopping conditions.

A request typically goes through:

HTTP/gRPC request
  -> tokenize
  -> admission control
  -> prefill queue
  -> prefill execution
  -> decode queue
  -> decode loop
  -> detokenize/stream
  -> finish and free KV blocks

Any of these steps can become the bottleneck:

Stage	Main work	Main bottleneck	User-facing symptom
tokenize	text to token ids	CPU / tokenizer implementation	request accepted slowly
admission	memory and queue check	scheduling policy	request rejected or queued
prefill	process prompt	compute and attention over prompt	high TTFT
decode	one token per step	KV bandwidth and scheduler overhead	low tokens/sec
stream	detokenize/send	network and client backpressure	bursty output
cleanup	release cache	allocator fragmentation	memory leak-like behavior

Definition: TTFT and TPOT

Time to first token (TTFT) measures latency until the first generated token. Time per output token (TPOT) measures average latency for subsequent decode tokens.

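TTFT and TPOT are often lumped together, but different parts of the system determine them. Long prompts raise TTFT; highly concurrent decode raises TPOT (each token takes longer); a slow client makes the stream buffer push back on the runtime.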

Memory Budget First

Serving capacity is first of all a GPU memory budget problem. Total GPU memory breaks down approximately as:

\[ M_{\text{total}} \geq M_{\text{weights}} + M_{\text{KV}} + M_{\text{workspace}} + M_{\text{fragmentation}} + M_{\text{runtime}}. \]

Weight memory is

\[ M_{\text{weights}} \approx N_{\text{params}}\cdot b_w, \]

where \(b_w\) is the number of bytes per parameter. For a decoder-only model, the KV cache is approximately

\[ M_{\text{KV}} = 2\cdot B\cdot T\cdot L\cdot H_{kv}\cdot d_h\cdot b_{kv}. \]

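Here the factor \(2\) accounts for the K and V tensors, \(B\) is the number of active sequences, \(T\) is the context length occupied by each sequence, \(L\) is the number of layers, \(H_{kv}\) is the number of KV heads, \(d_h\) is the head dimension, and \(b_{kv}\) is the number of bytes per KV element.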

Theorem: Admission Control Memory Bound

For a fixed model replica, a new request with maximum allocated length \(T_{\text{new}}\) is safe to admit only if the post-admission memory budget satisfies \[ M_{\text{weights}}+ M_{\text{KV,current}}+ \Delta M_{\text{KV,new}}+ M_{\text{workspace}}+ M_{\text{reserve}} \leq M_{\text{GPU}}. \]

Proof

Serving runtime must keep weights resident while requests are active. Existing requests already occupy \(M_{\text{KV,current}}\); the new request needs additional KV blocks

\[ \Delta M_{\text{KV,new}} = 2T_{\text{new}}LH_{kv}d_hb_{kv} \]

for each admitted sequence. Kernels also need temporary workspace, CUDA graphs may reserve memory, and allocators need slack to avoid fragmentation failure. If the sum exceeds device memory, either allocation fails, the process OOMs, or the scheduler must evict/preempt some active request. Thus the inequality is a necessary safety condition.

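This bound shows that max_new_tokens and max_model_len are not UI options; they directly change admission control. A request whose actual prompt is only 200 tokens still consumes the capacity of a full 32k context if the system conservatively reserves that much for it.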

Concrete Capacity Calculation

Assume:

model weights: 14B parameters, BF16 weights, so about \(28\) GB;
\(L=40\) layers;
\(H_{kv}=8\);
\(d_h=128\);
KV dtype BF16, \(b_{kv}=2\);
available GPU memory after runtime reserve: \(76\) GB;
workspace and fragmentation reserve: \(8\) GB.

The budget available for KV is then about

\[ 76-28-8=40\text{ GB}. \]

The KV for each request with an 8k context is about

\[ 2\cdot8192\cdot40\cdot8\cdot128\cdot2 \approx 1.34\text{ GB}. \]

So the theoretical maximum is about

\[ \left\lfloor \frac{40}{1.34}\right\rfloor=29 \]

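active sequences of 8k tokens each. The number that can actually be sustained is lower, because request lengths vary, the allocator fragments, kernel workspace changes with batch shape, and prefill peaks take extra memory.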
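The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and gigabytes are taken as decimal (\(10^9\) bytes) to match the numbers in the text.

```python
def kv_bytes_per_sequence(tokens, layers, kv_heads, head_dim, kv_dtype_bytes):
    """KV bytes for one sequence: 2 (K and V) * T * L * H_kv * d_h * b_kv."""
    return 2 * tokens * layers * kv_heads * head_dim * kv_dtype_bytes


def max_active_sequences(kv_budget_bytes, per_seq_bytes):
    """Upper bound on equal-length sequences that fit in the KV budget."""
    return kv_budget_bytes // per_seq_bytes


GB = 10**9  # decimal gigabytes, an assumption matching the text's arithmetic

per_seq = kv_bytes_per_sequence(
    tokens=8192, layers=40, kv_heads=8, head_dim=128, kv_dtype_bytes=2
)
kv_budget = (76 - 28 - 8) * GB  # available memory minus weights minus reserve

print(per_seq / GB)                              # about 1.34
print(max_active_sequences(kv_budget, per_seq))  # 29
```

Changing `tokens` to 16384 halves the bound, which is the capacity-multiplier effect described in the pitfall that follows.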

Pitfall: Context Length Is a Capacity Multiplier

Doubling context length roughly doubles KV memory. A serving system that is stable at 4k context can become unstable at 8k even when model weights and QPS are unchanged.

Prefill and Decode Are Two Queues

Prefill and decode have different computational shapes:

Stage	Shape	Hardware pressure	Scheduling issue
prefill	\(B\times T_p\)	GEMM/attention compute	long prompts can block TTFT
decode	\(B\times1\) per step	KV read bandwidth	many active sequences compete

A serving runtime usually needs to treat them as two queues:

prefill queue: new requests waiting for prompt processing
decode set: active requests already holding KV cache

If prefill is too greedy, long prompts fill the GPU and the streaming decode of existing users stalls. If decode is too greedy, new requests cannot get into prefill and TTFT blows up.

Definition: Chunked Prefill

Chunked prefill splits a long prompt into smaller chunks so that prefill work can be interleaved with decode work, reducing disruption to streaming requests.

A simplified scheduler loop:

while True:
    batch = []

    decode_items = pick_decode_requests(active, token_budget=decode_budget)
    batch.extend(decode_items)

    prefill_items = pick_prefill_chunks(
        waiting,
        token_budget=prefill_budget,
        memory_budget=free_kv_blocks,
    )
    batch.extend(prefill_items)

    outputs = model.forward(batch)
    update_kv_blocks(outputs)
    sample_next_tokens(outputs)
    release_finished_requests()

This pseudocode looks like ordinary batching, but the key is token_budget and memory_budget. LLM serving batch size should not be counted in requests alone; it should be counted in tokens, KV blocks, and the prefill/decode mix.
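To make the two budgets concrete, here is one possible shape for the chunk-selection step. It is a sketch under assumptions: `WaitingRequest`, `block_size`, and `max_chunk` are hypothetical names and defaults, not taken from any particular runtime, and the function only selects chunks; the caller would advance `prefilled` after the forward pass.

```python
from dataclasses import dataclass


@dataclass
class WaitingRequest:
    req_id: str
    prompt_len: int
    prefilled: int = 0  # prompt tokens already processed


def ceil_div(a, b):
    return -(-a // b)


def pick_prefill_chunks(waiting, token_budget, free_kv_blocks,
                        block_size=16, max_chunk=512):
    """Return [(req_id, start, end)] chunks that fit both budgets."""
    chunks = []
    for req in waiting:
        remaining = req.prompt_len - req.prefilled
        chunk = min(remaining, max_chunk, token_budget)
        if chunk <= 0:
            continue
        # New blocks needed to grow this request's KV by `chunk` tokens.
        need = (ceil_div(req.prefilled + chunk, block_size)
                - ceil_div(req.prefilled, block_size))
        if need > free_kv_blocks:
            continue
        chunks.append((req.req_id, req.prefilled, req.prefilled + chunk))
        token_budget -= chunk
        free_kv_blocks -= need
    return chunks


waiting = [WaitingRequest("A", prompt_len=1200), WaitingRequest("B", prompt_len=100)]
picked = pick_prefill_chunks(waiting, token_budget=600, free_kv_blocks=100)
print(picked)  # [('A', 0, 512), ('B', 0, 88)]
```

The long prompt A is capped at one 512-token chunk, and B receives only the 88 tokens left in the budget, so neither request monopolizes the step.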

Continuous Batching

Definition: Continuous Batching

Continuous batching lets requests enter and leave the active batch at token boundaries, instead of forcing a static batch to wait until every sequence finishes.

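The problem with static batching is that sequence lengths differ. Suppose a batch holds three requests with output lengths of 20, 200, and 400 tokens. A static batch keeps the short requests' slots occupied until the longest one finishes; continuous batching lets a short request release its KV as soon as it finishes, so a new request can join immediately.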
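The cost of this waiting can be estimated for the 20/200/400 example. The sketch below counts slot-steps only and ignores prefill and kernel effects; the function name is hypothetical.

```python
def static_batch_utilization(output_lens):
    """Share of slot-steps doing useful work when a static batch
    runs until its longest sequence finishes."""
    steps = max(output_lens)
    useful = sum(output_lens)
    return useful / (steps * len(output_lens))


u = static_batch_utilization([20, 200, 400])
print(round(u, 3))  # 0.517
```

Under these assumptions, nearly half of the decode slot-steps in the static batch produce no tokens; continuous batching tries to refill those slots with new requests.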

Continuous batching requires the runtime to manage:

the KV block table of each request;
the current position of each request;
sampling parameters;
stop sequences;
the streaming output buffer;
cache release for finished requests;
state saving for preempted requests.

Pitfall: Dynamic Batching Moves Complexity into the Runtime

Continuous batching improves utilization, but it requires explicit state management. Bugs often appear as wrong position ids, cross-request KV leakage, unreleased cache blocks, or stop-condition errors.

KV Blocks and Fragmentation

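Naive contiguous KV cache allocation wastes memory. Request lengths differ and generation lengths are unknown, so if every request gets a contiguous region sized for the maximum length, many slots go unused.

Paged-style KV management cuts the cache into fixed-size blocks: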

request A -> blocks [5, 9, 12]
request B -> blocks [7]
request C -> blocks [1, 3, 8, 10]

The attention kernel uses the block table to find the physical KV block for each logical position. This brings several benefits:

Benefit	Explanation
less internal waste	request only gets blocks it actually uses
easier reuse	finished request returns blocks to free list
preemption support	block table can be swapped or remapped
continuous batching	requests with different lengths coexist

Block size is a trade-off, though. If blocks are too large, internal fragmentation grows; if they are too small, block-table and kernel indirection overhead grows.

KV Allocator Invariants

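The allocator of a paged KV cache can be viewed as a pool of fixed-size blocks. Let the block size be \(P\) tokens and the total number of blocks be \(N_{\text{blk}}\). A request \(r\) that currently occupies \(T_r\) tokens needs a block count of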

\[ B_r=\left\lceil \frac{T_r}{P}\right\rceil. \]

Internal fragmentation comes from the last block not being full:

\[ F_{\text{internal}}(r) = B_rP-T_r. \]

The effective utilization of the whole system can be written as:

\[ U_{\text{KV}} = \frac{\sum_r T_r} {P\sum_r B_r}. \]

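This is more useful than looking only at allocated_blocks / total_blocks: that ratio tells you whether memory is occupied, while \(U_{\text{KV}}\) tells you how much of the occupied memory holds real tokens.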
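A small calculation shows how these quantities respond to block size. This is an illustrative sketch; the sequence lengths and function names are made up for the example.

```python
def blocks_needed(tokens, block_size):
    """B_r = ceil(T_r / P) using integer arithmetic."""
    return -(-tokens // block_size)


def kv_utilization(seq_lens, block_size):
    """Return (blocks per request, internal fragmentation per request, U_KV)."""
    blocks = [blocks_needed(t, block_size) for t in seq_lens]
    internal = [b * block_size - t for b, t in zip(blocks, seq_lens)]
    utilization = sum(seq_lens) / (block_size * sum(blocks))
    return blocks, internal, utilization


seq_lens = [100, 17, 256]  # hypothetical current lengths T_r
print(kv_utilization(seq_lens, block_size=16))   # ([7, 2, 16], [12, 15, 0], 0.9325)
print(kv_utilization(seq_lens, block_size=128))  # utilization drops to about 0.73
```

With the same three requests, moving from 16-token to 128-token blocks lowers utilization from 0.9325 to roughly 0.73, which is the internal-fragmentation side of the block-size trade-off.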

Definition: KV Block Table

A KV block table maps a request’s logical token-block indices to physical KV-cache blocks. Attention kernels use this table to read non-contiguous cache blocks as one logical sequence.

A minimal allocator needs to maintain three kinds of state:

State	Meaning	Invariant
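`free_blocks`	set of allocatable blocks	disjoint from every request's block table
`block_table[req]`	the request's logical-to-physical block mapping	length equals the request's currently reserved block count
`refcount[block]`	number of requests referencing the block	needed for prefix sharing / beam search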

Without prefix sharing, refcount is only \(0\) or \(1\). With prefix caching, beam search, parallel sampling, or speculative decoding, the same prefix block can be temporarily shared by several requests until one of them needs to write a new token. At that point the allocator must support copy-on-write:

def append_token(req, token_kv):
    table = block_table[req.id]
    offset = req.pos % block_size
    if offset == 0:
        # Previous block is full (or this is the first token): take a fresh block.
        block = free_blocks.pop()
        table.append(block)
        refcount[block] = 1

    block = table[-1]
    if refcount[block] > 1:
        # Copy-on-write: the tail block is shared, so copy its filled slots
        # into a private block before writing the new token.
        new_block = free_blocks.pop()
        copy_prefix_kv(src=block, dst=new_block, upto=offset)
        refcount[block] -= 1
        refcount[new_block] = 1
        table[-1] = new_block
        block = new_block

    write_kv(block, offset=offset, kv=token_kv)
    req.pos += 1

The point here is not the Python code itself but two invariants:

\[ \sum_b \mathbf{1}[\text{refcount}_b>0]+|\text{free\_blocks}|=N_{\text{blk}}, \]

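and the logical order of any request must be determined solely by its block_table, never by the numeric order of physical block ids. Otherwise, after release, reuse, or prefix sharing, the attention kernel may read another request's KV.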

Pitfall: KV Leaks Can Look Like Capacity Decay

If finished, cancelled, or preempted requests do not release every referenced block, the service may pass early load tests and then slowly lose capacity.

A smoke test can be very small:

before = allocator.free_count()
reqs = [allocator.new_request(max_tokens=257) for _ in range(8)]
for req in reqs:
    for _ in range(req.max_tokens):
        allocator.append(req)
for req in reqs:
    allocator.free(req)
assert allocator.free_count() == before
assert allocator.active_refcount_sum() == 0
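The smoke test above assumes an allocator object that the text does not define. A minimal sketch of one follows, so that the same check can actually run. The class `BlockAllocator` and its method names are hypothetical; it tracks block ownership only (no tensors, no prefix sharing), and requests are plain integer ids.

```python
class BlockAllocator:
    """Toy paged-KV allocator: block bookkeeping only, no tensor storage."""

    def __init__(self, num_blocks, block_size):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.refcount = [0] * num_blocks
        self.block_table = {}  # req_id -> list of physical block ids
        self.pos = {}          # req_id -> tokens written so far
        self._next_id = 0

    def new_request(self):
        req_id = self._next_id
        self._next_id += 1
        self.block_table[req_id] = []
        self.pos[req_id] = 0
        return req_id

    def append(self, req_id):
        # Allocate a new block only when the previous one is full.
        if self.pos[req_id] % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("out of KV blocks")
            block = self.free_blocks.pop()
            self.refcount[block] = 1
            self.block_table[req_id].append(block)
        self.pos[req_id] += 1

    def free(self, req_id):
        for block in self.block_table.pop(req_id):
            self.refcount[block] -= 1
            if self.refcount[block] == 0:
                self.free_blocks.append(block)
        del self.pos[req_id]

    def free_count(self):
        return len(self.free_blocks)

    def active_refcount_sum(self):
        return sum(self.refcount)


alloc = BlockAllocator(num_blocks=256, block_size=16)
before = alloc.free_count()
reqs = [alloc.new_request() for _ in range(8)]
for r in reqs:
    for _ in range(257):  # 257 = 16 * 16 + 1, so each request spills into a 17th block
        alloc.append(r)
in_use = before - alloc.free_count()
for r in reqs:
    alloc.free(r)
```

Choosing 257 tokens is deliberate: one token past a block boundary exercises the partially filled last block, which is where release bugs tend to hide.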

Throughput, Latency, and Little’s Law

Theorem: Little’s Law

For a stable system, the average number of requests in the system satisfies \[ L=\lambda W, \] where \(\lambda\) is arrival rate and \(W\) is average time spent in the system.

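In serving, this means that if QPS \(\lambda\) rises while the mean completion time \(W\) does not fall, the number of requests in the system \(L\) must grow. More queueing raises TTFT further, producing a latency avalanche.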

Proof Sketch

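Observe a long time window \([0,T]\). If the system is stable, the numbers of arrivals and completions are approximately equal, \(N\approx \lambda T\). Each request stays in the system for time \(W_i\), so the total occupied area is \(\sum_i W_i\). The average number of requests in the system is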

\[ L\approx \frac{1}{T}\sum_i W_i = \frac{N}{T}\cdot \frac{1}{N}\sum_i W_i = \lambda W. \]

So capacity planning cannot look only at peak tokens/sec. It must also consider tail latency, arrival bursts, the prompt length distribution, and the output length distribution.
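Little's Law can be combined with a concurrency limit for a quick sanity check. The numbers below are illustrative assumptions: the limit of 29 concurrent sequences reuses the earlier capacity example, and the 4-second mean latency is invented.

```python
def in_system_requests(arrival_rate_per_s, mean_latency_s):
    """L = lambda * W."""
    return arrival_rate_per_s * mean_latency_s


def max_sustainable_rate(max_concurrent, mean_latency_s):
    """Largest lambda for which L stays within the concurrency limit."""
    return max_concurrent / mean_latency_s


print(in_system_requests(5, 4.0))     # 20.0 requests in flight on average
print(max_sustainable_rate(29, 4.0))  # 7.25 requests per second
```

If arrivals exceed about 7.25 requests per second at this latency, the excess has to wait in a queue or be rejected, and the waiting itself increases \(W\).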

Single GPU, Multi-GPU, and Fragmented GPUs

The deployment strategy depends on model size and on the shape of the available GPUs.

Setup	Strategy	Good for	Risk
one model fits one GPU	replica per GPU	simple serving, high availability	limited per-request context/batch
tensor parallel	split layers’ matrix ops across GPUs	model too large for one GPU	communication overhead
pipeline parallel	split layers across GPUs	very large models	bubble and scheduling complexity
CPU/offload	move weights/KV partly off GPU	extreme memory pressure	high latency
fragmented single GPUs	many independent workers	batch inference, rollouts, data generation	load balancing and stragglers

Definition: Fragmented GPU Serving

Fragmented GPU serving uses many small or independently scheduled GPU workers as separate replicas, coordinated by a queue or orchestration layer, rather than treating them as one tightly coupled multi-GPU device.

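With fragmented GPUs, the safest default strategy is usually:

run one worker per GPU;
keep tensor parallel size at 1;
dispatch from the request queue by each worker's available KV budget and the expected output length;
use a task queue for offline generation jobs;
use replica health checks and admission control for online traffic;
avoid forcing unstable or poorly interconnected GPUs into a synchronous training job.

Only if the model does not fit on a single GPU should you consider tensor parallelism, quantization, offload, or a smaller model. A framework can schedule fragmented GPUs, but it cannot remove the cost of cross-machine communication.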

Pitfall: Fragmented GPUs Are Not a Free Large GPU

Schedulers can coordinate many single-GPU workers, but they cannot make weakly connected devices behave like one high-bandwidth multi-GPU node.

Heterogeneous Worker Placement

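In a fragmented-GPU environment, the effective capability of each card can differ: memory, compute, network, current queue, loaded models, and usable context length all vary. Treating every worker as a homogeneous replica causes two kinds of problems:

small cards receive long-context requests and frequently reject or preempt;
large cards fill up with short requests, leaving nowhere for requests that really need large memory.

Worker state can be written as a resource vector: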

\[ c_w= \left( M_w^{\text{free}}, B_w^{\text{free}}, Q_w^{\text{prefill}}, D_w^{\text{decode}}, L_w^{\max}, \rho_w \right), \]

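where \(M_w^{\text{free}}\) is free GPU memory, \(B_w^{\text{free}}\) is the number of free KV blocks, \(Q_w^{\text{prefill}}\) is the number of tokens in the prefill queue, \(D_w^{\text{decode}}\) is the number of active decode sequences, \(L_w^{\max}\) is the maximum context the worker supports, and \(\rho_w\) is a recent error rate or health penalty.

The demand vector of request \(r\) can be written as: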

\[ d_r= \left( T_r^{\text{prompt}}, T_r^{\text{reserve}}, \text{model}_r, \text{adapter}_r, \text{slo}_r \right). \]

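Feasibility comes before scoring (here \(B_r^{\text{reserve}}=\lceil T_r^{\text{reserve}}/P\rceil\) is the block count implied by the token reservation):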

\[ \operatorname{feasible}(w,r) = \mathbf{1}\left[ T_r^{\text{reserve}}\le L_w^{\max} \land B_r^{\text{reserve}}\le B_w^{\text{free}} \land \text{model}_r=\text{model}_w \right]. \]

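Only then compute a placement score. One practical idea is least slack: place a long request on a worker that can just barely hold it, rather than always sending it to the largest card.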

def feasible(worker, req):
    if req.model != worker.model:
        return False
    if req.total_reserved_tokens > worker.max_len:
        return False
    if req.reserved_blocks > worker.free_blocks:
        return False
    if req.adapter not in worker.supported_adapters:
        return False
    return True

def placement_score(worker, req):
    kv_slack = worker.free_blocks - req.reserved_blocks
    len_slack = worker.max_len - req.total_reserved_tokens
    adapter_miss = req.adapter not in worker.loaded_adapters
    return (
        0.05 * kv_slack
        + 0.001 * len_slack
        + 2.0 * worker.prefill_queue_tokens
        + 1.0 * worker.active_decode_sequences
        + 50.0 * worker.error_rate_1m
        + 20.0 * adapter_miss
    )

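The lowest score wins. kv_slack and len_slack carry positive weights to penalize placing a small request into an overly large free space; queue and error carry positive weights to steer away from hot workers. Real weights need calibration under load testing, but the structure should keep "can it fit" separate from "where does it fit best".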
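The split between feasibility and scoring can be exercised end to end with a reduced version of the functions above. This is a sketch under assumptions: the `Worker` and `Request` dataclasses are hypothetical and carry only the fields needed here, and the weights are illustrative rather than calibrated.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    free_blocks: int
    max_len: int
    prefill_queue_tokens: int = 0


@dataclass
class Request:
    reserved_blocks: int
    total_reserved_tokens: int


def feasible(w, r):
    return (r.total_reserved_tokens <= w.max_len
            and r.reserved_blocks <= w.free_blocks)


def score(w, r):
    # Lower is better: a tight fit (small slack) and a short prefill queue.
    return 0.05 * (w.free_blocks - r.reserved_blocks) + 2.0 * w.prefill_queue_tokens


def place(workers, r):
    """Filter by feasibility first, then pick the lowest score; None if nothing fits."""
    candidates = [w for w in workers if feasible(w, r)]
    if not candidates:
        return None
    return min(candidates, key=lambda w: score(w, r))


workers = [
    Worker("small", free_blocks=300, max_len=8192),
    Worker("large", free_blocks=4000, max_len=131072),
]
short_req = Request(reserved_blocks=64, total_reserved_tokens=1024)
long_req = Request(reserved_blocks=2048, total_reserved_tokens=32768)
huge_req = Request(reserved_blocks=5000, total_reserved_tokens=80000)

print(place(workers, short_req).name)  # small
print(place(workers, long_req).name)   # large
print(place(workers, huge_req))        # None
```

The short request lands on the small worker even though the large one is idle, which keeps the large worker's blocks available for the long request.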

Operational Rule: Reserve Large Workers for Large Contexts

On heterogeneous GPUs, keep some high-memory workers protected for long-context or high-output requests. Otherwise short interactive traffic can fragment the only workers that can admit large jobs.

Routing and Admission Control

A practical router does more than round-robin. It should observe:

Signal	Meaning
free KV blocks	whether a worker can admit long contexts
active decode count	expected TPOT pressure
prefill queue length	expected TTFT pressure
recent OOM/preemption	worker health
model/version	compatibility with request
quantization/context limit	capability constraints

A simple strategy:

def score(worker, request):
    if request.prompt_len + request.max_new_tokens > worker.max_len:
        return None
    if worker.free_kv_tokens < request.reserve_tokens:
        return None
    return (
        2.0 * worker.prefill_queue_tokens
        + 1.0 * worker.active_decode_sequences
        - 0.5 * worker.free_kv_tokens
        + 10.0 * worker.recent_error_rate
    )

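The worker with the lowest score takes the request. Real systems are more complex, but the principle is the same: routing should target resource bottlenecks, not just the number of workers.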

Admission as a Reservation Problem

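The most common admission control mistake is to look only at current GPU memory and ignore future generation length. When a request enters the system, its prompt length is known, but its output length is known only as an upper bound or an estimate. The runtime usually reserves for it: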

\[ T_{\text{reserve}} = T_{\text{prompt}}+\hat{T}_{\text{output}}, \]

where \(\hat{T}_{\text{output}}\) can be max_new_tokens, or a quantile estimate from the historical distribution:

\[ \hat{T}_{\text{output}} = Q_{0.95}\left(T_{\text{output}}\mid \text{route},\text{model},\text{request class}\right). \]

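Conservative reservation lowers utilization; optimistic reservation can run out of KV mid-flight, forcing preemption, swap, or rejection. An interpretable admission rule can be written as: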

def reserve_tokens(request, policy):
    if policy == "hard_cap":
        return request.prompt_len + request.max_new_tokens
    if policy == "p95":
        expected = output_len_p95(
            model=request.model,
            route=request.route,
            request_class=request.request_class,
        )
        return request.prompt_len + min(request.max_new_tokens, expected)
    raise ValueError(policy)

Then convert the reservation into a number of KV blocks:

\[ \Delta B_{\text{KV}} = \left\lceil \frac{T_{\text{reserve}}}{P} \right\rceil, \]

where \(P\) is the block size. The request is accepted only if the worker has enough free blocks and the estimated queueing delay stays within the SLO.
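Both conditions can be combined into one admission check. The sketch below is illustrative: the function name, the reason strings, and the idea of comparing an estimated queue delay against a TTFT target are assumptions for the example, not a fixed protocol.

```python
def blocks_for(tokens, block_size):
    """ceil(T_reserve / P) using integer arithmetic."""
    return -(-tokens // block_size)


def admit(prompt_len, est_output_len, max_new_tokens,
          free_blocks, block_size, est_queue_ms, slo_ttft_ms):
    """Return (accepted, reason, blocks_needed)."""
    reserve = prompt_len + min(max_new_tokens, est_output_len)
    need = blocks_for(reserve, block_size)
    if need > free_blocks:
        return False, "insufficient_kv", need
    if est_queue_ms > slo_ttft_ms:
        return False, "slo_risk", need
    return True, "ok", need


# 1832 prompt tokens plus an estimated 700 output tokens, 16-token blocks.
print(admit(1832, 700, 2048, free_blocks=200, block_size=16,
            est_queue_ms=40, slo_ttft_ms=500))   # (True, 'ok', 159)
print(admit(1832, 700, 2048, free_blocks=100, block_size=16,
            est_queue_ms=40, slo_ttft_ms=500))   # (False, 'insufficient_kv', 159)
```

Returning an explicit reason lets the router distinguish "try another worker" (not enough KV) from "shed or queue elsewhere" (latency at risk).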

Definition: SLO-Aware Admission

SLO-aware admission accepts a request only when both memory reservation and expected latency satisfy the service-level objective for that request class.

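This turns serving from "take every request as it comes" into a resource contract: an admitted request must have a high enough probability of completing, rather than deferring OOM and timeouts to the middle of decode.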

Request Classes and Fairness

Real systems usually carry several request types:

Class	Example	Primary objective
interactive chat	user waiting in UI	low TTFT and stable TPOT
agent loop	tool call waiting for next step	stop correctness and moderate latency
batch generation	offline synthetic data	throughput and cost
evaluation	benchmark jobs	determinism and reproducibility
background summarization	async tasks	opportunistic capacity

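If queueing is purely by arrival order, long-prompt batch jobs can block online chat; if short requests always go first, long-context tasks can starve indefinitely. A simple approach is to keep one queue per class and give each class a token budget share: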

interactive: 50% decode budget, 30% prefill budget
agent:       25% decode budget, 20% prefill budget
batch:       25% decode budget, 50% prefill budget

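On every tick, the scheduler can apply shortest-prefill-first or earliest-deadline-first within each class. The sketch below uses earliest-deadline-first with prompt length as the tie-breaker: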

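def pick_prefill(queue, now, budget):
    # Earliest deadline first; ties go to the shorter prompt.
    # sorted() leaves the caller's queue order untouched.
    ordered = sorted(queue, key=lambda r: (r.deadline_ms - now, r.prompt_len))
    picked = []
    used = 0
    for req in ordered:
        if used + req.next_chunk_tokens > budget:
            continue  # this chunk does not fit; a smaller later chunk still might
        picked.append(req)
        used += req.next_chunk_tokens
    return picked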
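The per-class budget shares listed earlier have to be turned into integer token budgets on each tick. One possible way to do that is sketched below; the `SHARES` table mirrors the percentages in the text, while the function name and the rule for assigning rounding leftovers are assumptions.

```python
SHARES = {
    "interactive": {"decode": 0.50, "prefill": 0.30},
    "agent": {"decode": 0.25, "prefill": 0.20},
    "batch": {"decode": 0.25, "prefill": 0.50},
}


def split_budget(total_tokens, kind, shares=SHARES):
    """Split a per-tick token budget across classes.

    Shares are rounded down; tokens lost to rounding go to the first
    class listed (an arbitrary choice made for this sketch).
    """
    out = {cls: int(total_tokens * s[kind]) for cls, s in shares.items()}
    leftover = total_tokens - sum(out.values())
    first = next(iter(out))
    out[first] += leftover
    return out


print(split_budget(2048, "prefill"))  # {'interactive': 615, 'agent': 409, 'batch': 1024}
print(split_budget(256, "decode"))    # {'interactive': 128, 'agent': 64, 'batch': 64}
```

A real scheduler would probably also lend unused budget from an idle class to a busy one; this sketch only shows the static split.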

Pitfall: Throughput-Only Scheduling Can Break Product Semantics

Maximizing tokens/sec can starve latency-sensitive requests. Serving policies should be written against request classes and SLOs, not just aggregate throughput.

Prefix, Adapter, and Grammar-Aware Routing

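Routing must also consider compatibility. Whether two requests can share a batch depends not only on the model name but also on:

tokenizer and chat template;
LoRA/adapter id;
quantization and KV dtype;
grammar or JSON schema;
speculative draft model;
prefix cache hit probability;
maximum context length.

For example, in multi-tenant LoRA serving a worker may load a base model and several adapters at once. Routing a request to a worker that already has the target adapter loaded avoids adapter load latency; but if the KV of that worker is nearly full, forcing the route there sacrifices TTFT.

The routing score can be split into several terms: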

\[ S(w,r) = \alpha Q_w +\beta D_w -\gamma K_w -\delta C_{w,r} -\eta A_{w,r}, \]

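where \(Q_w\) is queue tokens, \(D_w\) is the active decode count, \(K_w\) is free KV tokens, \(C_{w,r}\) is the prefix-cache benefit, and \(A_{w,r}\) is the benefit of the adapter already being loaded. Lower scores are better. The sketch below also adds a health penalty on the recent error rate, which the formula omits.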

def route_score(worker, req):
    if req.model != worker.model:
        return None
    if req.adapter not in worker.supported_adapters:
        return None
    if req.total_len > worker.max_len:
        return None

    prefix_bonus = worker.prefix_cache.estimated_hit_tokens(req.prefix_hash)
    adapter_bonus = 1.0 if req.adapter in worker.loaded_adapters else 0.0
    return (
        2.0 * worker.prefill_queue_tokens
        + 1.0 * worker.active_decode_sequences
        - 0.25 * worker.free_kv_tokens
        - 0.5 * prefix_bonus
        - 20.0 * adapter_bonus
        + 50.0 * worker.error_rate_1m
    )

Pitfall: Prefix Cache Hits Are Conditional

A prefix cache hit is valid only when tokens, position scheme, adapter, system prompt, model weights, and relevant decoding context match. Textual prefix equality alone is not enough.

Quantization, Speculation, and Trade-Offs

Common serving optimization techniques:

Technique	Saves	Cost
weight quantization	weight memory and bandwidth	quality risk, kernel constraints
KV quantization	KV memory and decode bandwidth	attention accuracy risk
GQA/MQA	KV cache size	architecture-level quality trade-off
speculative decoding	decode latency	draft model overhead and acceptance variance
prefix cache	repeated prompt TTFT	cache invalidation and memory
batching	throughput	per-request latency can rise
offload	GPU memory	PCIe/network latency

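These optimizations are not independent of each other. For example, quantization frees memory and so raises batch capacity, but if the kernels are not fast enough, TPOT does not necessarily improve; speculative decoding may not pay off for short answers, because draft/verify overhead can exceed the gain.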

Streaming, Cancellation, and Backpressure

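Online serving does not just return one final string. Most interactive applications need streaming, and streaming brings client state into the runtime. A request has at least these states: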

QUEUED -> PREFILLING -> DECODING -> STREAMING -> FINISHED
                 \          \          \
                  \          \          -> CANCELLED
                   \          -> PREEMPTED
                    -> REJECTED

Definition: Backpressure

Backpressure occurs when downstream consumers, such as HTTP clients or stream buffers, cannot receive tokens as fast as the model runtime produces them.

If the client is slow, the server has three policies to choose from:

Policy	Behavior	Risk
buffer	keep generated tokens in memory	memory growth
throttle	delay future decode steps for that request	lower GPU utilization
disconnect	cancel slow client after timeout	user-visible failure

For a streaming interface, finish_reason should also be part of the protocol:

Finish reason	Runtime action
`eos`	normal cleanup, release KV
`length`	stop at max tokens, release KV
`stop_sequence`	stop after parser confirms boundary
`cancelled`	stop decode immediately, release KV
`timeout`	release KV, record timeout class
`preempted`	either swap state or return retryable error

Cancellation is especially prone to leaking memory. A correct cancellation handler must at least do the following:

def cancel_request(req, reason):
    req.state = "CANCELLED"
    scheduler.remove(req.id)
    stream.close(req.id, reason=reason)
    allocator.free(req.id)
    metrics.count("request_cancelled", reason=reason)

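The order matters: first remove the request from the scheduler, so the next decode round cannot select it; then close the stream, so the client sees termination; and only then release the KV blocks. If KV is released while the request is still in the active set, the next attention step may read a block that another request has already reused.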

Pitfall: Cancellation Is a Memory-Safety Path

Cancellation, timeout, and client disconnect must go through the same KV cleanup path as normal EOS. Treat them as first-class finish reasons, not exceptional side branches.

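Streaming also affects how metrics should be read. A request can have a low model TPOT while the user sees long gaps between tokens, simply because the network or the client reads slowly. Traces should therefore record the two separately: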

\[ \text{TPOT}_{\text{model}} = \frac{t_{\text{last\_sample}}-t_{\text{first\_sample}}} {N_{\text{output}}-1}, \qquad \text{TPOT}_{\text{client}} = \frac{t_{\text{last\_sent}}-t_{\text{first\_sent}}} {N_{\text{output}}-1}. \]

If the two differ widely, optimizing the attention kernel will not help what the user sees; look at the stream buffer, the network, and client backpressure instead.
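Both TPOT values are the same computation applied to different timestamp series. A minimal sketch, with invented timestamps in milliseconds and a hypothetical function name:

```python
def tpot_ms(timestamps_ms):
    """Mean gap between consecutive tokens: (t_last - t_first) / (N - 1)."""
    if len(timestamps_ms) < 2:
        raise ValueError("need at least two tokens")
    return (timestamps_ms[-1] - timestamps_ms[0]) / (len(timestamps_ms) - 1)


sampled_at = [0, 5, 10, 15, 20]  # hypothetical times the runtime sampled each token
sent_at = [1, 6, 40, 41, 81]     # hypothetical times each token left the server

print(tpot_ms(sampled_at))  # 5.0
print(tpot_ms(sent_at))     # 20.0
```

In this made-up trace the model produces a token every 5 ms while the client receives one every 20 ms on average, so the gap points at the delivery path rather than the kernels.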

Observability

A serving system should record at least:

QPS and tokens/s;
TTFT p50/p95/p99;
TPOT p50/p95/p99;
prompt length and output length histograms;
active sequences;
KV block usage and fragmentation;
prefill queue tokens;
decode batch size;
OOM, preemption, timeout, and client disconnect counts;
per-worker GPU memory, SM utilization, and memory bandwidth.

Debugging Rule

When latency regresses, split it into queueing time, prefill time, decode time, and streaming time before changing model code.

A common diagnosis table:

Symptom	Likely cause
TTFT high, TPOT normal	prefill queue too long or prompts too long
TTFT normal, TPOT high	decode batch too large or KV bandwidth saturated
sudden OOM	admission reserve too optimistic
GPU low utilization, high latency	CPU/tokenizer/router bottleneck
output stalls in bursts	streaming backpressure or scheduler imbalance
capacity slowly drops	KV blocks not released or fragmentation

Trace Schema for Debugging

Aggregate metrics alone are not enough. When one request is slow, you need to be able to see which step was slow. A minimal trace can look like this:

{
  "request_id": "req_123",
  "model": "qwen3-8b",
  "adapter": "base",
  "prompt_tokens": 1832,
  "output_tokens": 256,
  "reserved_kv_tokens": 4096,
  "queue_ms": 42.1,
  "prefill_ms": 310.4,
  "decode_ms": 1180.2,
  "stream_ms": 37.8,
  "ttft_ms": 355.0,
  "tpot_ms": 4.61,
  "finish_reason": "eos",
  "worker_id": "gpu-03",
  "kv_blocks_peak": 128,
  "preemptions": 0
}

This trace lets you split a user complaint of "slow" into four different problems:

Dominant field	Interpretation
`queue_ms` high	router/admission overloaded
`prefill_ms` high	long prompt, weak prefill batching, CPU tokenization
`decode_ms` high	output long, TPOT high, KV bandwidth bottleneck
`stream_ms` high	client/network backpressure

Debugging Rule

Always debug latency from per-request traces first, then aggregate metrics. Averages hide whether the problem is queueing, prefill, decode, or streaming.
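The dominant-field table can be applied mechanically to a trace. The sketch below uses the four stage timings from the example trace; the function name is hypothetical, and a real pipeline would likely compare against per-class baselines rather than take a simple maximum.

```python
STAGE_FIELDS = ("queue_ms", "prefill_ms", "decode_ms", "stream_ms")


def dominant_stage(trace):
    """Return (field name, share of the summed stage time) for the largest stage."""
    stages = {k: trace[k] for k in STAGE_FIELDS}
    total = sum(stages.values())
    name = max(stages, key=stages.get)
    return name, stages[name] / total


trace = {
    "request_id": "req_123",
    "queue_ms": 42.1,
    "prefill_ms": 310.4,
    "decode_ms": 1180.2,
    "stream_ms": 37.8,
}
name, share = dominant_stage(trace)
print(name, round(share, 2))  # decode_ms 0.75
```

For this trace about three quarters of the time is decode, so the next thing to check would be output length and TPOT rather than the router.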

Alerting and Saturation Signals

Alerts for a serving system should not watch GPU memory alone. Saturation signals are more useful:

Alert	Trigger idea	Meaning
TTFT p95 high	`ttft_p95 > SLO` for 5 min	new requests wait too long
TPOT p95 high	`tpot_p95 > SLO`	decode saturated
KV pressure	`free_kv_blocks / total < 10%`	admission close to OOM
queue growth	queue tokens increasing monotonically	arrival rate exceeds service rate
preemption spike	preemptions per minute high	memory reservation too optimistic
prefix hit drop	hit rate suddenly lower	template/cache-key drift
worker skew	one worker much hotter	router imbalance

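Little's Law gives a practical test: if the arrival rate \(\lambda\) is unchanged but the average number of in-system requests \(L\) grows, then the average latency \(W\) must have grown. In other words, queue depth is itself a leading indicator of latency.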

Capacity Test Plan

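Before launch, run three kinds of load tests rather than a single tokens/sec measurement:

Steady load: fixed QPS and a fixed prompt/output distribution; check whether TTFT and TPOT stay stable;
Burst load: raise the arrival rate for a short period; observe queue recovery and the overload policy;
Adversarial length: many requests close to max context; observe KV reservation, preemption, and OOM.

A load-test matrix: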

Axis	Values
prompt length	128, 2k, 8k, max context
output cap	64, 512, 2k
request class	chat, agent, batch
sampling	greedy, top-p, grammar constrained
adapter	base, LoRA A, LoRA B
arrival	steady, burst, heavy-tail

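The output of load testing is not a single number but an operating envelope: under which prompt/output distributions the system meets which SLOs, and, past the boundary, whether it queues, preempts, rejects, or degrades.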

Deployment Checklist

Before launching an LLM serving endpoint, make at least the following explicit:

model dtype and quantization;
max prompt length, max output length, max total length;
KV dtype and block size;
per-worker memory reserve;
prefill/decode scheduling policy;
admission control rule;
timeout and cancellation behavior;
streaming protocol;
observability metrics;
fallback or overload behavior;
request classes and SLOs;
routing compatibility keys;
per-request trace schema;
capacity-test envelope;
KV allocator leak tests and block-table invariants;
cancellation, timeout, and client-disconnect cleanup paths.

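Training notes often say "just deploy the model", but real serving is a resource management problem. Model quality sets the ceiling on answers; runtime design decides whether that ceiling can be delivered reliably under real traffic.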