LLM Inference Walkthrough — Tensor Shapes and Core Formulas Across the Whole Pipeline

Lay out a Llama 3-style dense decoder-only model end to end, from embedding to sampling, noting where the common variants (MHA / MQA / GQA / MLA, RoPE / ALiBi, RMSNorm / LayerNorm, SwiGLU / GeGLU, Flash Attention / Paged Attention) plug into the math along the way — so that after reading you can draw this diagram from memory. One fact frames the whole article: this architecture has barely changed in nearly a decade; every "inference optimization" is a local surgery somewhere on the same skeleton.

Notation Conventions — Llama 3 8B as the Reference

The same notation is used throughout, with Llama 3 8B as the concrete example:

Symbol | Meaning | Example value (Llama 3 8B)
B | batch size | 2
S | prompt length | 10
L | number of layers | 32
H | hidden dim | 4096
V | vocab size | 128256
n_q | number of Q heads | 32
n_kv | number of KV heads (GQA) | 8
d | per-head dim = H / n_q | 128
I | FFN intermediate dim | 14336
T | number of generated tokens | 100
t | current decode step | 1..T

All shape annotations follow PyTorch convention, written as [B, ..., H]; weight matrices follow the "in × out" convention, i.e. W has shape [in, out].

Core Formulas Quick Reference — Embedding · Norm · Attn · FFN · LM Head

Embedding

\mathbf{x}_i = E[\text{token\_id}_i] \in \mathbb{R}^{H}

E has shape [V, H] and is essentially a lookup table. Many implementations share E with the LM head's W_lm (tied embeddings), saving memory and providing mild regularization; Llama models do not tie them by default.
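
A minimal PyTorch sketch of the lookup, plus what weight tying would look like (the `embed` / `lm_head` names are illustrative, not from any particular codebase):

```python
import torch
import torch.nn as nn

V, H = 128256, 4096
embed = nn.Embedding(V, H)                    # E with shape [V, H]
token_ids = torch.randint(0, V, (2, 10))      # input_ids [B, S]
x = embed(token_ids)                          # [B, S, H] = [2, 10, 4096]

# Tied embeddings (not Llama's default): reuse E as the output projection W_lm.
lm_head = nn.Linear(H, V, bias=False)
lm_head.weight = embed.weight                 # shares the same [V, H] storage
```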

Normalization: LayerNorm vs RMSNorm

Standard LayerNorm:

\text{LN}(\mathbf{x}) = \frac{\mathbf{x} - \mu}{\sqrt{\sigma^{2} + \epsilon}} \odot \boldsymbol{\gamma} + \boldsymbol{\beta}, \quad \mu = \tfrac{1}{H}\sum x_i,\ \sigma^{2} = \tfrac{1}{H}\sum (x_i - \mu)^{2}

RMSNorm (the mainstream choice for Llama / Mistral / Qwen):

\text{RMSNorm}(\mathbf{x}) = \frac{\mathbf{x}}{\sqrt{\frac{1}{H}\sum_{i=1}^{H} x_i^{2} + \epsilon}} \odot \boldsymbol{\gamma}

RMSNorm drops the mean and the β term, roughly halving both the norm's compute and its parameters; empirically the quality cost is nearly zero. Mainstream decoder architectures all use this in a pre-norm arrangement: the norm lives inside the residual branch, and the residual main path bypasses it.
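
A minimal sketch of the formula above (the eps value and dtype handling vary by model; this is illustrative rather than any specific library's module):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """x / sqrt(mean(x^2) + eps) * gamma, as in the formula above; no mean subtraction, no beta."""
    def __init__(self, hidden_dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(hidden_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.gamma

x = torch.randn(2, 10, 4096)
y = RMSNorm(4096)(x)   # shape preserved: [2, 10, 4096]
```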

Q/K/V Projection + Positional Encoding

Q = X W_Q, \quad K = X W_K, \quad V = X W_V

RoPE (Rotary Positional Embedding) applies a position-m rotation matrix to every two-dimensional pair of Q and K:

\text{RoPE}(\mathbf{q}_m, m) = \begin{pmatrix} \cos m\theta_k & -\sin m\theta_k \\ \sin m\theta_k & \phantom{-}\cos m\theta_k \end{pmatrix} \begin{pmatrix} q_{2k} \\ q_{2k+1} \end{pmatrix}, \quad \theta_k = \text{base}^{-2k/d}

base is usually 10000; long-context models (Llama 3.1 128K, Qwen2.5-1M) typically use YaRN / NTK-aware scaling to dynamically enlarge base or apply band-wise scaling to θ_k. RoPE acts only on Q and K, not on V — V is the weighted value and doesn't need positional information.
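
A sketch of the pairwise rotation from the formula above. It uses the interleaved (q_2k, q_2k+1) pairing; some implementations (e.g. Hugging Face's Llama code) use a half-split layout that is equivalent up to a permutation. Function and argument names are illustrative:

```python
import torch

def apply_rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each pair (x_{2k}, x_{2k+1}) by m * theta_k.  x: [B, heads, N, d], positions: [N]."""
    d = x.shape[-1]
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)    # theta_k, shape [d/2]
    angles = positions.float()[:, None] * theta[None, :]                 # m * theta_k, [N, d/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                                  # even / odd element of each pair
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(2, 32, 10, 128)
q_rot = apply_rope(q, torch.arange(10))    # prefill: positions 0..9; at decode, pass [cache_len]
```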

ALiBi (used by BLOOM and MPT) takes a different route: it does not add position into embeddings; instead it adds a linear bias on the attention scores:

\text{scores}_{ij} = \frac{\mathbf{q}_i^{\top}\mathbf{k}_j}{\sqrt{d}} - m_h \cdot (i - j)

m_h is a per-head slope constant. ALiBi extrapolates naturally, but its ceiling is below RoPE + YaRN, so post-2024 it's essentially unused.

Scaled Dot-Product Attention

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + M\right) V

M is the causal mask, with the upper triangle set to −∞, ensuring position i can only attend to positions ≤ i. Dividing by √d keeps the softmax out of its saturated, near-zero-gradient region.
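
A reference implementation of exactly this formula; fused kernels (Flash Attention, discussed later) compute the same result without materializing the S × S score matrix:

```python
import math
import torch

def causal_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """softmax(Q K^T / sqrt(d) + M) V with an upper-triangular -inf mask.  q, k, v: [B, heads, S, d]."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)                        # [B, heads, S, S]
    mask = torch.triu(torch.full(scores.shape[-2:], float("-inf")), diagonal=1)
    return torch.softmax(scores + mask, dim=-1) @ v                        # [B, heads, S, d]

q = k = v = torch.randn(2, 32, 10, 128)
out = causal_attention(q, k, v)   # [2, 32, 10, 128]
```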

Multi-Head Variants: MHA / MQA / GQA / MLA

Variant | n_kv | KV cache savings | Representative models
MHA | = n_q | 1× (baseline) | GPT-2/3, Llama 1/2 7B
GQA | grouped, < n_q | n_q / n_kv | Llama 3, Qwen2, Mistral
MQA | = 1 | n_q | PaLM, Falcon
MLA | low-rank compressed | ~n_q | DeepSeek V2/V3

GQA's math: there are n_kv groups of K and V, each group shared by n_q / n_kv Q heads; kernels typically implement this by broadcasting, not by actually replicating tensors.
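
A shape-level sketch of that grouping, with the broadcast dimension standing in for what fused kernels do; the causal mask is omitted for brevity:

```python
import torch

B, n_q, n_kv, S, d = 2, 32, 8, 10, 128
group = n_q // n_kv                                          # 4 Q heads per KV head
q = torch.randn(B, n_q, S, d)
k = torch.randn(B, n_kv, S, d)
v = torch.randn(B, n_kv, S, d)

# View Q as [B, n_kv, group, S, d]; give K/V a broadcast dim instead of materializing copies.
q_g = q.view(B, n_kv, group, S, d)
scores = q_g @ k.unsqueeze(2).transpose(-2, -1) / d ** 0.5   # [B, n_kv, group, S, S]
out = (torch.softmax(scores, dim=-1) @ v.unsqueeze(2)).reshape(B, n_q, S, d)
```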

MLA (Multi-Head Latent Attention) compresses K and V jointly into a low-rank latent vector c^KV:

\mathbf{c}^{KV}_i = W^{DKV} \mathbf{x}_i, \quad \mathbf{k}^{C}_i = W^{UK}\mathbf{c}^{KV}_i, \quad \mathbf{v}^{C}_i = W^{UV}\mathbf{c}^{KV}_i

Only c^KV (dimension d_c, typically 512) is cached rather than all heads' K and V. The cost is more complex attention computation (RoPE must go through a separate k^R branch), but the per-token cache shrinks from GQA's several KB to a few hundred bytes.

Output Projection + Residual

\mathbf{h} = \mathbf{x} + \text{Attn}(Q', K', V') \cdot W_O

Note that the residual adds the attention sublayer's input x (the value before pre-norm), not the post-norm value.

FFN Variants

Classic bilinear FFN (GPT-2):

\text{FFN}(\mathbf{x}) = \phi(\mathbf{x} W_1) W_2

φ is typically GeLU. Its tanh approximation:

\text{GeLU}(x) \approx 0.5 x \left(1 + \tanh\!\left[\sqrt{\tfrac{2}{\pi}}\left(x + 0.044715 x^{3}\right)\right]\right)

The GLU family (used by Llama, PaLM, Mistral):

\text{GLU}(\mathbf{x}) = \big(\phi(\mathbf{x} W_{\text{gate}}) \odot (\mathbf{x} W_{\text{up}})\big) W_{\text{down}}

For SwiGLU, φ is SiLU, defined as SiLU(z) = z · σ(z). SwiGLU has one more projection than the classic GeLU-FFN (three matrices vs two); to keep the parameter budget comparable, implementations typically set I ≈ (2/3) · 4H (4H being the GPT-2 convention), which for H = 4096 gives ≈ 10923. Llama 3 8B actually uses a larger I = 14336 = 3.5H.
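
A minimal sketch of the SwiGLU block with Llama 3 8B's dimensions (module and attribute names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """(SiLU(x W_gate) ⊙ (x W_up)) W_down, as in the GLU formula above with φ = SiLU."""
    def __init__(self, hidden_dim: int = 4096, intermediate_dim: int = 14336):
        super().__init__()
        self.w_gate = nn.Linear(hidden_dim, intermediate_dim, bias=False)
        self.w_up = nn.Linear(hidden_dim, intermediate_dim, bias=False)
        self.w_down = nn.Linear(intermediate_dim, hidden_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 10, 4096)
y = SwiGLU()(x)   # [2, 10, 4096]; the [2, 10, 14336] intermediate never leaves the block
```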

Residual

\mathbf{x}_{\text{out}} = \mathbf{h} + \text{FFN}(\text{RMSNorm}(\mathbf{h}))

LM Head + Sampling

\text{logits} = \mathbf{x}_{\text{final}} W_{\text{lm}}, \quad \mathbf{p} = \text{softmax}(\text{logits}/T)

T here is the temperature (overloading the T used earlier for the number of generated tokens). Several logits transformations are typically applied before sampling:

Repetition penalties (repetition / frequency / presence penalty):

\text{logits}'_v = \text{logits}_v - \alpha \cdot \mathbb{1}[v \in \text{history}] - \beta \cdot \text{count}(v)

Top-k: keep the largest k logits, set the rest to −∞.

Top-p (nucleus): sort by probability descending, keep the smallest prefix whose cumulative probability reaches p.

Min-p: keep tokens with p_v ≥ p_min · p_max; friendlier to low-entropy distributions.

Typical-p: truncate by deviation from the conditional entropy, keeping the set where |−log p_v − H(p)| is small.

All of these truncations act on the logits or on the resulting probability distribution; none of them changes the skeleton of the formula.
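
A sketch of a typical temperature → top-k → top-p pipeline; the default thresholds here are illustrative, and real engines also fold in the repetition penalties above:

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature=0.8, top_k=50, top_p=0.9):
    """Temperature -> top-k -> top-p truncation -> multinomial sample.  logits: [B, V]."""
    logits = logits / temperature
    # Top-k: keep the k largest logits, mask the rest to -inf.
    kth = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
    logits = logits.masked_fill(logits < kth, float("-inf"))
    # Top-p: drop the tail of the sorted cumulative distribution (always keep the top token).
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    probs = torch.softmax(sorted_logits, dim=-1)
    remove = probs.cumsum(dim=-1) - probs > top_p
    sorted_logits = sorted_logits.masked_fill(remove, float("-inf"))
    logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)
    return torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)   # [B, 1]

next_token = sample_next_token(torch.randn(2, 128256))
```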

Prefill Stage Shape Transitions — S Tokens Walk Through Once

Input: input_ids [B, S] = [2, 10]. The figure below shows the forward pass of one Transformer layer, repeated 32 times; the shape stays at [B, S, H] = [2, 10, 4096] throughout, with residual edges shown as orange dashed lines.

[Figure] input_ids [2, 10] → Embedding lookup E[V, H] → x [2, 10, 4096] → (× 32 layers) RMSNorm → Q/K/V proj + RoPE: Q [2, 32, 10, 128], K/V [2, 8, 10, 128] (GQA) → write KV cache cache[l][:, :, 0:10, :] → attention: scores [2, 32, 10, 10] + causal mask → ⊕ residual → h [2, 10, 4096] → RMSNorm → SwiGLU FFN, intermediate [2, 10, 14336] → ⊕ residual → x_out [2, 10, 4096] → final RMSNorm → take last position [2, 4096] → LM head, W_lm [H, V] → logits [2, 128256] → sampling (temperature / top-k / top-p) → next_token [2, 1] (the first output token).
Prefill — shape transitions and core formulas of a single Transformer layer. Orange dashed lines are pre-norm residuals; ⊕ marks residual merges; the left bracket marks ”× 32 layers.”

After prefill, the KV cache state: each layer holds K and V for the first 10 positions.

Decode Stage Shape Transitions — Step t Processes Only 1 Token

Prior state: cache_len = S + t − 1 positions filled.

Input: input_ids [B, 1] = [2, 1] (the 1 token generated at the previous step).

[Figure] input_ids [2, 1] (the token generated at the previous step) → Embedding → x [2, 1, 4096] → (× 32 layers) RMSNorm → Q/K/V proj for the 1 new token: Q_new [2, 32, 1, 128], K_new/V_new [2, 8, 1, 128] → RoPE at pos = cache_len → write cache[l][:, :, cache_len, :], cache_len += 1 → read K_full/V_full [2, 8, cache_len, 128] from cache → attention (no causal mask): scores [2, 32, 1, cache_len] → ⊕ residual → h [2, 1, 4096] → RMSNorm → SwiGLU FFN, intermediate [2, 1, 14336] → ⊕ residual → x_out [2, 1, 4096] → final RMSNorm → LM head → logits [2, 128256] → sampling → next_token [2, 1], fed back as the next step's input.
Decode — step t processes only 1 token, with core formulas; the KV Cache keeps growing; the blue dashed line shows the generated next_token feeding back as the next step’s input.

Prefill vs Decode Shape Comparison — GEMM vs GEMV · Compute vs Bandwidth

Position | Prefill | Decode (per step)
input_ids | [B, S] | [B, 1]
after embedding | [B, S, H] | [B, 1, H]
Q | [B, n_q, S, d] | [B, n_q, 1, d]
K_new / V_new | [B, n_kv, S, d] | [B, n_kv, 1, d]
K_full / V_full (from cache) | same as K_new | [B, n_kv, cache_len, d]
attention scores | [B, n_q, S, S] | [B, n_q, 1, cache_len]
attention output | [B, n_q, S, d] | [B, n_q, 1, d]
FFN intermediate | [B, S, I] | [B, 1, I]
logits | [B, V] (last position) | [B, V]
Operation type | GEMM (matrix × matrix) | GEMV (matrix × vector)
Bottleneck | compute | memory bandwidth

This table is the starting point for understanding every inference acceleration effort: prefill is like training’s forward, compute-bound; decode is a chain of GEMVs, memory-bound, with most time spent fetching weights into SMs. The optimization directions of the two are worlds apart.

KV Cache Shape and Growth — Few KB Per Token · Hundreds of MB at Long Context

One pair of caches per layer:

K_{\text{cache}}, V_{\text{cache}} \in \mathbb{R}^{B \times n_{kv} \times S_{\max} \times d}

Per-token, per-layer cache size (fp16):

2 \times n_{kv} \times d \times 2\ \text{bytes} = 2 \times 8 \times 128 \times 2 = 4\ \text{KB}

Per token across the full 32-layer model: 4 KB × 32 = 128 KB/token. A 4096-token request: 128 KB × 4096 ≈ 512 MB.
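
The same arithmetic as a tiny script, using the Llama 3 8B numbers from this article (per sequence, i.e. B = 1):

```python
# KV cache back-of-the-envelope for the Llama 3 8B-style config used throughout (fp16).
n_layers, n_kv, d, bytes_per_elem = 32, 8, 128, 2

per_token_per_layer = 2 * n_kv * d * bytes_per_elem        # K and V: 4096 B = 4 KB
per_token = per_token_per_layer * n_layers                  # 128 KB / token
request_4k = per_token * 4096                               # ~512 MB for a 4096-token request

print(per_token_per_layer / 1024, "KB per token per layer")
print(per_token / 1024, "KB per token, full model")
print(request_4k / 2**20, "MB for 4096 tokens")
```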

Several engineering optimizations target this footprint: GQA / MLA shrink n_kv or the cached dimension at the architecture level; Paged Attention allocates the cache in fixed-size blocks so S_max never has to be reserved up front and fragmentation disappears; KV-cache quantization (fp8 / int8) halves or quarters the bytes per token.

Per-Step Compute and Memory Cost — Llama 3 8B fp16 · H100 Knee ~330 FLOPs/byte

The shape diagrams above show shapes but not magnitudes. Ninety percent of inference-optimization discussion boils down to "how many FLOPs does this step cost, and how many bytes does it move," so the tables below lay out each step's cost directly.

Using Llama 3 8B, fp16, B = 1 as the baseline; for prefill take S = 2048; for decode take cache_len = 2048 (a step around the 2048th generated token).

Reference hardware knee: H100 SXM fp16 theoretical compute ~989 TFLOPS, HBM bandwidth ~3 TB/s, so the roofline knee is AI* ≈ 330 FLOPs/byte. Above it a kernel is compute-bound, below it memory-bound.

Weight Distribution

Component | Shape | fp16 size | Full model (× 32 layers)
Embedding E | [V, H] | 1.0 GB | 1.0 GB
W_Q | [H, H] | 32 MB | 1.0 GB
W_K | [H, n_kv·d] | 8 MB | 256 MB
W_V | [H, n_kv·d] | 8 MB | 256 MB
W_O | [H, H] | 32 MB | 1.0 GB
W_gate | [H, I] | 117 MB | 3.7 GB
W_up | [H, I] | 117 MB | 3.7 GB
W_down | [I, H] | 117 MB | 3.7 GB
RMSNorm γ (2 per layer) | [H] × 2 | 16 KB | 500 KB
LM head W_lm | [H, V] | 1.0 GB | 1.0 GB
Total | | ~432 MB / layer | ~16 GB

Full-model fp16 weights are ~16 GB; the "floor price" of every forward pass is to scan these 16 GB from HBM. At H100's 3 TB/s that scan alone takes 16 / 3000 ≈ 5.3 ms — the physical lower bound on single-request decode latency per token.

Per-Layer, Per-Step Compute / Memory I/O

Compare the same layer's substeps under prefill (N = S) and decode (N = 1). "Weight HBM" is the weight bytes fetched from VRAM; "KV HBM" is the KV-cache bytes read/written. Intermediate activations are assumed fused into kernels and not counted separately.

Step | Prefill FLOPs (S = 2048) | Decode FLOPs (S = 1) | Weight HBM | KV HBM
RMSNorm | 5BSH ≈ 42 MF | 20 KF | γ 8 KB | —
Q_proj | 2BSH² ≈ 68.7 GF | 33.5 MF | W_Q 32 MB | —
K_proj (+ write cache) | 2BSH·n_kv·d ≈ 17.2 GF | 8.4 MF | W_K 8 MB | write 4 MB (P) / 2 KB (D)
V_proj (+ write cache) | 17.2 GF | 8.4 MF | W_V 8 MB | write 4 MB (P) / 2 KB (D)
RoPE | ~50 MF | 25 KF | — | —
Attn QKᵀ | 2B·n_q·N·L_k·d ≈ 34.4 GF | 16.8 MF | — | read 4 MB (D)
softmax | ~700 MF | 260 KF | — | —
Attn ·V | 34.4 GF | 16.8 MF | — | read 4 MB (D)
W_O | 2BSH² ≈ 68.7 GF | 33.5 MF | W_O 32 MB | —
RMSNorm | 42 MF | 20 KF | γ 8 KB | —
W_gate | 2BSHI ≈ 241 GF | 117 MF | W_gate 117 MB | —
W_up | 241 GF | 117 MF | W_up 117 MB | —
SiLU + gate | ~90 MF | 45 KF | — | —
W_down | 2BSIH ≈ 241 GF | 117 MF | W_down 117 MB | —
Per-layer total | ~960 GFLOPs | ~470 MFLOPs | ~432 MB | write 8 MB (P) / read 8 MB (D)

A few direct conclusions: the three FFN matrices account for roughly three quarters of the per-layer FLOPs and about 80% of the per-layer weight bytes; in decode, every substep moves far more bytes than it computes (≈470 MFLOPs against ≈432 MB of weights, about 1 FLOP per byte); and the attention score / value products are the only terms that grow with cache_len, while everything else is constant per step.

One Full Forward Pass

Adding 32 layers + embedding + LM head:

Stage | FLOPs | HBM I/O | Arithmetic intensity | Bottleneck
Prefill, S = 2048, B = 1 | ~31 TFLOPs | ~14 GB (weights) + 256 MB (KV write) | ~2200 FLOPs/byte | compute
Decode step, cache_len = 2048, B = 1 | ~15 GFLOPs | ~14 GB (weights) + 256 MB (KV read) | ~1.05 FLOPs/byte | bandwidth
LM head (prefill, last position only) | ~1 GFLOP | 1 GB | ~1 FLOPs/byte | bandwidth
LM head (decode) | ~1 GFLOP | 1 GB | ~1 FLOPs/byte | bandwidth

Decode's 1.05 FLOPs/byte is about 2.5 orders of magnitude below H100's knee of 330 — meaning ideal single-request decode compute utilization is only 1.05 / 330 ≈ 0.3%. This is the mathematical basis for continuous batching: push B to 32 and the same weight read is amortized across 32 requests; arithmetic intensity scales by 32×, and decode throughput grows almost linearly until the attention portion or compute itself becomes the wall.

Mental-Math Rules

Two rules cover 90% of inference performance estimation:

  1. FLOPs ≈ 2PN: P is the parameter count (~8B); N is the total number of tokens this forward pass processes. Each parameter participates in one MAC per token (1 MAC = 2 FLOPs). E.g., prefill S = 2048: 2 × 8B × 2048 ≈ 33 TFLOPs, matching the itemized sum of ~31 TFLOPs.
  2. Weight HBM I/O ≈ 2P bytes (fp16): one forward pass scans the model once, about 16 GB.

Arithmetic intensity is essentially 2PN / 2P = N — the total number of tokens participating in this forward pass. Prefill has S · B tokens; decode has only B. This single number directly determines why prefill and decode have different bottlenecks.
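
A back-of-the-envelope roofline sketch built only from these two rules and the H100 reference numbers quoted earlier (assumed figures, not measurements):

```python
# Roofline estimate from the two rules: FLOPs ~ 2*P*N, weight bytes ~ 2*P (fp16), AI ~ N.
P = 8e9                 # parameters (Llama 3 8B)
PEAK_FLOPS = 989e12     # H100 SXM fp16, per the reference numbers above
HBM_BW = 3e12           # bytes / s

def step_estimate(n_tokens: int):
    flops = 2 * P * n_tokens
    weight_bytes = 2 * P
    ai = flops / weight_bytes                                 # arithmetic intensity = n_tokens
    seconds = max(flops / PEAK_FLOPS, weight_bytes / HBM_BW)  # whichever roofline wall binds
    return ai, seconds * 1e3                                  # (FLOPs/byte, ms)

print(step_estimate(2048))   # prefill S=2048: AI=2048, ~33 ms, compute-bound
print(step_estimate(1))      # decode,  B=1:   AI=1,    ~5.3 ms, bandwidth-bound
```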

Compute Complexity Overview — With vs Without KV Cache Differ by Three Orders of Magnitude

Prefill (process S tokens at once):

\text{FLOPs} \sim \underbrace{O(L \cdot S \cdot H^{2})}_{\text{linear layers}} + \underbrace{O(L \cdot S^{2} \cdot H)}_{\text{attention}}

Decode per step (process 1 token, history length cache_len):

\text{FLOPs} \sim \underbrace{O(L \cdot H^{2})}_{\text{linear layers, constant}} + \underbrace{O(L \cdot \text{cache\_len} \cdot H)}_{\text{attention, linear in cache}}

Total complexity to generate T tokens:

\text{FLOPs}_{\text{total}} \sim O\!\left(L \cdot T \cdot H^{2} + L \cdot T \cdot (S + T) \cdot H\right)

Without a KV cache, every step re-runs the forward pass over the entire prefix, and the attention term alone grows to O(L · (S+T)³ · H) — a massive difference.

Reading FLOPs and bandwidth together is even more illuminating — that's what the previous section's table shows: prefill pushes against the compute ceiling; decode pushes against the bandwidth ceiling; continuous batching's point is to fuse N requests' decode GEMVs into one skinny GEMM, amortizing the weight-fetch cost across N requests, with throughput rising almost linearly until compute or the attention portion becomes the bottleneck.

How Engineering Optimizations Plug into the Formulas — Flash Attn / Spec Decode / Continuous Batch

Flash Attention: mathematically equivalent to standard attention — the formulas don't change a single character. Engineering-wise it fuses softmax and matmul into one kernel, updating softmax's running statistics (max, sum) in a streaming manner over blocks, avoiding writing the S × S attention matrix back to HBM. Complexity is unchanged; memory drops from O(S²) to O(S); the speedup comes mainly from reduced HBM access. FA-2 shifted the partition granularity from heads to query blocks; FA-3 on H100/H200 adds warpgroup MMA + producer-consumer async pipelining.
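
The streaming-statistics update is the heart of the trick. A single-query, single-head sketch of the online softmax (block size and names are illustrative; real kernels tile both Q and K/V and run this per thread block):

```python
import torch

def online_softmax_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, block: int = 128):
    """Stream over K/V blocks keeping running (max, sum, output); numerically equal to
    softmax(q K^T / sqrt(d)) V without ever materializing the full score vector at once."""
    d = q.shape[-1]
    m = torch.tensor(float("-inf"))     # running max of the scores
    s = torch.tensor(0.0)               # running softmax denominator
    o = torch.zeros(d)                  # running unnormalized output
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        scores = kb @ q / d ** 0.5                      # [block]
        m_new = torch.maximum(m, scores.max())
        scale = torch.exp(m - m_new)                    # rescale the old accumulators
        p = torch.exp(scores - m_new)
        s = s * scale + p.sum()
        o = o * scale + p @ vb
        m = m_new
    return o / s

q, k, v = torch.randn(128), torch.randn(2048, 128), torch.randn(2048, 128)
out = online_softmax_attention(q, k, v)    # matches torch.softmax(k @ q / 128**0.5, 0) @ v
```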

Flash Decoding: at decode time, Flash Attention's Q has only one row, so kernel parallelism is too low. Flash Decoding splits the cache_len dimension of K and V into chunks processed in parallel, then does a final log-sum-exp reduction. The formula is the same softmax, just split into two passes.

Speculative Decoding: a small "draft" model generates k tokens sequentially, then the large model verifies them with one prefill over the k positions. Acceptance rule:

\text{accept with prob } \min\!\left(1, \frac{p_{\text{target}}(x)}{p_{\text{draft}}(x)}\right)

The crux is fusing k decode GEMVs into one k-length GEMM, turning the large model's memory-bound regime back toward compute-bound. With an expected k̄ accepted tokens per step, throughput scales by k̄ (minus draft overhead). Variants: Medusa (multi-head prediction), EAGLE (feature-level draft), Lookahead Decoding (no draft model).
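
A sketch of the accept/reject loop implied by the rule above, following the standard speculative-sampling construction (tensor names are illustrative):

```python
import torch

def accept_draft_tokens(p_target: torch.Tensor, p_draft: torch.Tensor, draft_tokens: torch.Tensor):
    """p_target, p_draft: [k, V] distributions at the k drafted positions; draft_tokens: [k].
    Accept token x with prob min(1, p_target(x) / p_draft(x)); on rejection, resample from
    the normalized residual (p_target - p_draft)^+ and stop."""
    accepted = []
    for i in range(draft_tokens.shape[0]):
        x = draft_tokens[i]
        ratio = torch.clamp(p_target[i, x] / p_draft[i, x], max=1.0)
        if torch.rand(()) < ratio:
            accepted.append(x)
        else:
            residual = torch.clamp(p_target[i] - p_draft[i], min=0.0)
            accepted.append(torch.multinomial(residual / residual.sum(), 1).squeeze(0))
            break
    return torch.stack(accepted)
```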

Continuous Batching (vLLM, TGI): instead of static batches that pad requests to a common length and admit new work only at batch boundaries, scheduling happens per step at the request level. Each step picks a batch of requests in the same phase (prefill or decode) and releases finished ones immediately. Mathematically each request is independent; only the scheduling order changes. Original paper: Orca (OSDI '22).

Chunked Prefill: split long prompts’ prefill into chunks and mix them with decode requests in the same step, reducing decode latency jitter. No formula changes. The core scheduling primitive of SARATHI / DistServe.

One Sentence Spanning the Whole Process — From Token ID to Next Token

Input token IDs → look up embedding → through L layers (pre-RMSNorm → attention with RoPE → residual → pre-RMSNorm → SwiGLU FFN → residual) → final RMSNorm → LM head → logits → sampling. In prefill, S tokens pass through in parallel, producing the first token plus a full KV cache; in decode, each step inputs 1 token; at attention it reads historical K and V from the cache, while all other operations are per-token independent.

Key Engineering Invariants — 6 Rules Worth Memorizing

  1. The main-line tensor shape is always [B, S_current, H]. The residual structure preserves this dimension; wherever H seems to change, it has either been split into heads inside attention or lifted to I inside the FFN, and it returns to H on exit.
  2. K and V, once computed, never change: they are linear projections W_K, W_V of the already-fixed input x, and the causal structure ensures later positions cannot reach back and modify earlier representations. This is the mathematical basis for the KV cache.
  3. Attention is the only cross-token operation; all others (norm, projection, FFN, activation) are per-token independent. So only K and V — the inputs to cross-token operations — need caching; everything else can be computed and discarded immediately.
  4. Decode's scores shape is [B, n_q, 1, cache_len]. The "1" is the Q side (the current new token), and the cache_len dim is eliminated by the weighted sum with V, returning to a single row.
  5. Decode's non-attention compute per step is constant; only attention grows linearly with cache length. So the true reason "generation gets slower over time" is that attention's cache_len keeps growing, and the growing KV cache adds to the bytes each step must pull through HBM.
  6. Prefill uses GEMM; decode uses GEMV. This one-letter difference dictates that every inference engine has two kernel sets, two scheduling strategies. Internalize this and no inference-optimization paper will lose you.

Internalize these six and you'll find that Flash Attention, PagedAttention, MLA, speculative decoding — they're all local optimizations at some spot on this skeleton, while the skeleton itself has barely changed in nearly a decade.

References — Formulas · Papers · Engineering Blogs

Architecture and Core Operators

Positional Encoding

Multi-Head Variants (KV Sharing / Compression)

Flash Attention Series

Inference Engines and Serving Schedulers

Speculative Decoding Family

KV Cache Compression / Quantization

Representative Open-Source Model Technical Reports

Hardware / Roofline

Other Long-Form Articles / Tutorials