The Compute Metrics Landscape: From FLOPs to MFU, Every Number Explained Through Three Generations of Flagship GPUs

The easiest trap when talking about performance is treating a handful of nearly-identical-looking words as synonyms: FLOP, FLOPs, FLOPS, MAC, TOPS… and they differ by far more than a hair. The lowercase-s FLOPs is a “count” (the amount of work); the uppercase-S FLOPS is a “count per second” (compute throughput) — one describes how big a task is, the other how fast a machine is, with a time dimension separating them. Mix the two up and you end up with absurdities like “this card does 2250 TFLOPS, my model is 175 GFLOPs, so it finishes in 0.08 ms” (the real world is several orders of magnitude slower).

Below I lay them out in five categories — compute volume, compute throughput, memory access, efficiency, and deployment — with multiple worked examples per metric, and wherever possible cross-referenced against the last three generations of datacenter flagships (Ampere A100, Hopper H100, Blackwell B200, plus 2026’s Rubin), so the numbers can be compared side by side and read more clearly. Let’s start with a panoramic table.

Category	Metric	Meaning	Unit	Key reminder
Compute volume	FLOP / FLOPs	Total floating-point operations (work)	count (GFLOPs/TFLOPs)	lowercase s = count, don’t confuse with throughput
Compute volume	MAC / MACs	Multiply-accumulate count, the core DL op	count	1 MAC ≈ 2 FLOPs
Compute volume	Params	Number of model parameters (weights)	count (M/B)	drives memory footprint, ≠ compute volume
Compute throughput	FLOPS	Floating-point operations per second (throughput)	TFLOPS	uppercase S = per Second
Compute throughput	Peak FLOPS	Hardware’s theoretical throughput ceiling	TFLOPS	tightly tied to precision, watch for sparsity doubling
Compute throughput	TOPS	Trillion operations per second, usually integer	TOPS	common for INT8 inference / edge chips
Memory access	Memory bandwidth	Memory read/write speed	GB/s, TB/s	LLM decode is usually stuck on this
Memory access	Memory capacity	How large a model/batch fits	GB	decides feasibility
Memory access	Bytes moved	Bytes an operator reads/writes	Bytes	used together with arithmetic intensity
Efficiency	Arithmetic intensity	FLOPs ÷ Bytes, ops per byte	FLOP/Byte	tells whether the bottleneck is compute or bandwidth
Efficiency	Roofline	Analytical model of arithmetic intensity vs. measured throughput	—	distinguishes compute- / memory-bound
Efficiency	GPU utilization	Fraction of time a kernel is running	%	does not mean compute is saturated, easily misread
Efficiency	MFU / HFU	Measured effective throughput ÷ peak throughput	%	the reliable metric for LLM training, 50% is already good
Deployment	Latency / throughput	Single-response time / processed volume per unit time	ms / req·s⁻¹	larger batch: throughput↑ latency↑
Deployment	Efficiency / cost-efficiency	Throughput per watt, throughput per dollar	TFLOPS/W etc.	matters more than peak for large-scale deployment

To give every worked example below a single common yardstick, let me first list the key specs of these generations of flagships together (FP16/BF16 are dense Tensor Core throughput):[1]

Generation	Flagship	BF16 dense peak	Lowest-precision peak	Memory	Memory bandwidth	TDP
Ampere (2020)	A100	312 TFLOPS	624 TOPS (INT8)	80 GB HBM2e	2.0 TB/s	400 W
Hopper (2022)	H100	990 TFLOPS	1979 TFLOPS (FP8)	80 GB HBM3	3.35 TB/s	700 W
Blackwell (2024)	B200	2250 TFLOPS	9000 TFLOPS (FP4)	192 GB HBM3e	8.0 TB/s	1000 W
Rubin (2026)	R100	~8000 TFLOPS	~50000 TFLOPS (FP4)	288 GB HBM4	~13 TB/s	—

FLOP / FLOPs — Compute volume · floating-point operation count

One floating-point addition and one floating-point multiplication each count as one FLOP. The total number of floating-point operations the whole network performs in a single forward pass is its FLOPs. Note that it is an absolute count, with no notion of time — it depends only on the model and the input, not on what hardware you use.

Example: a single CNN forward pass — count it directly with a formula

For a convolutional layer (output feature map $H\times W$ , input/output channels $C_{in}/C_{out}$ , kernel $K$ ):

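$$\text{FLOPs}=H\times W \times (K\times K \times C_{in}+1)\times C_{out}$$

Strictly speaking, this formula counts each multiply-add pair once (the +1 is the bias add), so what it yields is a MAC count; under the 1 MAC ≈ 2 FLOPs convention the FLOP count is twice that. ResNet-50 at 224×224 input is about 4.1 GMACs ≈ 8.2 GFLOPs. But many papers write “ResNet-50 ≈ 4 GFLOPs”: what they actually report is MACs while calling it FLOPs (the next section untangles this confusion specifically).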
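
As a rough sketch of the per-layer count, the snippet below evaluates the convolution formula with the bias term dropped and keeps the MAC and FLOP conventions apart. The function names and the example layer shape (a 7×7 stem convolution, 3 to 64 channels, 112×112 output) are illustrative assumptions, not taken from any particular library.

```python
def conv2d_macs(h_out: int, w_out: int, k: int, c_in: int, c_out: int) -> int:
    """Multiply-accumulates for one conv layer (bias ignored)."""
    return h_out * w_out * k * k * c_in * c_out


def conv2d_flops(h_out: int, w_out: int, k: int, c_in: int, c_out: int) -> int:
    """FLOPs under the 1 MAC = 2 FLOPs convention."""
    return 2 * conv2d_macs(h_out, w_out, k, c_in, c_out)


# Assumed example layer: 7x7 kernel, 3 -> 64 channels, 112x112 output map.
stem_macs = conv2d_macs(112, 112, 7, 3, 64)
stem_flops = conv2d_flops(112, 112, 7, 3, 64)
print(f"{stem_macs / 1e6:.1f} MMACs, {stem_flops / 1e6:.1f} MFLOPs")
```

Summing this over every layer of a network gives its total; which of the two functions a paper effectively used is the thing to check before comparing numbers.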

Example: Transformer training — the 6ND rule of thumb

The total compute for training a large model has an extremely handy empirical estimate: $\text{FLOPs} \approx 6ND$ , where $N$ is the parameter count and $D$ is the number of training tokens (2 for the forward pass, 4 for the backward, 6 in total).[2] Training GPT-3 175B on 300B tokens:

$$6ND = 6\times175\times10^9\times300\times10^9 \approx 3.15\times10^{23}\ \text{FLOPs}$$

Training Llama-3.1 405B on roughly 15.6T tokens gives $6ND \approx 3.8\times10^{25}$ FLOPs — two orders of magnitude larger than GPT-3. This absolute number carries no time by itself; you have to divide by throughput to know how long it takes, as in the FLOPS section below.
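
The two estimates above can be reproduced in a few lines. This is only the $6ND$ rule of thumb applied to the parameter and token counts quoted in the text; the function name is an illustrative choice.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training compute: 2ND forward + 4ND backward = 6ND."""
    return 6.0 * n_params * n_tokens


gpt3 = training_flops(175e9, 300e9)
llama = training_flops(405e9, 15.6e12)

print(f"GPT-3 175B:     {gpt3:.2e} FLOPs")
print(f"Llama-3.1 405B: {llama:.2e} FLOPs")
print(f"ratio:          {llama / gpt3:.0f}x")
```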

Example: prefill for a single inference — 2N per token

At inference, the compute for one forward pass is about $2N$ FLOPs/token ( $N$ being the parameter count). A 70B model processing a 1000-token prompt (the prefill stage) does roughly $2\times70\times10^9\times1000 \approx 1.4\times10^{14}$ FLOPs. Later this number, together with “bytes moved,” is used to compute the arithmetic intensity of prefill.

MAC / MACs — Compute volume · multiply-accumulate

The core operation of deep learning is “multiply once, then add once” (a*b+c), called one MAC (Multiply-Accumulate), and in hardware it is often done by a single FMA instruction.

Example: the 1 MAC ≈ 2 FLOPs conversion

One MAC contains one multiply and one add, so 1 MAC ≈ 2 FLOPs. MobileNetV1 is nominally about 569 MMACs, which converts to ~1.14 GFLOPs. If you take that 569M figure and compare it directly against GFLOPs reported elsewhere, you’d mistakenly think it’s half its actual size.

Example: the “4G vs. 8G” mystery in papers

When academia reports model cost, MACs and FLOPs are frequently mixed up: for ResNet-50 you’ll see both a “4.1G” and an “8.2G” version, differing by exactly this factor of 2 — the former is MACs, the latter FLOPs. Some papers also use the term “Mult-Adds,” which again means MACs. Confirm the convention before reading a number, otherwise every cross-comparison is wrong.

Example: Tensor Cores are essentially MAC arrays

Why do hardware vendors pile up “multiply-accumulate” units instead of general-purpose floating point? Because matrix multiplication is, at bottom, a massive number of MACs. A Tensor Core is a dedicated matrix unit purpose-built for $D = A\times B + C$ , performing the multiply-accumulates of an entire small matrix block per instruction rather than one scalar at a time. This is why the H100’s Tensor Core FP16 throughput (990 TFLOPS) is nearly 15× its CUDA Core FP32 throughput (67 TFLOPS): most of the chip’s arithmetic budget goes into MAC hardware.[1]

Params — Compute volume · parameter count

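The number of learnable weights in a model. It determines storage and memory footprint, and it does not by itself determine compute volume: for a dense model per-token compute scales with the parameter count, but as the MoE example below shows, the two can be fully decoupled.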

Example: Llama-3.1 405B doesn’t fit on one card

In FP16 each parameter is 2 bytes, so 405B parameters require $405\times10^9\times2 \approx 810\ \text{GB}$ for weights alone. Compare with the table above: an 80GB A100/H100 is off by a factor of ten; even the 192GB B200 needs 5 cards and the 288GB Rubin needs 3 just to hold the weights — and that’s before KV cache and activations. Params directly determine “at least how many cards you need,” and have nothing to do with how much it can compute per second.
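
The card counts above are a ceiling division of weight bytes by per-card capacity. A minimal sketch, counting weights only (no KV cache, activations, or framework overhead) and using decimal gigabytes as the text does; the function name is illustrative.

```python
import math


def min_cards_for_weights(n_params: float, bytes_per_param: float, card_gb: float) -> int:
    """Smallest number of cards whose combined memory holds the weights alone."""
    weight_gb = n_params * bytes_per_param / 1e9
    return math.ceil(weight_gb / card_gb)


# Llama-3.1 405B in FP16 (2 bytes per parameter) against the capacities in the spec table.
for name, capacity_gb in [("A100/H100", 80), ("B200", 192), ("Rubin", 288)]:
    print(name, min_cards_for_weights(405e9, 2, capacity_gb))
```

In practice the real count is higher, since parallelism schemes usually want a power-of-two number of cards and the remaining memory has to cover everything that is not weights.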

Example: DeepSeek-V3 — parameter count ≠ per-token compute

MoE (mixture-of-experts) models fully decouple Params from compute volume. DeepSeek-V3 has 671B total parameters but activates only about 37B per token. So its memory footprint is sized by 671B (it all has to fit), while its per-token compute (that $2N$ ) is sized by only 37B. Estimating its per-token compute from the total parameter count overestimates it by about 18× (671 ÷ 37).

Example: weights aren’t the only thing eating capacity — KV cache

Besides Params, memory also holds the KV cache (every request, every already-generated token must be cached). With long context and large batches, the KV cache can balloon to the same order of magnitude as the weights, directly squeezing the usable batch. So Params set the floor and KV cache sets how much concurrency you can still stuff in — together they eat capacity.

FLOPS — Compute throughput · floating-point operations per second

Divide the “count” above by time and you enter the world of compute throughput. The core distinction: uppercase S = per Second. $\text{FLOPS} = \text{FLOPs} \div \text{time}$ .

Example: how long does GPT-3 training actually take

Divide the task’s FLOPs by the cluster’s throughput to get the ideal runtime. GPT-3’s $3.15\times10^{23}$ FLOPs, on 1024 A100s (312 TFLOPS each) at 100% utilization:

$$T = \frac{3.15\times10^{23}}{1024\times3.12\times10^{14}} \approx 9.9\times10^5\ \text{s} \approx 11\ \text{days}$$

But real MFU is only thirty or forty percent (see below), so in practice it’s on the order of a month. This is the standard way FLOPs (the task) and FLOPS (the throughput) are used together.
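
The same division, with MFU as an explicit knob, can be sketched as follows. The 35% figure is just a midpoint of the "thirty or forty percent" range mentioned above, and the function name is illustrative.

```python
SECONDS_PER_DAY = 86400


def training_days(total_flops: float, n_cards: int,
                  peak_flops_per_card: float, mfu: float = 1.0) -> float:
    """Wall-clock days = task FLOPs / (cluster peak FLOPS * utilization)."""
    seconds = total_flops / (n_cards * peak_flops_per_card * mfu)
    return seconds / SECONDS_PER_DAY


ideal = training_days(3.15e23, 1024, 312e12)                 # 100% of peak
realistic = training_days(3.15e23, 1024, 312e12, mfu=0.35)   # assumed 35% MFU

print(f"ideal:     {ideal:.1f} days")
print(f"realistic: {realistic:.1f} days")
```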

Example: the same task, theoretical time on three generations

Running the same $10^{19}$ FLOPs training task, the theoretical single-card full-load time shrinks in inverse proportion to peak throughput: A100 (312 TFLOPS) about 8.9 hours, H100 (990) about 2.8 hours, B200 (2250) about 1.2 hours. The roughly 7× gap across three generations comes almost entirely from the rise in Tensor Core peak, provided the workload can really saturate the compute, which is exactly the question Roofline answers.

Peak FLOPS — Compute throughput · theoretical peak

The throughput ceiling a vendor quotes. When you look at this number, always keep an eye on two things: which precision, and whether sparsity doubling is included.

Example: the precision ladder — five peaks for the same H100

On the same chip, from TF32 downward each step down in precision roughly doubles the peak. On a single H100 (dense Tensor Core figures): FP64 about 67 TFLOPS, TF32 about 495, FP16/BF16 990, FP8 about 1979 TFLOPS. So “what’s the H100’s throughput” has no single answer; you must first ask which precision. Low-precision training/inference is the mainline precisely because each halving of precision roughly doubles the usable peak.[3]

Example: the sparsity-doubling trap — where B200’s “9000” comes from

Marketing material often prints the figure with 2:4 structured sparsity turned on, which is double the dense number. For the B200, the 9000 TFLOPS FP4 in the table above is the dense figure, consistent with the precision ladder from 2250 (BF16) through about 4500 (FP8); with sparsity the same chip is quoted at about 18000 TFLOPS. Only workloads that satisfy the sparsity pattern can claim the sparse peak; dense matrix multiplication gets half of it. Comparing a sparse peak against someone else’s dense peak comes with a built-in 2× inflation.

Example: cross-generation peak — 2380× in ten years

Zoom out: from P100 (2016) to Rubin R100 (2026), AI peak throughput grew about 2380×, while general-purpose CUDA Core FP32 throughput over the same period grew only about 10×.[1] The scissor gap between these two curves shows that the decade’s compute explosion happened almost entirely along the “specialized matrix multiply + low precision” line, not because general-purpose floating point got faster.

TOPS — Compute throughput · trillion operations per second

Structurally identical to FLOPS, but “Operations” usually means integer operations (especially INT8), most common in the spec sheets of inference accelerators and edge chips. Use FLOPS for floating point and TOPS for fixed-point/integer; with the unit changed, don’t compare magnitudes directly.

Example: datacenter INT8 — A100 to B200

The A100 is rated at 624 TOPS (INT8), the H100 about 1979 TOPS (INT8 dense), and the B200 higher still. INT8 quantized inference is very cost-effective in scenarios less sensitive to precision such as classification, retrieval, and recommendation — trading an integer pipeline for nearly double the throughput.

Example: edge chips — Jetson Orin and Thor

Edge spec sheets look almost exclusively at TOPS. The Jetson AGX Orin is rated at 275 TOPS (INT8), and 2025’s Jetson Thor jumps to roughly 2070 TFLOPS (FP4), bringing the datacenter’s low-precision path down to robotics and automotive. Note that Orin reports TOPS (integer) while Thor reports FP4 TFLOPS (floating point); the conventions differ, so the two numbers are not directly comparable.

Memory bandwidth — Memory access · Memory Bandwidth

How many bytes per second can move between memory and the compute units. LLM autoregressive decode is almost always stuck on this: to generate each token you must read the entire model’s weights from memory once, which is pure data movement — no matter how fast you compute, you have to wait for the data to arrive.

Example: single-stream 70B decode, three generations of bandwidth set the ceiling

Generating one token requires reading the weights at least once. An FP16 70B model’s weights are 140GB; dividing that by bandwidth gives the theoretical lower bound per token:

GPU	Bandwidth	Reading 140GB of weights	Ceiling tok/s
A100	2.0 TB/s	70 ms	~14
H100	3.35 TB/s	42 ms	~24
B200	8.0 TB/s	17.5 ms	~57

This column of numbers is determined entirely by bandwidth and has nothing whatsoever to do with peak throughput (312 / 990 / 2250 TFLOPS) — A100 to B200’s roughly 4× decode speedup is owed entirely to bandwidth growing 4×.
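
The table is one division per row. A minimal sketch, assuming weights are the only traffic (KV cache and activations ignored) and using the bandwidths from the spec table; names are illustrative.

```python
def decode_ceiling_tok_per_s(n_params: float, bytes_per_param: float,
                             bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode: one full weight read per token."""
    weight_bytes = n_params * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes


BANDWIDTH = {"A100": 2.0e12, "H100": 3.35e12, "B200": 8.0e12}  # bytes/s

for name, bw in BANDWIDTH.items():
    tok_s = decode_ceiling_tok_per_s(70e9, 2, bw)   # 70B model, FP16
    print(f"{name}: {1000 / tok_s:.1f} ms/token, ceiling {tok_s:.1f} tok/s")
```

Peak TFLOPS never appears in the function, which is the point: the only inputs are bytes per token and bytes per second.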

Example: why quantization speeds up decode

Since the decode bottleneck is “the number of bytes read for the weights,” shrinking the weights directly speeds it up. Take the 70B above from FP16 to FP8, and the weights drop from 140GB to 70GB; the bytes read per token are halved, and decode throughput nearly doubles — this has nothing to do with compute, it’s purely moving half as much data. FP4 halves it again. This is the core value of low-precision inference under the bandwidth wall.

Memory capacity — Memory access · Memory Capacity

How large a model plus batch/KV cache can fit. It decides “feasible / infeasible,” not fast or slow.

Example: the largest dense model each generation fits on one card

Roughly figuring on FP16 (2 bytes/parameter), memory capacity directly frames the weight ceiling a single card can hold: the 80GB A100/H100 ≈ 40B parameters, the 192GB B200 ≈ 96B, the 288GB Rubin ≈ 144B. Anything bigger must be split across cards with tensor parallelism — when capacity isn’t enough, no amount of throughput or bandwidth is even on the table.

Example: long context drains capacity

Capacity isn’t only for weights. A 70B model at 128K context with a large batch can easily have the KV cache eat tens of GB, competing for the same memory as the weights. This is why B200/Rubin pushed single-card capacity from 80GB all the way to 192/288GB — in the era of long context plus high concurrency, capacity itself is a product strength.
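
To put a number on "tens of GB", here is a sketch using the usual KV cache sizing formula (one K and one V tensor per layer, per token). The model shape below is an assumption for illustration, a Llama-style 70B with grouped-query attention (80 layers, 8 KV heads, head dimension 128); it is not a figure from this article.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    """KV cache size; the leading 2 accounts for storing both K and V."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Assumed shape: 80 layers, 8 KV heads, head_dim 128, FP16 cache, 128K context.
one_seq = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                         seq_len=128 * 1024, batch=1)

print(f"1 sequence:  {one_seq / 1e9:.1f} GB")
print(f"4 sequences: {4 * one_seq / 1e9:.1f} GB")
```

Under these assumptions a handful of concurrent long-context requests already rivals the 140GB of FP16 weights, which is the squeeze on batch size the text describes.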

Bytes moved — Memory access · Bytes

How many bytes an operator actually reads and writes to complete its task (input + weights + output). It isn’t a KPI on its own; rather, paired with FLOPs, it yields the next section’s most crucial quantity, “arithmetic intensity.”

Example: the difference in bytes moved between decode and prefill

The same 70B model: decoding one token is a matrix-vector product (GEMV), reading 140GB of weights but doing only about $1.4\times10^{11}$ FLOPs; prefilling 1000 tokens is a matrix-matrix product (GEMM), still reading those 140GB of weights (reused across 1000 tokens) but doing about $1.4\times10^{14}$ FLOPs. The bytes moved are almost the same, the compute differs by 1000× — which is exactly why one is memory-bound and the other compute-bound.
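
Dividing the two FLOP counts by the same byte count makes the contrast explicit. A rough sketch that counts only weight traffic (activations and KV cache are ignored, which is an approximation); names are illustrative.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte moved from memory."""
    return flops / bytes_moved


N_PARAMS = 70e9
WEIGHT_BYTES = N_PARAMS * 2          # FP16: 2 bytes per parameter = 140 GB

decode_ai = arithmetic_intensity(2 * N_PARAMS * 1, WEIGHT_BYTES)      # 1 token
prefill_ai = arithmetic_intensity(2 * N_PARAMS * 1000, WEIGHT_BYTES)  # 1000 tokens

print(f"decode:  {decode_ai:.0f} FLOP/Byte")
print(f"prefill: {prefill_ai:.0f} FLOP/Byte")
```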

Arithmetic intensity — Efficiency · Arithmetic Intensity

A minimal but extremely useful definition:

$$\text{Arithmetic intensity} = \frac{\text{FLOPs}}{\text{bytes moved}} \quad (\text{FLOP/Byte})$$

Its physical meaning is “how many operations can be amortized over each byte moved from memory.” High intensity → one data fetch keeps you computing for a long time → compute-bound; low intensity → most of the time is spent waiting on data → bandwidth-bound.

Example: the two extremes of GEMM and GEMV

In a large matrix multiply (GEMM), each element is reused $O(N)$ times, so arithmetic intensity reaches hundreds or even thousands of FLOP/Byte; whereas LLM decode is GEMV, where each weight is used only once, and at batch=1 the arithmetic intensity drops to ~1–2 FLOP/Byte. On the same card, one saturates compute and the other is locked by bandwidth, the difference being only this one ratio.

Example: batch is the knob for raising arithmetic intensity

At decode batch=1 each weight serves one request (intensity ~1); at batch=256 the same weights, read in once, serve 256 requests, and arithmetic intensity rises to ~256 FLOP/Byte. That moves the workload from the far left of the bandwidth roof up to the neighborhood of the ridge point: past it on the A100 (312 ÷ 2.0 ≈ 156), just short of it on the H100 (990 ÷ 3.35 ≈ 295) and B200 (2250 ÷ 8.0 ≈ 281). This is the fundamental reason inference serving works so hard to amass batch: not to save compute, but to give otherwise-idle compute something to do.
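
A small sketch of the batch knob: with only weight traffic counted, decode intensity is 2 FLOPs per weight per request divided by the bytes per weight, so for FP16 it is simply the batch size. Comparing it to peak ÷ bandwidth says which side of the ridge a given batch lands on. Function names are illustrative, and KV cache traffic is deliberately ignored.

```python
def decode_intensity(batch: int, bytes_per_param: float = 2.0) -> float:
    """FLOP/Byte for batched decode, counting only weight reads."""
    return 2.0 * batch / bytes_per_param


def is_compute_bound(batch: int, peak_flops: float, bandwidth: float,
                     bytes_per_param: float = 2.0) -> bool:
    """True when intensity exceeds the ridge point (peak / bandwidth)."""
    ridge = peak_flops / bandwidth
    return decode_intensity(batch, bytes_per_param) > ridge


# BF16 dense peaks and bandwidths from the spec table.
print(is_compute_bound(1, 990e12, 3.35e12))     # H100, batch 1
print(is_compute_bound(256, 312e12, 2.0e12))    # A100, batch 256
print(is_compute_bound(256, 990e12, 3.35e12))   # H100, batch 256
```

Under these assumptions batch 256 crosses the ridge on the A100 but still falls slightly short on the H100, so the batch needed to saturate compute grows with each generation's peak-to-bandwidth ratio.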

Roofline — Efficiency · the Roofline model

Plot arithmetic intensity (x-axis) against attainable throughput (y-axis) on a single log-log chart and you get a “roof” curve:

$$\text{Attainable throughput} = \min\big(\,\text{peak throughput},\ \ \text{bandwidth}\times\text{arithmetic intensity}\,\big)$$

The left segment is the slope-1 bandwidth roof, the right segment the horizontal compute roof. Where the two meet is the ridge point, whose x-coordinate = peak throughput ÷ bandwidth. A workload to the left of the ridge is memory-bound, to the right compute-bound.
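
The roof is a one-line function. The sketch below evaluates it for the three flagships at an intensity of 1 (single-stream decode) and 1000 (a large matrix multiply), using the BF16 dense peaks and bandwidths from the spec table; names are illustrative.

```python
def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/Byte) where the two roofs meet."""
    return peak_flops / bandwidth


def attainable_flops(peak_flops: float, bandwidth: float, intensity: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops, bandwidth * intensity)


GPUS = {  # (peak FLOPS, bandwidth in bytes/s)
    "A100": (312e12, 2.0e12),
    "H100": (990e12, 3.35e12),
    "B200": (2250e12, 8.0e12),
}

for name, (peak, bw) in GPUS.items():
    decode = attainable_flops(peak, bw, 1.0)
    gemm = attainable_flops(peak, bw, 1000.0)
    print(f"{name}: ridge {ridge_point(peak, bw):.0f} FLOP/Byte, "
          f"decode {decode / 1e12:.2f} TFLOPS ({decode / peak:.1%} of peak), "
          f"GEMM {gemm / 1e12:.0f} TFLOPS")
```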

Rooflines for three flagship generations. The compute roof rises each generation (312 → 990 → 2250 TFLOPS), and the sloped bandwidth roof moves up with it; the ridge point always lands in the narrow 150–295 FLOP/Byte band. Single-stream LLM decode (≈1) is forever at the far left, hugging the sloped roof, while large matrix multiplies hug the compute roof at the right.

Example: why three generations’ ridge points stay steady at 150–300

Ridge point = peak throughput ÷ bandwidth. A100: 312 ÷ 2.0 ≈ 156; H100: 990 ÷ 3.35 ≈ 295; B200: 2250 ÷ 8.0 ≈ 281. Across three generations throughput grew about 7× and bandwidth 4×, yet after dividing, the ridge point moved by less than 2× and stayed in the same band, because vendors scale throughput and bandwidth together. This implies a plain but brutal conclusion: as long as your workload’s arithmetic intensity is well below the ridge point (roughly 150 to 300 on these cards), switching to any newer card still leaves it memory-bound, and the speedup tracks bandwidth, not peak throughput.

Example: LLM decode is forever at the left end of the roofline

Decode’s arithmetic intensity is ~1, far below any generation’s ridge point (150–295), so it sits at the far left of all three sloped roofs, its ceiling forever bandwidth. This explains at the root why decode optimization all revolves around “read fewer weights / reuse weights more”: quantization (read less), MoE (activate only part), speculative decoding (produce several tokens from one forward pass), and growing the batch (reuse) — not one of them is about piling on compute.

GPU utilization — Efficiency · Utilization

It’s that percentage in nvidia-smi, and it only means “was a kernel running on the GPU during this period.”

Example: the illusion of 99% utilization

Running LLM decode, nvidia-smi sits at 99% year-round, looking like the card is wrung dry. But even if a kernel uses only 5% of the compute units and spends the rest of the time waiting on memory, as long as it’s running, utilization shows 100%. Decode is memory-bound, the GPU spends most of its time waiting on HBM, and effective throughput may be only a few percent of peak — this 99% is a famously misleading number. To judge “how much compute is actually used,” look at MFU below.

MFU / HFU — Efficiency · Model / Hardware FLOPs Utilization

The genuinely reliable efficiency metric for LLM training:

$$\text{MFU} = \frac{\text{measured effective throughput}}{\text{hardware peak throughput}} = \frac{6ND \,/\, T}{\text{Peak FLOPS} \times \text{number of cards}}$$
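
The formula translates directly into code. In the sketch below the 30-day wall-clock time is a hypothetical input chosen for illustration, not a reported result; the other numbers are the GPT-3-scale figures used earlier in the article.

```python
def mfu(n_params: float, n_tokens: float, wall_seconds: float,
        n_cards: int, peak_flops_per_card: float) -> float:
    """Model FLOPs Utilization: useful 6ND throughput over cluster peak."""
    achieved_flops_per_s = 6.0 * n_params * n_tokens / wall_seconds
    return achieved_flops_per_s / (n_cards * peak_flops_per_card)


# Hypothetical run: 175B parameters, 300B tokens, 1024 A100s, 30 days.
value = mfu(175e9, 300e9, 30 * 86400, 1024, 312e12)
print(f"MFU = {value:.1%}")
```

Note that the numerator uses only model-required FLOPs; counting recomputation or other extra work would turn this into HFU instead.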

Example: 50% counts as excellent — PaLM and Megatron

Reaching 50% MFU is already quite good. Google’s PaLM 540B training reported about 46% MFU;[4] in industry, large-model training with Megatron on A100/H100 clusters generally lands at 40%–55%. Where did the other half go? Communication, pipeline bubbles, memory-access stalls, non-matmul operators… all the losses Roofline and utilization already discussed. When you see someone report “90% utilization,” first ask whether it’s nvidia-smi utilization or MFU — they differ by an order of magnitude.

Example: the difference between MFU and HFU — recomputation

MFU counts only the useful floating-point operations required by the model’s forward and backward (using $6ND$ in the numerator); HFU also counts extra operations like activation recomputation (recomputing a forward pass to save memory). So HFU ≥ MFU: for the same training run, HFU might be 57% while MFU is only 48%, the gap being the “wasted” work of recomputation.[5] When comparing different works, always confirm which one is being reported.

Latency / throughput — Deployment · Latency / Throughput

Latency is the response time of a single request, throughput is the volume processed per unit time. The two pull against each other through batch size.

Growing the batch: throughput first rises quickly then saturates (weight reads get amortized, gradually turning compute-bound), while latency stays flat at small batches and rises steeply past the sweet spot. An online service, under an SLA latency constraint, pushes the batch up into that “sweet spot” where throughput hasn’t yet saturated and latency hasn’t yet run away.

Example: the give-and-take of batch=1 vs. batch=32

Continuing the earlier H100 70B decode example: at batch=1 it’s about 24 tokens/s with 42 ms/token latency; raise the batch to 32 and the weights are read once to serve 32 requests, so total throughput can climb to hundreds of tokens/s, but each request waits for a batch to assemble and queues longer, so latency goes up. Growing the batch trades latency for throughput — and vice versa.
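
A toy model of one batched decode step makes the trade-off concrete: the step takes as long as the slower of reading the weights once and doing the batch's arithmetic. This is a simplified sketch that ignores KV cache reads, queueing, and kernel overheads, so real numbers will be lower; names are illustrative.

```python
def decode_step(batch: int, n_params: float, bytes_per_param: float,
                peak_flops: float, bandwidth: float) -> tuple[float, float]:
    """Return (seconds per step, total tokens per second) for batched decode."""
    t_memory = n_params * bytes_per_param / bandwidth   # read the weights once
    t_compute = 2.0 * n_params * batch / peak_flops     # 2N FLOPs per token
    t_step = max(t_memory, t_compute)
    return t_step, batch / t_step


# 70B FP16 model on an H100 (990 TFLOPS, 3.35 TB/s).
for b in (1, 32, 256, 1024):
    t_step, tok_s = decode_step(b, 70e9, 2, 990e12, 3.35e12)
    print(f"batch {b:4d}: {t_step * 1000:6.1f} ms/step, {tok_s:7.0f} tok/s total")
```

In this model the step time stays at the weight-read time until the compute term overtakes it, after which throughput flattens and per-token latency grows with batch, matching the shape described above.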

Example: TTFT and TPOT — the two latency segments of a request

LLM serving splits latency into two segments: TTFT (time to first token, set by prefill, compute-bound) and TPOT (time per output token, set by decode, memory-bound). These two have different bottlenecks and different optimization levers — prefill is about compute and parallelism, decode about bandwidth and batch. Lumping them into one “latency” to optimize often pushes in the wrong direction.

Efficiency / cost-efficiency — Deployment · Perf/Watt, Perf/$

How much throughput a watt buys, and how much throughput a dollar buys. At large-scale deployment these two matter far more than peak throughput itself — the datacenter’s bottleneck is usually power and cooling, not affording the cards.

Example: three generations’ efficiency — TFLOPS/W

Using the table’s peaks and TDPs to roughly figure throughput per watt (BF16 dense): A100 312 TFLOPS ÷ 400W ≈ 0.78 TFLOPS/W; H100 990 ÷ 700 ≈ 1.41; B200 2250 ÷ 1000 ≈ 2.25. Three generations nearly tripled efficiency — and this is the math a datacenter actually does: for the same power budget, how much more effective throughput the B200 produces.
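
The same back-of-envelope division in code, using the BF16 dense peaks and TDPs from the spec table. TDP is used here as a stand-in for actual power draw, which is an approximation; names are illustrative.

```python
SPECS = {  # (BF16 dense peak in TFLOPS, TDP in watts)
    "A100": (312, 400),
    "H100": (990, 700),
    "B200": (2250, 1000),
}


def tflops_per_watt(peak_tflops: float, tdp_watts: float) -> float:
    """Peak throughput per watt of rated board power."""
    return peak_tflops / tdp_watts


efficiency = {name: tflops_per_watt(*spec) for name, spec in SPECS.items()}
for name, value in efficiency.items():
    print(f"{name}: {value:.2f} TFLOPS/W")
print(f"B200 vs A100: {efficiency['B200'] / efficiency['A100']:.2f}x")
```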

Example: why hyperscale clusters only look at perf/$ and perf/W

When you’re deploying tens of thousands of cards, the gap between 1.4 and 2.25 TFLOPS/W translates directly into millions per year in electricity, and into whether it even fits the existing facility’s power and cooling budget. At that point no one is staring at the biggest peak number anymore — within total cost of ownership (TCO), electricity and rack density are often more fatal than a card’s unit price.

Tying it together in one sentence — a single causal chain

These metrics are really one causal chain: how big the task is (FLOPs / MACs / Params) → how fast and large the machine is (FLOPS / bandwidth / capacity) → comparing the two, where the bottleneck is (arithmetic intensity / Roofline) → what fraction is actually used (MFU, not nvidia-smi utilization) → the felt performance and cost once live (latency / throughput / efficiency).

Remember the three things most likely to trip you up: lowercase FLOPs is a count, uppercase FLOPS is a rate; 1 MAC ≈ 2 FLOPs; and LLM decode, which dominates inference serving, is bound by bandwidth rather than compute (training and prefill sit on the compute side). So first see which side of the Roofline your workload’s arithmetic intensity lands on, then decide where to put the optimization effort.

References

[1] NVIDIA. NVIDIA H100 Tensor Core GPU Architecture / Blackwell Architecture Technical Brief. White Papers, 2022–2024. See also the spec data compiled in this site’s “A Decade of GPU Architecture Evolution” article.
[2] Kaplan, Jared, et al. Scaling Laws for Neural Language Models. arXiv:2001.08361, 2020.
[3] Micikevicius, Paulius, et al. Mixed Precision Training. ICLR, 2018.
[4] Chowdhery, Aakanksha, et al. PaLM: Scaling Language Modeling with Pathways. arXiv:2204.02311, 2022.
[5] Korthikanti, Vijay, et al. Reducing Activation Recomputation in Large Transformer Models. MLSys, 2023.
[6] Williams, Samuel, Andrew Waterman, and David Patterson. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Communications of the ACM, 2009.

2026 · 06 · 08