The Compute Metrics Landscape: From FLOPs to MFU, Every Number Explained Through Three Generations of Flagship GPUs
The easiest trap when talking about performance is treating a handful of nearly-identical-looking words as synonyms: FLOP, FLOPs, FLOPS, MAC, TOPS… and they differ by far more than a hair. The lowercase-s FLOPs is a “count” (the amount of work); the uppercase-S FLOPS is a “count per second” (compute throughput) — one describes how big a task is, the other how fast a machine is, with a time dimension separating them. Mix the two up and you end up with absurdities like “this card does 2250 TFLOPS, my model is 175 GFLOPs, so it finishes in 0.08 ms” (the real world is several orders of magnitude slower).
Below I lay them out in five categories — compute volume, compute throughput, memory access, efficiency, and deployment — with multiple worked examples per metric, and wherever possible cross-referenced against the last three generations of datacenter flagships (Ampere A100, Hopper H100, Blackwell B200, plus 2026’s Rubin), so the numbers can be compared side by side and read more clearly. Let’s start with a panoramic table.
| Category | Metric | Meaning | Unit | Key reminder |
|---|---|---|---|---|
| Compute volume | FLOP / FLOPs | Total floating-point operations (work) | count (GFLOPs/TFLOPs) | lowercase s = count, don’t confuse with throughput |
| Compute volume | MAC / MACs | Multiply-accumulate count, the core DL op | count | 1 MAC ≈ 2 FLOPs |
| Compute volume | Params | Number of model parameters (weights) | count (M/B) | drives memory footprint, ≠ compute volume |
| Compute throughput | FLOPS | Floating-point operations per second (throughput) | TFLOPS | uppercase S = per Second |
| Compute throughput | Peak FLOPS | Hardware’s theoretical throughput ceiling | TFLOPS | tightly tied to precision, watch for sparsity doubling |
| Compute throughput | TOPS | Trillion operations per second, usually integer | TOPS | common for INT8 inference / edge chips |
| Memory access | Memory bandwidth | Memory read/write speed | GB/s, TB/s | LLM decode is usually stuck on this |
| Memory access | Memory capacity | How large a model/batch fits | GB | decides feasibility |
| Memory access | Bytes moved | Bytes an operator reads/writes | Bytes | used together with arithmetic intensity |
| Efficiency | Arithmetic intensity | FLOPs ÷ Bytes, ops per byte | FLOP/Byte | tells whether the bottleneck is compute or bandwidth |
| Efficiency | Roofline | Analytical model of arithmetic intensity vs. attainable throughput | — | distinguishes compute- / memory-bound |
| Efficiency | GPU utilization | Fraction of time a kernel is running | % | does not mean compute is saturated, easily misread |
| Efficiency | MFU / HFU | Measured effective throughput ÷ peak throughput | % | the reliable metric for LLM training, 50% is already good |
| Deployment | Latency / throughput | Single-response time / processed volume per unit time | ms / req·s⁻¹ | larger batch: throughput↑ latency↑ |
| Deployment | Efficiency / cost-efficiency | Throughput per watt, throughput per dollar | TFLOPS/W etc. | matters more than peak for large-scale deployment |
To give every worked example below a single common yardstick, let me first list the key specs of these generations of flagships together (FP16/BF16 are dense Tensor Core throughput):[1]
| Generation | Flagship | BF16 dense peak | Lowest-precision peak | Memory | Memory bandwidth | TDP |
|---|---|---|---|---|---|---|
| Ampere (2020) | A100 | 312 TFLOPS | 624 TOPS (INT8) | 80 GB HBM2e | 2.0 TB/s | 400 W |
| Hopper (2022) | H100 | 990 TFLOPS | 1979 TFLOPS (FP8) | 80 GB HBM3 | 3.35 TB/s | 700 W |
| Blackwell (2024) | B200 | 2250 TFLOPS | 9000 TFLOPS (FP4) | 192 GB HBM3e | 8.0 TB/s | 1000 W |
| Rubin (2026) | R100 | ~8000 TFLOPS | ~50000 TFLOPS (FP4) | 288 GB HBM4 | ~13 TB/s | — |
FLOP / FLOPs — Compute volume · floating-point operation count
One floating-point addition and one floating-point multiplication each count as one FLOP. The total number of floating-point operations the whole network performs in a single forward pass is its FLOPs. Note that it is an absolute count, with no notion of time — it depends only on the model and the input, not on what hardware you use.
Example: a single CNN forward pass — count it directly with a formula
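For a convolutional layer (output feature map $H_{out} \times W_{out}$, input/output channels $C_{in}$ / $C_{out}$, kernel $K \times K$):
$$\text{MACs} = H_{out} \cdot W_{out} \cdot C_{in} \cdot C_{out} \cdot K^{2}, \qquad \text{FLOPs} \approx 2 \times \text{MACs}$$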
ResNet-50 at 224×224 input is about 4.1 GMACs ≈ 8.2 GFLOPs. But many papers write “ResNet-50 ≈ 4 GFLOPs” — what they actually report is MACs, while calling it FLOPs (the next section untangles this confusion specifically).
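As a rough sketch of the per-layer counting rule, the snippet below multiplies out output size, channel counts, and kernel area. The function names are illustrative and not from any library, bias terms and padding effects are ignored, and the example layer shape (7×7 kernel, 3 to 64 channels, 112×112 output map) is an assumed first-layer configuration chosen for illustration.

```python
def conv2d_macs(h_out: int, w_out: int, c_in: int, c_out: int, k: int) -> int:
    """MACs for a dense k x k convolution (bias and padding effects ignored)."""
    return h_out * w_out * c_in * c_out * k * k


def macs_to_flops(macs: int) -> int:
    """One MAC is one multiply plus one add, so it counts as two FLOPs."""
    return 2 * macs


# Assumed example layer: 7x7 kernel, 3 -> 64 channels, 112x112 output map.
stem_macs = conv2d_macs(h_out=112, w_out=112, c_in=3, c_out=64, k=7)
stem_flops = macs_to_flops(stem_macs)
print(f"{stem_macs / 1e6:.1f} MMACs = {stem_flops / 1e6:.1f} MFLOPs")
```

Summing this over every layer of a network gives the whole-model figure; which of the two numbers a paper quotes is exactly the MACs-versus-FLOPs question.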
Example: Transformer training — the 6ND rule of thumb
The total compute for training a large model has an extremely handy empirical estimate: $C \approx 6ND$, where $N$ is the parameter count and $D$ is the number of training tokens (2 for the forward pass, 4 for the backward, 6 in total).[2] Training GPT-3 175B on 300B tokens:
$$C \approx 6 \times 175 \times 10^{9} \times 300 \times 10^{9} \approx 3.15 \times 10^{23}\ \text{FLOPs}$$
Training Llama-3.1 405B on roughly 15.6T tokens gives about $3.8 \times 10^{25}$ FLOPs, two orders of magnitude larger than GPT-3. This absolute number carries no time by itself; you have to divide by throughput to know how long it takes, as in the FLOPS section below.
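The 6ND estimate is simple enough to check in a few lines. This sketch takes the rule exactly as stated (6 FLOPs per parameter per token) and uses a hypothetical helper name, `training_flops`, with the parameter and token counts quoted above.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: 2ND forward + 4ND backward = 6ND."""
    return 6.0 * n_params * n_tokens


gpt3 = training_flops(175e9, 300e9)
llama = training_flops(405e9, 15.6e12)
print(f"GPT-3 175B:     {gpt3:.2e} FLOPs")
print(f"Llama-3.1 405B: {llama:.2e} FLOPs ({llama / gpt3:.0f}x GPT-3)")
```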
Example: prefill for a single inference — 2N per token
At inference, the compute for one forward pass is about $2N$ FLOPs/token ($N$ being the parameter count). A 70B model processing a 1000-token prompt (the prefill stage) does roughly $2 \times 70 \times 10^{9} \times 1000 = 1.4 \times 10^{14}$ FLOPs. Later this number, together with “bytes moved,” is used to compute the arithmetic intensity of prefill.
MAC / MACs — Compute volume · multiply-accumulate
The core operation of deep learning is “multiply once, then add once” (a*b+c), called one MAC (Multiply-Accumulate), and in hardware it is often done by a single FMA instruction.
Example: the 1 MAC ≈ 2 FLOPs conversion
One MAC contains one multiply and one add, so 1 MAC ≈ 2 FLOPs. MobileNetV1 is nominally about 569 MMACs, which converts to ~1.14 GFLOPs. If you take that 569M figure and compare it directly against GFLOPs reported elsewhere, you’d mistakenly think it’s half its actual size.
Example: the “4G vs. 8G” mystery in papers
When academia reports model cost, MACs and FLOPs are frequently mixed up: for ResNet-50 you’ll see both a “4.1G” and an “8.2G” version, differing by exactly this factor of 2 — the former is MACs, the latter FLOPs. Some papers also use the term “Mult-Adds,” which again means MACs. Confirm the convention before reading a number, otherwise every cross-comparison is wrong.
Example: Tensor Cores are essentially MAC arrays
Why do hardware vendors pile up “multiply-accumulate” units instead of general-purpose floating point? Because matrix multiplication is a massive amount of MAC. A Tensor Core is a dedicated matrix unit purpose-built for $D = A \times B + C$ on small matrix tiles, swallowing the multiply-accumulate of an entire tile in a single operation. This is why the H100’s Tensor Core FP16 throughput (990 TFLOPS) is nearly 15× its CUDA Core FP32 (67 TFLOPS): the silicon budget goes into MAC rather than general-purpose compute.[1]
Params — Compute volume · parameter count
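The number of learnable weights in a model. It directly determines storage and memory footprint; compute volume is a separate quantity. For dense models the two move together (about $2N$ FLOPs per token), but as the MoE example below shows, one cannot always be inferred from the other.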
Example: Llama-3.1 405B doesn’t fit on one card
In FP16 each parameter is 2 bytes, so 405B parameters require $405 \times 10^{9} \times 2\ \text{bytes} = 810$ GB for weights alone. Compare with the table above: an 80GB A100/H100 is off by a factor of ten; even the 192GB B200 needs 5 cards and the 288GB Rubin needs 3 just to hold the weights, and that’s before KV cache and activations. Params directly determine “at least how many cards you need,” and have nothing to do with how much it can compute per second.
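The card-count arithmetic can be sketched as a ceiling division of weight bytes by per-card capacity. The helper names are hypothetical, capacities are the nominal figures from the spec table (1 GB taken as 10^9 bytes), and KV cache and activations are deliberately left out, so these are lower bounds.

```python
import math


def weight_bytes(n_params: float, bytes_per_param: float = 2.0) -> float:
    """Bytes needed for the weights alone (FP16/BF16 by default)."""
    return n_params * bytes_per_param


def min_cards(n_params: float, card_gb: float, bytes_per_param: float = 2.0) -> int:
    """Fewest cards whose combined memory holds the weights, nothing else."""
    return math.ceil(weight_bytes(n_params, bytes_per_param) / (card_gb * 1e9))


for name, gb in {"A100/H100": 80, "B200": 192, "Rubin": 288}.items():
    print(f"{name:>9} ({gb} GB): at least {min_cards(405e9, gb)} cards")
```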
Example: DeepSeek-V3 — parameter count ≠ per-token compute
MoE (mixture-of-experts) models fully decouple Params from compute volume. DeepSeek-V3 has 671B total parameters but activates only about 37B per token. So its memory footprint is sized by 671B (it all has to fit), while its per-token compute (that $2N$) is sized by only 37B. Estimating its per-token compute from the total parameter count overestimates the work by about 18× (671 ÷ 37).
Example: weights aren’t the only thing eating capacity — KV cache
Besides Params, memory also holds the KV cache (every request, every already-generated token must be cached). With long context and large batches, the KV cache can balloon to the same order of magnitude as the weights, directly squeezing the usable batch. So Params set the floor and KV cache sets how much concurrency you can still stuff in — together they eat capacity.
FLOPS — Compute throughput · floating-point operations per second
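Divide the “count” above by time and you enter the world of compute throughput. The core distinction: uppercase S = per Second. $\text{FLOPS} = \text{FLOPs} \div \text{seconds}$, or equivalently $\text{time} = \text{FLOPs} \div \text{FLOPS}$.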
Example: how long does GPT-3 training actually take
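Divide the task’s FLOPs by the cluster’s throughput to get the ideal runtime. GPT-3’s $3.15 \times 10^{23}$ FLOPs, on 1024 A100s (312 TFLOPS each) at 100% utilization:
$$t = \frac{3.15 \times 10^{23}}{1024 \times 312 \times 10^{12}} \approx 9.9 \times 10^{5}\ \text{s} \approx 11.4\ \text{days}$$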
But real MFU is only thirty or forty percent (see below), so in practice it’s on the order of a month. This is the standard way FLOPs (the task) and FLOPS (the throughput) are used together.
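The division of task FLOPs by cluster FLOPS can be sketched as below. The function name and the `mfu` parameter are illustrative; the 40% value is an assumed utilization in the range the text mentions, not a measured figure.

```python
def training_days(total_flops: float, n_gpus: int,
                  peak_flops_per_gpu: float, mfu: float = 1.0) -> float:
    """Wall-clock days = FLOPs / (cluster peak FLOPS * fraction actually achieved)."""
    seconds = total_flops / (n_gpus * peak_flops_per_gpu * mfu)
    return seconds / 86_400


gpt3_flops = 6.0 * 175e9 * 300e9             # 6ND estimate, about 3.15e23
ideal = training_days(gpt3_flops, 1024, 312e12)
realistic = training_days(gpt3_flops, 1024, 312e12, mfu=0.40)  # assumed MFU
print(f"ideal: {ideal:.1f} days, at 40% MFU: {realistic:.1f} days")
```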
Example: the same task, theoretical time on three generations
Running the same $10^{19}$-FLOP training task, the theoretical single-card full-load time shrinks linearly with peak throughput: A100 (312 TFLOPS) about 8.9 hours, H100 (990) about 2.8 hours, B200 (2250) about 1.2 hours. The roughly 7× gap across three generations comes almost entirely from the rise in Tensor Core peak, provided the workload can really saturate the compute, which is exactly the question Roofline answers.
Peak FLOPS — Compute throughput · theoretical peak
The throughput ceiling a vendor quotes. When you look at this number, always keep an eye on two things: which precision, and whether sparsity doubling is included.
Example: the precision ladder — five peaks for the same H100
On the same chip, each step down in Tensor Core precision roughly doubles the peak. On a single H100: FP64 about 67 TFLOPS, FP32 (CUDA Core) about 67, TF32 about 495, FP16/BF16 990, FP8 about 1979 TFLOPS. So “what’s the H100’s throughput” has no single answer; you must first ask which precision. Low-precision training/inference is the mainline precisely because it directly doubles the usable peak.[3]
Example: the sparsity-doubling trap — where B200’s “9000” comes from
The B200’s 9000 TFLOPS FP4 figure in the table above is the dense number: two precision halvings up from 2250 TFLOPS at BF16. Marketing material often prints twice that, about 18000 TFLOPS, which is the figure with 2:4 structured sparsity turned on. Only workloads that satisfy the sparsity pattern can claim the sparse peak; dense matrix multiplication gets half of it. Comparing a sparse peak against someone else’s dense peak comes with a built-in 2× inflation.
Example: cross-generation peak — 2380× in ten years
Zoom out: from P100 (2016) to Rubin R100 (2026), AI peak throughput grew about 2380×, while general-purpose CUDA Core FP32 throughput over the same period grew only about 10×.[1] The scissor gap between these two curves shows that the decade’s compute explosion happened almost entirely along the “specialized matrix multiply + low precision” line, not because general-purpose floating point got faster.
TOPS — Compute throughput · trillion operations per second
Structurally identical to FLOPS, but “Operations” usually means integer operations (especially INT8), most common in the spec sheets of inference accelerators and edge chips. Use FLOPS for floating point and TOPS for fixed-point/integer; with the unit changed, don’t compare magnitudes directly.
Example: datacenter INT8 — A100 to B200
The A100 is rated at 624 TOPS (INT8), the H100 about 1979 TOPS (INT8 dense), and the B200 higher still. INT8 quantized inference is very cost-effective in scenarios less sensitive to precision such as classification, retrieval, and recommendation — trading an integer pipeline for nearly double the throughput.
Example: edge chips — Jetson Orin and Thor
Edge spec sheets look almost exclusively at TOPS. The Jetson AGX Orin is rated at 275 TOPS (INT8), and 2025’s Jetson Thor jumps to roughly 2070 TFLOPS (FP4), bringing the datacenter’s low-precision path down to robotics and automotive. Note that Orin reports TOPS (integer) while Thor reports FP4 TFLOPS (floating point); the conventions differ, so the two numbers are not directly comparable.
Memory bandwidth — Memory access · Memory Bandwidth
How many bytes per second can move between memory and the compute units. LLM autoregressive decode is almost always stuck on this: to generate each token you must read the entire model’s weights from memory once, which is pure data movement — no matter how fast you compute, you have to wait for the data to arrive.
Example: single-stream 70B decode, three generations of bandwidth set the ceiling
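Generating one token requires reading the weights at least once. An FP16 70B model’s weights are 140GB (more than a single 80GB A100 or H100 holds, so capacity is set aside here and only bandwidth is considered); dividing that by bandwidth gives the theoretical lower bound per token: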
| GPU | Bandwidth | Reading 140GB of weights | Ceiling tok/s |
|---|---|---|---|
| A100 | 2.0 TB/s | 70 ms | ~14 |
| H100 | 3.35 TB/s | 42 ms | ~24 |
| B200 | 8.0 TB/s | 17.5 ms | ~57 |
This column of numbers is determined entirely by bandwidth and has nothing whatsoever to do with peak throughput (312 / 990 / 2250 TFLOPS) — A100 to B200’s roughly 4× decode speedup is owed entirely to bandwidth growing 4×.
Example: why quantization speeds up decode
Since the decode bottleneck is “the number of bytes read for the weights,” shrinking the weights directly speeds it up. Take the 70B above from FP16 to FP8, and the weights drop from 140GB to 70GB; the bytes read per token are halved, and decode throughput nearly doubles — this has nothing to do with compute, it’s purely moving half as much data. FP4 halves it again. This is the core value of low-precision inference under the bandwidth wall.
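A small sketch of the bandwidth ceiling, parameterized by bytes per parameter so the quantization effect falls out directly. It assumes the only traffic per token is one full read of the weights (no KV cache, no activations) and uses the bandwidth figures from the table above; the function name is hypothetical.

```python
BANDWIDTH_TBPS = {"A100": 2.0, "H100": 3.35, "B200": 8.0}


def decode_ceiling_tok_per_s(n_params: float, bytes_per_param: float,
                             bandwidth_tbps: float) -> float:
    """Upper bound on single-stream decode: one full weight read per token."""
    seconds_per_token = n_params * bytes_per_param / (bandwidth_tbps * 1e12)
    return 1.0 / seconds_per_token


for precision, nbytes in (("FP16", 2.0), ("FP8", 1.0), ("FP4", 0.5)):
    row = ", ".join(
        f"{gpu} ~{decode_ceiling_tok_per_s(70e9, nbytes, bw):.0f} tok/s"
        for gpu, bw in BANDWIDTH_TBPS.items()
    )
    print(f"{precision}: {row}")
```

Peak FLOPS never appears in the function, which is the point: under this model only bytes and bandwidth set the ceiling.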
Memory capacity — Memory access · Memory Capacity
How large a model plus batch/KV cache can fit. It decides “feasible / infeasible,” not fast or slow.
Example: the largest dense model each generation fits on one card
Roughly figuring on FP16 (2 bytes/parameter), memory capacity directly frames the weight ceiling a single card can hold: the 80GB A100/H100 ≈ 40B parameters, the 192GB B200 ≈ 96B, the 288GB Rubin ≈ 144B. Anything bigger must be split across cards with tensor parallelism — when capacity isn’t enough, no amount of throughput or bandwidth is even on the table.
Example: long context drains capacity
Capacity isn’t only for weights. A 70B model at 128K context with a large batch can easily have the KV cache eat tens of GB, competing for the same memory as the weights. This is why B200/Rubin pushed single-card capacity from 80GB all the way to 192/288GB — in the era of long context plus high concurrency, capacity itself is a product strength.
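To put a number on “tens of GB,” here is a back-of-the-envelope KV cache estimate. The model shape (80 layers, 8 KV heads, head dimension 128) is an assumed configuration for a 70B-class model with grouped-query attention, used purely for illustration; real models differ, and the function name is hypothetical.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    """Keys and values (factor 2) cached for every layer, head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Assumed shape: 80 layers, 8 KV heads, head_dim 128, FP16 cache.
one_seq = kv_cache_bytes(80, 8, 128, seq_len=128 * 1024, batch=1)
print(f"128K context, batch 1: {one_seq / 1e9:.1f} GB")
print(f"128K context, batch 4: {4 * one_seq / 1e9:.1f} GB")
```

Under these assumptions a single 128K-token sequence already takes about 43 GB, and the cost scales linearly with batch size.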
Bytes moved — Memory access · Bytes
How many bytes an operator actually reads and writes to complete its task (input + weights + output). It isn’t a KPI on its own; rather, paired with FLOPs, it yields the next section’s most crucial quantity, “arithmetic intensity.”
Example: the difference in bytes moved between decode and prefill
The same 70B model: decoding one token is a matrix-vector product (GEMV), reading 140GB of weights but doing only about $1.4 \times 10^{11}$ FLOPs; prefilling 1000 tokens is a matrix-matrix product (GEMM), still reading those 140GB of weights (reused across 1000 tokens) but doing about $1.4 \times 10^{14}$ FLOPs. The bytes moved are almost the same, the compute differs by 1000×, which is exactly why one is memory-bound and the other compute-bound.
Arithmetic intensity — Efficiency · Arithmetic Intensity
A minimal but extremely useful definition:
$$\text{arithmetic intensity} = \frac{\text{FLOPs}}{\text{bytes moved}}$$
Its physical meaning is “how many operations can be amortized over each byte moved from memory.” High intensity → one data fetch keeps you computing for a long time → compute-bound; low intensity → most of the time is spent waiting on data → bandwidth-bound.
Example: the two extremes of GEMM and GEMV
In a large matrix multiply (GEMM), each element is reused $O(N)$ times (for $N \times N$ operands), so arithmetic intensity reaches hundreds or even thousands of FLOP/Byte; whereas LLM decode is GEMV, where each weight is used only once, and at batch=1 the arithmetic intensity drops to ~1–2 FLOP/Byte. On the same card, one saturates compute and the other is locked by bandwidth, the difference being only this one ratio.
Example: batch is the knob for raising arithmetic intensity
At decode batch=1 each weight serves one request (intensity ~1); at batch=256 the same weights, read in once, serve 256 requests, and arithmetic intensity rises to roughly 256 FLOP/Byte. That moves the workload from deep in the bandwidth-bound region up to the neighborhood of the ridge point (past it on an A100, just short of it on an H100 or B200). This is the fundamental reason inference serving works so hard to amass batch: not to save compute, but to give otherwise-idle compute something to do.
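The batch effect can be made concrete for a single linear layer. This sketch counts FLOPs and bytes for multiplying a `batch × d_in` activation by a `d_in × d_out` weight matrix in FP16, assuming each operand and the output cross the memory bus exactly once; the 8192×8192 layer size and the function name are illustrative assumptions.

```python
def linear_layer_intensity(batch: int, d_in: int, d_out: int,
                           bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for (batch x d_in) @ (d_in x d_out), each tensor moved once."""
    flops = 2 * batch * d_in * d_out                   # one MAC = 2 FLOPs
    elems = batch * d_in + d_in * d_out + batch * d_out  # input + weights + output
    return flops / (bytes_per_elem * elems)


for batch in (1, 16, 256):
    ai = linear_layer_intensity(batch, 8192, 8192)
    print(f"batch={batch:>3}: {ai:6.1f} FLOP/Byte")
```

The result lands slightly under the batch size itself, because activations add to the bytes moved, but it stays in the same range as the rule of thumb.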
Roofline — Efficiency · the Roofline model
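Plot arithmetic intensity (x-axis) against attainable throughput (y-axis) on a single log-log chart and you get a “roof” curve:
$$\text{attainable FLOPS} = \min\left(\text{peak FLOPS},\ \text{arithmetic intensity} \times \text{memory bandwidth}\right)$$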
The left segment is the slope-1 bandwidth roof, the right segment the horizontal compute roof. Where the two meet is the ridge point, whose x-coordinate = peak throughput ÷ bandwidth. A workload to the left of the ridge is memory-bound, to the right compute-bound.
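The roof can be expressed in a few lines: attainable throughput is the smaller of the compute ceiling and intensity times bandwidth. This is a minimal sketch using H100 figures from the spec table; the function names are hypothetical, and units are TFLOPS and TB/s so that FLOP/Byte × TB/s gives TFLOPS.

```python
def attainable_tflops(intensity: float, peak_tflops: float, bandwidth_tbps: float) -> float:
    """Roofline: min(compute roof, bandwidth roof at this intensity)."""
    return min(peak_tflops, intensity * bandwidth_tbps)


def ridge_point(peak_tflops: float, bandwidth_tbps: float) -> float:
    """Intensity (FLOP/Byte) where the two roofs meet."""
    return peak_tflops / bandwidth_tbps


def bound(intensity: float, peak_tflops: float, bandwidth_tbps: float) -> str:
    ridge = ridge_point(peak_tflops, bandwidth_tbps)
    return "memory-bound" if intensity < ridge else "compute-bound"


H100 = (990.0, 3.35)  # BF16 dense peak in TFLOPS, bandwidth in TB/s
for ai in (1, 100, 1000):
    print(f"AI={ai:>4}: {attainable_tflops(ai, *H100):7.2f} TFLOPS, {bound(ai, *H100)}")
```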
Example: why three generations’ ridge points stay steady at 150–300
Ridge point = peak throughput ÷ bandwidth. A100: 312 ÷ 2.0 ≈ 156; H100: 990 ÷ 3.35 ≈ 295; B200: 2250 ÷ 8.0 ≈ 281. Across three generations throughput grew 7× and bandwidth grew 4×, yet after dividing, the ridge point stayed within a factor of two, because vendors add throughput and bandwidth together. This implies a plain but brutal conclusion: as long as your workload’s arithmetic intensity is below about 300, switching to any newer card still leaves it memory-bound, and the speedup tracks bandwidth, not peak throughput.
Example: LLM decode is forever at the left end of the roofline
Decode’s arithmetic intensity is ~1, far below any generation’s ridge point (150–295), so it sits at the far left of all three sloped roofs, its ceiling forever bandwidth. This explains at the root why decode optimization all revolves around “read fewer weights / reuse weights more”: quantization (read less), MoE (activate only part), speculative decoding (produce several tokens from one forward pass), and growing the batch (reuse) — not one of them is about piling on compute.
GPU utilization — Efficiency · Utilization
It’s that percentage in nvidia-smi, and it only means “was a kernel running on the GPU during this period.”
Example: the illusion of 99% utilization
Running LLM decode, nvidia-smi sits at 99% more or less constantly, looking like the card is wrung dry. But even if a kernel uses only 5% of the compute units and spends the rest of the time waiting on memory, as long as it’s running, utilization shows 100%. Decode is memory-bound, the GPU spends most of its time waiting on HBM, and effective throughput may be only a few percent of peak, which makes this 99% a famously misleading number. To judge “how much compute is actually used,” look at MFU below.
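To see how small the useful fraction can be, the sketch below estimates what share of peak FLOPS a bandwidth-bound decode step actually exercises. It assumes the step time is set purely by one read of the weights and that each token costs 2N FLOPs; the function name is hypothetical and the inputs are the H100 and 70B figures used earlier.

```python
def decode_fraction_of_peak(n_params: float, bytes_per_param: float,
                            bandwidth_bytes_per_s: float, peak_flops: float,
                            batch: int = 1) -> float:
    """Achieved FLOPS / peak FLOPS when step time is set by reading the weights."""
    seconds_per_step = n_params * bytes_per_param / bandwidth_bytes_per_s
    achieved_flops = 2.0 * n_params * batch / seconds_per_step
    return min(achieved_flops / peak_flops, 1.0)


frac_b1 = decode_fraction_of_peak(70e9, 2.0, 3.35e12, 990e12, batch=1)
frac_b8 = decode_fraction_of_peak(70e9, 2.0, 3.35e12, 990e12, batch=8)
print(f"batch=1: {frac_b1:.2%} of peak, batch=8: {frac_b8:.2%} of peak")
```

In this simplified model the card is “busy” the whole time, yet under 1% of peak compute is used at batch 1 and only a few percent at small batches.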
MFU / HFU — Efficiency · Model / Hardware FLOPs Utilization
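The genuinely reliable efficiency metric for LLM training:
$$\text{MFU} = \frac{\text{model FLOPs actually achieved per second}}{\text{hardware peak FLOPS}}$$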
Example: 50% counts as excellent — PaLM and Megatron
Reaching 50% MFU is already quite good. Google’s PaLM 540B training reported about 46% MFU;[4] in industry, large-model training with Megatron on A100/H100 clusters generally lands at 40%–55%. Where did the other half go? Communication, pipeline bubbles, memory-access stalls, non-matmul operators: all the losses that the Roofline and utilization sections already discussed. When you see someone report “90% utilization,” first ask whether it’s nvidia-smi utilization or MFU; the two measure different things and can differ by an order of magnitude.
Example: the difference between MFU and HFU — recomputation
MFU counts only the useful floating-point operations required by the model’s forward and backward passes (using $6ND$ in the numerator); HFU also counts extra operations like activation recomputation (recomputing a forward pass to save memory). So HFU ≥ MFU: for the same training run, HFU might be 57% while MFU is only 48%, the gap being the “wasted” work of recomputation.[5] When comparing different works, always confirm which one is being reported.
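A minimal sketch of the two ratios, computed from measured token throughput. MFU uses 6N FLOPs per token; HFU adds 2N for each fraction of the forward pass that is recomputed. The function names are hypothetical, and the 140,000 tokens/s throughput is an invented input for illustration, not a reported result.

```python
def mfu(tokens_per_s: float, n_params: float, n_gpus: int,
        peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization: only the 6N FLOPs/token the model requires."""
    return 6.0 * n_params * tokens_per_s / (n_gpus * peak_flops_per_gpu)


def hfu(tokens_per_s: float, n_params: float, n_gpus: int,
        peak_flops_per_gpu: float, recomputed_forward_fraction: float) -> float:
    """Hardware FLOPs Utilization: also counts the recomputed share of the forward pass."""
    executed_per_token = (6.0 + 2.0 * recomputed_forward_fraction) * n_params
    return executed_per_token * tokens_per_s / (n_gpus * peak_flops_per_gpu)


# Hypothetical run: 175B parameters on 1024 A100s at 140,000 tokens/s.
run = dict(tokens_per_s=140_000, n_params=175e9, n_gpus=1024, peak_flops_per_gpu=312e12)
m = mfu(**run)
h = hfu(**run, recomputed_forward_fraction=1.0)  # full activation recomputation
print(f"MFU = {m:.1%}, HFU = {h:.1%}")
```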
Latency / throughput — Deployment · Latency / Throughput
Latency is the response time of a single request, throughput is the volume processed per unit time. The two pull against each other through batch size.
Example: the give-and-take of batch=1 vs. batch=32
Continuing the earlier H100 70B decode example: at batch=1 it’s about 24 tokens/s with 42 ms/token latency; raise the batch to 32 and the weights are read once to serve 32 requests, so total throughput can climb to hundreds of tokens/s, but each request waits for a batch to assemble and queues longer, so latency goes up. Growing the batch trades latency for throughput — and vice versa.
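The trade-off can be sketched with a two-term model of one decode step: time to read the weights once, versus time to do the batch’s arithmetic, whichever is longer. This is a deliberately idealized assumption (no KV cache traffic, no queueing, no kernel overhead), with H100 and FP16 70B defaults taken from earlier sections; the function name is hypothetical.

```python
def decode_step(batch: int, n_params: float = 70e9, bytes_per_param: float = 2.0,
                bandwidth: float = 3.35e12, peak_flops: float = 990e12):
    """Return (seconds per step, total tokens/s) for one batched decode step."""
    t_memory = n_params * bytes_per_param / bandwidth   # read weights once
    t_compute = batch * 2.0 * n_params / peak_flops     # 2N FLOPs per token
    step_s = max(t_memory, t_compute)
    return step_s, batch / step_s


for batch in (1, 32, 512):
    step_s, tok_per_s = decode_step(batch)
    print(f"batch={batch:>3}: {step_s * 1e3:5.1f} ms/step, {tok_per_s:7.0f} tok/s total")
```

In this model, throughput scales almost for free until the compute term overtakes the memory term; past that point every added request lengthens the step, and queueing delay in a real server comes on top.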
Example: TTFT and TPOT — the two latency segments of a request
LLM serving splits latency into two segments: TTFT (time to first token, set by prefill, compute-bound) and TPOT (time per output token, set by decode, memory-bound). These two have different bottlenecks and different optimization levers — prefill is about compute and parallelism, decode about bandwidth and batch. Lumping them into one “latency” to optimize often pushes in the wrong direction.
Efficiency / cost-efficiency — Deployment · Perf/Watt, Perf/$
How much throughput a watt buys, and how much throughput a dollar buys. At large-scale deployment these two matter far more than peak throughput itself — the datacenter’s bottleneck is usually power and cooling, not affording the cards.
Example: three generations’ efficiency — TFLOPS/W
Using the table’s peaks and TDPs to roughly figure throughput per watt (BF16 dense): A100 312 TFLOPS ÷ 400W ≈ 0.78 TFLOPS/W; H100 990 ÷ 700 ≈ 1.41; B200 2250 ÷ 1000 ≈ 2.25. Three generations nearly tripled efficiency — and this is the math a datacenter actually does: for the same power budget, how much more effective throughput the B200 produces.
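The per-watt figures are a one-line division over the spec table. This sketch uses nameplate BF16 dense peak and TDP, which is an assumption in itself: sustained power and achieved throughput both differ from nameplate values in practice.

```python
# (BF16 dense peak in TFLOPS, TDP in watts), taken from the spec table above.
SPECS = {"A100": (312.0, 400.0), "H100": (990.0, 700.0), "B200": (2250.0, 1000.0)}


def tflops_per_watt(peak_tflops: float, tdp_watts: float) -> float:
    return peak_tflops / tdp_watts


perf_per_watt = {gpu: tflops_per_watt(*spec) for gpu, spec in SPECS.items()}
for gpu, value in perf_per_watt.items():
    print(f"{gpu}: {value:.2f} TFLOPS/W")
print(f"B200 vs A100: {perf_per_watt['B200'] / perf_per_watt['A100']:.2f}x")
```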
Example: why hyperscale clusters only look at perf/$ and perf/W
When you’re deploying tens of thousands of cards, the gap between 1.4 and 2.25 TFLOPS/W translates directly into millions per year in electricity, and into whether it even fits the existing facility’s power and cooling budget. At that point no one is staring at the biggest peak number anymore — within total cost of ownership (TCO), electricity and rack density are often more fatal than a card’s unit price.
Tying it together in one sentence — a single causal chain
These metrics are really one causal chain: how big the task is (FLOPs / MACs / Params) → how fast and large the machine is (FLOPS / bandwidth / capacity) → comparing the two, where the bottleneck is (arithmetic intensity / Roofline) → what fraction is actually used (MFU, not nvidia-smi utilization) → the felt performance and cost once live (latency / throughput / efficiency).
Remember the three things most likely to trip you up: lowercase FLOPs is a count, uppercase FLOPS is a rate; 1 MAC ≈ 2 FLOPs; and LLM decode, the phase that dominates most serving workloads, is bound by bandwidth rather than compute (prefill and training lean compute-bound), so first see which side of the Roofline your workload’s arithmetic intensity lands on, then decide where to put the optimization effort.
References
[1] NVIDIA. NVIDIA H100 Tensor Core GPU Architecture / Blackwell Architecture Technical Brief. White Papers, 2022–2024. See also the spec data compiled in this site’s “A Decade of GPU Architecture Evolution” article.
[2] Kaplan, Jared, et al. Scaling Laws for Neural Language Models. arXiv:2001.08361, 2020.
[3] Micikevicius, Paulius, et al. Mixed Precision Training. ICLR, 2018.
[4] Chowdhery, Aakanksha, et al. PaLM: Scaling Language Modeling with Pathways. arXiv:2204.02311, 2022.
[5] Korthikanti, Vijay, et al. Reducing Activation Recomputation in Large Transformer Models. MLSys, 2023.
[6] Williams, Samuel, Andrew Waterman, and David Patterson. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Communications of the ACM, 2009.