Low-Precision Data Formats in Large Language Models

If you had to summarize the hardware trend in LLM training and inference over the past three years in a single sentence, it would be this: training is moving from BF16 to FP8, and inference is moving from FP8 to FP4. Each time precision drops one notch, roughly twice as many MAC units fit in the same silicon area, so peak training throughput and inference tokens/s roughly double. Underneath this main line is a whole spectrum of data formats: FP32 / TF32 / FP16 / BF16 / FP8 (two variants) / FP6 (two variants) / FP4 (MX and NV versions) / INT8 / INT4.

But “precision moving down” is never as simple as “swap one format for another across the whole model.” Within a single Transformer block, matrix multiplication may run in FP8, Softmax in FP32, and the KV cache may be compressed to INT8 — precision is mixed across operation types, not across layers. To see this clearly, we need to lay out every format’s bit structure, dynamic range, hardware support, and actual position in a Transformer.

This article lays that map out in one place, covering only the formats with native Tensor Core / matrix-engine hardware support — pure software emulation, research-stage, or paper-only formats (INT2, binarization, posits, etc.) are out of scope.

Why It Matters — Not Just Saving Memory

A lot of people’s first reaction to low precision is “it saves memory.” That’s only half right. What really drives NVIDIA / AMD / Google to keep pushing down on precision is another fact: each Tensor Core generation roughly doubles throughput at lower precision.

| Architecture | Flagship | Peak throughput (primary format) | Peak throughput (lowest precision) |
|---|---|---|---|
| Ampere (2020) | A100 | 312 TF (FP16) | 624 TOPS (INT8) |
| Hopper (2022) | H100 | 990 TF (FP16) | 1979 TF (FP8) |
| Blackwell (2024) | B200 | 2250 TF (FP16) | 9000 TF (FP4) |
| Rubin (2026) | R100 | ~8000 TF (FP16) | 50,000 TF (FP4) |

Look at this table together — halve the bit width and the MAC count on the same silicon doubles; compute doubles directly.[3][4] So low precision is not just “the same model uses less memory,” it’s “the same silicon can run bigger models.” This is the fundamental driver behind NVIDIA / AMD / Intel collectively pushing into FP8 / FP4 over the last three years.
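The doubling rule can be checked against the table with a few lines of arithmetic. This is only a sketch: the figures are copied from the table (the Rubin FP16 number is approximate there), and the function name is made up for illustration.

```python
# Peak throughput figures copied from the table above (TFLOPS or TOPS).
# Each entry: (primary bits, primary throughput, lowest bits, lowest throughput)
chips = {
    "A100": (16, 312, 8, 624),
    "H100": (16, 990, 8, 1979),
    "B200": (16, 2250, 4, 9000),
    "R100": (16, 8000, 4, 50000),  # FP16 figure is approximate in the table
}

def speedup_vs_bit_ratio(primary_bits, primary_tput, lowest_bits, lowest_tput):
    """Return (measured speedup, speedup predicted by 'halve the bits, double the MACs')."""
    measured = lowest_tput / primary_tput
    predicted = primary_bits / lowest_bits
    return measured, predicted

for name, row in chips.items():
    measured, predicted = speedup_vs_bit_ratio(*row)
    print(f"{name}: measured {measured:.2f}x, predicted {predicted:.0f}x")
```

On these numbers the rule holds for Ampere, Hopper, and Blackwell. The Rubin row comes out above the pure bit-width prediction, which suggests that something beyond halving the bit width contributes there.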

Memory savings matter too — BF16 weights are half the size of FP32; INT4 quantized weights are a quarter of FP16. But if the dividend were memory only and not compute, the road wouldn’t go this far.

The Universal Floating-Point Representation — One Formula for All Formats

Any IEEE 754-style floating-point number consists of three parts:

$$\text{value} = (-1)^S \times 2^{E - \text{bias}} \times (1.M)$$

So the key parameters of a floating-point format are $(\text{total bits}, e, m)$. The exponent count $e$ determines the dynamic range; the mantissa count $m$ determines the precision. Every format below fits this template.
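To make the template concrete, here is a minimal decoder sketch that turns a raw bit pattern into a value for any (e, m) pair. The function name is hypothetical. It assumes the usual bias of 2^(e−1) − 1 and the usual subnormal rule (exponent field 0 means 0.M instead of 1.M), which the formula above does not spell out, and it deliberately ignores inf / NaN encodings.

```python
def decode_minifloat(bits, e_bits, m_bits):
    """Decode a sign/exponent/mantissa bit pattern; inf/NaN encodings are not handled."""
    sign = (bits >> (e_bits + m_bits)) & 1
    exp_field = (bits >> m_bits) & ((1 << e_bits) - 1)
    man_field = bits & ((1 << m_bits) - 1)
    bias = (1 << (e_bits - 1)) - 1
    fraction = man_field / (1 << m_bits)
    if exp_field == 0:
        # Subnormal: 0.M x 2^(1 - bias)
        magnitude = fraction * 2.0 ** (1 - bias)
    else:
        # Normal: 1.M x 2^(E - bias)
        magnitude = (1 + fraction) * 2.0 ** (exp_field - bias)
    return -magnitude if sign else magnitude

# All eight non-negative FP4 (E2M1) values
print([decode_minifloat(b, 2, 1) for b in range(8)])
# The same function decodes FP16 and BF16 bit patterns
print(decode_minifloat(0x3C00, 5, 10), decode_minifloat(0x3F80, 8, 7))
```

The same three fields and one bias are enough to decode every floating-point format discussed here; only the special-value conventions differ between them.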

Integer formats are another beast — no exponent at all, just signed or unsigned integers. Their range is linear, precision is uniform, but the representable range is fixed by the bit width. In practice, an integer format always pairs with an external scale factor that “stretches” or “shrinks” the integer values to the target numeric range. We’ll come back to that in the integer section.

Floating-Point Formats Side by Side — FP32 → FP4 at the Same Scale

First, a single chart with all nine floating-point formats aligned to the same scale — each cell is one bit. The exponent width determines dynamic range; the mantissa width determines precision. The trade-off between these two is the core design choice for every format:

[Figure: Floating-point bit layout at the same scale (1 cell = 1 bit; S = sign, E = exponent, M = mantissa). Layouts as S + E + M: FP32 1 + 8 + 23 · TF32 1 + 8 + 10 · FP16 1 + 5 + 10 · BF16 1 + 8 + 7 · FP8 E4M3 1 + 4 + 3 · FP8 E5M2 1 + 5 + 2 · FP6 E3M2 1 + 3 + 2 · FP6 E2M3 1 + 2 + 3 · FP4 E2M1 1 + 2 + 1]
Nine floating-point formats aligned at the same scale (1 cell = 1 bit). FP32 / FP16 are from IEEE 754, TF32 from NVIDIA, BF16 from Google Brain, FP8 jointly defined by NVIDIA / Arm / Intel, FP6 / FP4 from the OCP microscaling spec. BF16 shares 8 exponent bits with FP32, giving it the same dynamic range but far less precision (7 mantissa bits instead of 23). The two FP8 variants are two trade-offs at the same width: E4M3 favours precision and is used in the forward pass, E5M2 favours range and is used in the backward pass. FP4 (E2M1) has only 4 bits and is effectively unusable on its own; it must be deployed inside a block-scaled format such as MXFP4 or NVFP4.

FP32 / TF32 — The Baseline and an Engineering Trick — Same 8-bit Exponent, Different Mantissa

FP32 (1 + 8 + 23 = 32 bits) is the baseline for all modern CPUs / GPUs. 8-bit exponent, bias = 127, dynamic range ~$1.2 \times 10^{-38}$ to $3.4 \times 10^{38}$, ~7 decimal digits of precision.

TF32 (1 + 8 + 10 = 19 bits, stored in a 32-bit register) is an engineering trick NVIDIA introduced on A100: the exponent matches FP32 exactly (8 bits), and the mantissa is cut down to match FP16 (10 bits). Despite the name, its effective width is only 19 bits. The win: the dynamic range matches FP32, so no loss scaling is needed, and the shorter mantissa (~3 decimal digits of precision) lets the Tensor Cores run significantly faster. On A100 / H100 / Blackwell, libraries can route FP32 GEMMs through TF32 Tensor Cores without any change to model code, although whether this is enabled by default depends on the framework and version.[4]

FP16 / BF16 — Two 16-bit Formats for the Training Era — One Favours Precision, One Favours Range

FP16 (1 + 5 + 10 = 16 bits) is IEEE 754 half-precision: 5-bit exponent, bias = 15, dynamic range ~$6 \times 10^{-5}$ to $6.5 \times 10^{4}$, 3–4 decimal digits of precision. For training, this range is dangerously narrow: gradients below about $6 \times 10^{-5}$ already lose precision as subnormals, and anything much below $10^{-7}$ underflows to zero. So FP16 training must be paired with loss scaling (multiplying the loss by a constant before backpropagation so the gradients land at a representable magnitude, then dividing the factor back out before the optimizer step).
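The underflow and the loss-scaling fix can be reproduced with nothing more than Python’s `struct` module, whose `"e"` format code packs IEEE 754 half precision. The gradient value and the scale of 1024 below are made-up numbers for illustration, not values from any particular training recipe.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

tiny_grad = 1e-8       # hypothetical gradient magnitude
loss_scale = 1024.0    # hypothetical scale; a power of two keeps the scaling itself exact

unscaled = to_fp16(tiny_grad)               # too small for FP16: becomes 0.0
scaled = to_fp16(tiny_grad * loss_scale)    # survives as an FP16 subnormal
recovered = scaled / loss_scale             # unscale in higher precision

print(unscaled, scaled, recovered)
```

The unscaled gradient is lost entirely, while the scaled one comes back within a fraction of a percent of the original once the scale is divided out in higher precision.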

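BF16 (1 + 8 + 7 = 16 bits) was introduced by Google Brain on TPU v2. The design idea is straightforward: chop 16 bits off the FP32 mantissa and keep all 8 exponent bits. As a result, BF16 keeps the same dynamic range as FP32 and works as a drop-in replacement for it in training, at the cost of a much shorter mantissa.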

This is why, after 2020, large-model training overwhelmingly chose BF16 over FP16 — LLM training is far more sensitive to dynamic range than to precision. That single lesson is the direct prototype for the FP8 design that followed.[6]
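The “chop 16 bits off FP32” idea is simple enough to show directly. The sketch below converts by plain truncation, which is an assumption made for brevity (real hardware typically rounds to nearest), and compares the result with FP16 on one ordinary value and one very small value. The function names are illustrative.

```python
import struct

def fp32_to_bf16_bits(x):
    """Keep the top 16 bits of the FP32 encoding (truncation, no rounding)."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16

def bf16_bits_to_float(bits16):
    """Expand BF16 back to FP32 by appending 16 zero mantissa bits."""
    (value,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return value

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

for x in (3.14159, 1e-30):
    as_bf16 = bf16_bits_to_float(fp32_to_bf16_bits(x))
    print(f"x={x}  bf16={as_bf16}  fp16={to_fp16(x)}")
```

For the ordinary value FP16 is the more precise of the two, but the tiny value survives only in BF16. That is the range-over-precision trade the paragraph above describes.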

FP8 — Two Variants With a Division of Labour — E4M3 Forward · E5M2 Backward

FP8 was jointly proposed by NVIDIA / Arm / Intel in 2022[2]; it has native support on H100 / MI300 / Gaudi 2/3. It defines two variants simultaneously:

| Variant | Bit layout | Dynamic range | Precision | Use |
|---|---|---|---|---|
| E4M3 | 1+4+3 | max $\pm 448$ · min normal $2^{-6}$ | higher | Forward: weights / activations |
| E5M2 | 1+5+2 | max $\pm 57344$ · min normal $2^{-14}$ | lower | Backward: gradients |

Why two variants? Because activations and gradients have very different dynamic ranges. Activations, after passing through a LayerNorm, usually cluster in a tight range — well-suited to E4M3’s “precision-heavy, range-light” trade-off. Gradients can span many orders of magnitude, requiring E5M2’s “range-heavy, precision-light” design.

Note that E4M3 doesn’t strictly follow IEEE 754 — it drops inf, repurposing the inf encoding to extend the numeric range (max value 448 instead of 240). E5M2 strictly follows IEEE 754, with both inf and NaN.

FP8 scaling: 8 bits alone are nowhere near enough to cover the dynamic range encountered in training. So FP8 is almost always paired with an FP32 per-tensor scaling factor (auto-maintained by H100’s Transformer Engine)[7]; the real value being represented is $\text{FP8\_value} \times \text{scale}$. We’ll come back to this when discussing DeepSeek-V3’s finer-grained version.
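A minimal sketch of per-tensor scaling, under stated assumptions: the scale is chosen so the largest magnitude in the tensor maps to 448, rounding is round-to-nearest with saturation, and NaN handling is omitted. This is an illustration of the idea, not how Transformer Engine implements it; the function names and the sample tensor are hypothetical.

```python
import math

E4M3_MAX = 448.0

def round_to_e4m3(x):
    """Round to the nearest E4M3 value, saturating at +/-448 (NaN handling omitted)."""
    if x == 0.0:
        return 0.0
    e_bits, m_bits = 4, 3
    bias = 2 ** (e_bits - 1) - 1
    # floor(log2|x|), clamped so tiny inputs fall on the subnormal grid
    exponent = max(math.frexp(abs(x))[1] - 1, 1 - bias)
    step = 2.0 ** (exponent - m_bits)   # spacing between neighbouring values
    magnitude = min(round(abs(x) / step) * step, E4M3_MAX)
    return math.copysign(magnitude, x)

def quantize_per_tensor(values):
    """Return (fp8_values, scale) such that value ~= fp8_value * scale."""
    amax = max(abs(v) for v in values)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale

tensor = [0.0012, -0.03, 0.5, 3.0]   # hypothetical activations
fp8_values, scale = quantize_per_tensor(tensor)
dequantized = [q * scale for q in fp8_values]

for original, restored in zip(tensor, dequantized):
    print(f"{original:>8} -> {restored:.6f}")
print("0.0012 without scaling ->", round_to_e4m3(0.0012))
```

With the scale, every element lands on the normal part of the E4M3 grid and keeps a few percent of relative accuracy. Without it, the smallest element falls into the subnormal range and is rounded to a value far from the original.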

FP6 and FP4 — The Microscaling Era — Unusable Alone · Must Share Block Scales

FP6 and FP4 are part of the OCP (Open Compute Project) Microscaling (MX) Format spec released in 2023[1]; Blackwell is the first generation of hardware to support them in Tensor Cores.[3]

FP6 has two variants: E3M2 (more range, less precision) and E2M3 (less range, more precision), chosen by use case.

FP4 has two flavours in Tensor Cores, both using the E2M1 bit layout, differing in scale granularity:

| Version | Bit layout | Block size | Block scale | Outer scale |
|---|---|---|---|---|
| MXFP4 (OCP) | E2M1 (4 bit) | 32 elements | E8M0 (8 bit, pure exponent) | none |
| NVFP4 (NVIDIA) | E2M1 (4 bit) | 16 elements | FP8 E4M3 (8 bit) | FP32 per-tensor |

Why microscaling? Because FP4 has only 4 bits — only 16 distinct values (including signed zeros). Without sharing a scale across a block, those 16 values can’t possibly cover the real distribution of model weights. MX turns “per-tensor scaling” into “per-block scaling” — every 32 numbers share an 8-bit exponent scale, building fine-grained quantization right into the hardware.

NVIDIA pushed this further with NVFP4 — block size shrunk to 16, block scale upgraded from pure-exponent E8M0 to the more precise FP8 (E4M3), with an additional FP32 per-tensor scale on top. The three-layer scale structure makes NVFP4 markedly more accurate than MXFP4, at the cost of slightly more metadata overhead. On Blackwell, NVFP4 has been measured to keep inference accuracy close to FP8 — the strongest evidence yet that FP4 has finally become a “usable format.”[3]
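The “slightly more metadata overhead” can be put in numbers. The helper below is a hypothetical illustration that simply spreads the shared block scale across the elements of the block; the single FP32 per-tensor scale of NVFP4 is ignored because it is amortized over the whole tensor.

```python
def bits_per_element(elem_bits, block_size, block_scale_bits):
    """Average storage per element, counting the shared block scale."""
    return elem_bits + block_scale_bits / block_size

mxfp4 = bits_per_element(4, 32, 8)   # 4 + 8/32
nvfp4 = bits_per_element(4, 16, 8)   # 4 + 8/16

print(f"MXFP4: {mxfp4} bits/element, NVFP4: {nvfp4} bits/element")
```

Under this accounting NVFP4 costs a quarter of a bit more per element than MXFP4, which is the price of halving the block size.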

Integer Formats INT8 / INT4 — No Exponent · Entirely Reliant on Scale

You can think of integer formats as a special case of floating-point — “all mantissa, zero exponent.” Precision is uniform, but dynamic range is entirely determined by an external scale. INT8 and INT4 dominate the consumer / deployment side of LLM inference.[8][14]

[Figure: INT8 vs INT4, values representable without an external scale (linear axis, not log). INT8, signed: 256 uniformly spaced values from −128 to +127; requires a scale and zero point to map to real values. INT4, signed: 16 values from −8 to +7; must use group-wise shared scales (one FP16 scale per 32 / 64 / 128 numbers).]
INT8 and INT4 value ranges — integer formats have no exponent; values are uniformly distributed over a fixed integer interval. INT4 has only 16 distinct values (from −8 to +7), and is completely unusable without an external scale. The core job of weight quantization algorithms like GPTQ and AWQ is to “find a set of group-wise scales that minimise the precision loss after quantization.”

The integer-to-float mapping in practice is:

$$x_{\text{fp}} \approx s \cdot (x_{\text{int}} - z)$$

where $s$ is an FP16 / FP32 scale and $z$ is the zero point (symmetric quantization has $z = 0$; asymmetric has $z \neq 0$). For INT8, per-tensor or per-channel scaling is typically enough; INT4 essentially demands group-wise scaling: every 32 / 64 / 128 numbers share a scale, otherwise precision loss is too steep. GPTQ[9] and AWQ[10] are at heart “find a set of group-wise scales” algorithms.[15]
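The effect of group-wise scales is easy to see on a toy example. The sketch below uses plain symmetric round-to-nearest quantization ($z = 0$), which is a baseline and not GPTQ or AWQ. The weight values and the group size of 4 are made up so that the example stays readable.

```python
def quantize_groupwise(values, group_size, bits=4):
    """Symmetric round-to-nearest quantization with one scale per group (z = 0)."""
    qmax = 2 ** (bits - 1) - 1     # 7 for INT4
    qmin = -(2 ** (bits - 1))      # -8 for INT4
    ints, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        amax = max(abs(v) for v in group)
        scale = amax / qmax if amax > 0 else 1.0
        scales.append(scale)
        ints.extend(max(qmin, min(qmax, round(v / scale))) for v in group)
    return ints, scales

def dequantize_groupwise(ints, scales, group_size):
    """Apply x_fp = s * x_int with the scale of the group each element belongs to."""
    return [q * scales[i // group_size] for i, q in enumerate(ints)]

# Hypothetical weights: small values first, large outliers in the second half
weights = [0.01, -0.02, 0.03, 0.012, 5.0, -4.0, 3.0, 2.0]

per_tensor = dequantize_groupwise(*quantize_groupwise(weights, 8), 8)
per_group = dequantize_groupwise(*quantize_groupwise(weights, 4), 4)

print("one scale for all 8:", per_tensor[:4])
print("one scale per 4    :", per_group[:4])
```

With a single scale the outliers dictate the step size and the four small weights all collapse to zero. With one scale per group of four they keep their sign and approximate magnitude.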

Dynamic Ranges Lined Up — Every Format on the Same Log Axis

Put every floating-point format’s dynamic range on the same log axis and the differences become instantly visible:

[Figure: Floating-point dynamic range on a log axis from 10⁻⁴⁰ to 10⁴⁰ (bar length = log₁₀(max) − log₁₀(min), colour-coded by bit width). FP32: 1.2e−38 → 3.4e38 · TF32: same as FP32 · BF16: same as FP32 · FP16: 6.1e−5 → 6.5e4 · FP8 E5M2: 6.1e−5 → 5.7e4 · FP8 E4M3: 2e−3 → 448 · FP6 E3M2: 0.06 → 28 · FP6 E2M3: 0.13 → 7.5 · FP4 E2M1: 0.5 → 6. BF16 / TF32 share an 8-bit exponent with FP32, so the three blue bars overlap exactly, which is why BF16 is a drop-in for FP32 in training.]
Dynamic range of nine floating-point formats (log axis). The three blue bars (FP32 / TF32 / BF16) have identical length because they share an 8-bit exponent; FP16 and FP8-E5M2 are close in range — both “range-heavy” designs; FP4’s bar is barely more than a dot — direct visual evidence of why it cannot work without a block-scale layer on top.
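The endpoints of each bar follow from the exponent and mantissa widths plus each format’s special-value convention. The sketch below derives them; the `specials` argument and its three labels are names invented here to capture the conventions described in this article (IEEE-style inf / NaN, the E4M3 exception, and FP6 / FP4 with no special values).

```python
def fp_range(e_bits, m_bits, specials="ieee"):
    """Return (max finite, min normal, min subnormal) for a sign/exponent/mantissa format.

    specials: "ieee"     -> top exponent reserved for inf/NaN (FP32, FP16, BF16, E5M2)
              "nan_only" -> only the all-ones pattern is NaN (FP8 E4M3)
              "none"     -> every encoding is a finite number (FP6, FP4)
    """
    bias = 2 ** (e_bits - 1) - 1
    top_exp = 2 ** e_bits - 1
    if specials == "ieee":
        max_exp, max_sig = top_exp - 1 - bias, 2 - 2.0 ** -m_bits
    elif specials == "nan_only":
        max_exp, max_sig = top_exp - bias, 2 - 2.0 ** (1 - m_bits)
    else:
        max_exp, max_sig = top_exp - bias, 2 - 2.0 ** -m_bits
    return max_sig * 2.0 ** max_exp, 2.0 ** (1 - bias), 2.0 ** (1 - bias - m_bits)

formats = {
    "FP32": (8, 23, "ieee"), "BF16": (8, 7, "ieee"), "FP16": (5, 10, "ieee"),
    "FP8 E5M2": (5, 2, "ieee"), "FP8 E4M3": (4, 3, "nan_only"),
    "FP6 E3M2": (3, 2, "none"), "FP6 E2M3": (2, 3, "none"), "FP4 E2M1": (2, 1, "none"),
}
for name, args in formats.items():
    max_val, min_normal, min_sub = fp_range(*args)
    print(f"{name:9s} max={max_val:.4g}  min normal={min_normal:.4g}  min subnormal={min_sub:.4g}")
```

Comparing the output with the chart, the lower bounds quoted for FP16 and FP8 E5M2 appear to be the smallest normal value, while those quoted for FP8 E4M3, FP6, and FP4 match the smallest subnormal. The bars for the narrow formats are therefore, if anything, drawn generously.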

A few observations worth pulling out:

Precision Allocation Inside a Transformer Block — Training vs Inference

A model is never “all one precision” — each component of a Transformer picks its own precision based on the operation’s characteristics. The chart below puts training and inference side by side:

[Figure: Precision inside a Transformer block, training vs inference. Every component picks its own precision by operation type.]

| Component | Training (forward + backward + optimizer) | Inference (forward only · weights frozen) |
|---|---|---|
| Weights / optimizer state | FP32 master weights + Adam m / v (≈ 4× param) | Static weights, offline-quantized and frozen: INT4 / FP4 + group-wise FP16 scale |
| Embedding | BF16 | BF16 |
| RMSNorm | FP32 | BF16 |
| Q / K / V Projection | FP8 in · FP32 acc | INT4 W · BF16 act |
| Q × Kᵀ | FP8 / BF16 | BF16 |
| Softmax | FP32 | BF16 |
| KV cache | not shown | FP8 / INT8 |
| Attention × V | FP8 / BF16 | BF16 |
| Output Projection | FP8-E4M3 | INT4 / FP4 |
| + Residual | FP32 acc | BF16 |
| FFN Linear 1 | FP8-E4M3 | INT4 / FP4 |
| SwiGLU | BF16 / FP32 | BF16 |
| FFN Linear 2 | FP8-E4M3 | INT4 / FP4 |
| + Residual | FP32 acc | BF16 |
| Backward · gradient GEMM | FP8-E5M2 (wide gradient range) | not applicable |
| Loss + Loss Scaling | FP32 | not applicable |
| LM Head · output logits | not shown | BF16 |
Precision allocation inside a Transformer block — training on the left, inference on the right. Training is essentially “two-tier” — all GEMMs use BF16 / FP8 input but accumulate in FP32, while every numerically sensitive norm / softmax / residual stays in FP32; on top of that, FP32 master weights and Adam optimizer state must be kept (BF16 + FP32 master + Adam m + Adam v ≈ 14 bytes/param). Inference goes to the other extreme — weights compress to INT4 / FP4, activations stay BF16, and the KV cache is separately compressed to FP8 / INT8.
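The bytes-per-parameter figure in the caption, and the corresponding figure for 4-bit inference weights, are simple sums. The sketch below works them out for a hypothetical 70B-parameter model with a group size of 128. Both numbers are illustrative choices, and the totals cover weights and optimizer state only (no activations, gradients, or KV cache).

```python
def training_bytes_per_param():
    """BF16 weights + FP32 master weights + Adam m + Adam v, as in the caption above."""
    return 2 + 4 + 4 + 4

def int4_bytes_per_param(group_size, scale_bytes=2):
    """4-bit weights plus one FP16 scale shared by each group."""
    return 0.5 + scale_bytes / group_size

params = 70e9  # hypothetical 70B-parameter model
train_gb = params * training_bytes_per_param() / 1e9
infer_gb = params * int4_bytes_per_param(128) / 1e9

print(f"training state : {train_gb:.0f} GB")
print(f"INT4 inference : {infer_gb:.1f} GB")
```

The gap between the two totals is why a model that needs a multi-node cluster to train can be served from a single accelerator once its weights are quantized.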

A few non-obvious details:

Microscaling MX — The Key to Making FP4 / FP6 Actually Usable

On the dynamic-range log axis above, FP4 / FP6’s bars shrink almost to a point. They can run on hardware only because of an external block-scaling layer.

The OCP MX spec defines a remarkably simple structure — every 32 numbers share an 8-bit exponent scale:[1]

Block size: 32 elements
Per element: FP4 (4 bit) or FP6 (6 bit) or FP8 (8 bit)
Block scale: E8M0 (8 bit, pure exponent, no sign, no mantissa)

E8M0 is a special “pure exponent” format: all 8 bits store an exponent, with no sign and no mantissa, representing a scaling factor of $2^E$. Each number in a block of 32 FP4 values carries an actual value of $\text{FP4\_value} \times 2^E$.

Why 32 elements per block, and why a pure-exponent scale? Engineering trade-offs — smaller blocks improve accuracy but burn more metadata; a pure-exponent scale turns scaling into a shift in hardware, which costs basically nothing. Blackwell’s FP4 Tensor Cores can run this format natively precisely because the block-scale decoder is built into the silicon.
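Here is a minimal sketch of how one block could be quantized in this style. The rule used to pick the shared exponent (the largest power of two that keeps the block maximum near the top of the FP4 grid) is a simple choice assumed for illustration, and the nearest-value rounding is the plainest possible. Real converters may differ in both respects, and the function name is hypothetical.

```python
import math

FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # non-negative E2M1 values
FP4_EMAX = 2                                                # largest value 6 = 1.5 * 2**2

def quantize_mx_block(block):
    """Return (shared_exp, fp4_values) with value ~= fp4_value * 2**shared_exp."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # floor(log2(amax)) minus the exponent of the largest FP4 value
    shared_exp = math.frexp(amax)[1] - 1 - FP4_EMAX
    shared_exp = max(-127, min(127, shared_exp))   # keep within an 8-bit exponent range
    scale = 2.0 ** shared_exp
    elements = []
    for v in block:
        target = abs(v) / scale
        # Nearest grid value; anything above 6 saturates to 6
        magnitude = min(FP4_MAGNITUDES, key=lambda g: abs(target - g))
        elements.append(math.copysign(magnitude, v))
    return shared_exp, elements

block = [1.6, -0.8, 0.3, 0.05] + [0.0] * 28   # hypothetical block of 32 values
shared_exp, fp4_values = quantize_mx_block(block)
dequantized = [q * 2.0 ** shared_exp for q in fp4_values]

print("shared exponent:", shared_exp)
print("first elements :", fp4_values[:4], "->", dequantized[:4])
```

Because the scale is a power of two, applying it is exact and amounts to an exponent adjustment, which matches the “shift in hardware” argument above. The example also shows the cost: the smallest element of the block is rounded away to zero, since the single shared scale is set by the largest one.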

NVIDIA went further with NVFP4 — block size cut to 16, block scale upgraded from pure-exponent E8M0 to the finer FP8 (E4M3), with an additional FP32 per-tensor scale on top. The three-layer scale structure gives NVFP4 markedly better effective precision than MXFP4, at slightly higher metadata cost. On Blackwell, NVFP4 inference accuracy has been measured to be close to FP8 — the key evidence that FP4 has finally become a “usable format.”[3][12]

Hardware Support Matrix — V100 → Rubin · TPU · MI300

Low precision can only “take off” with hardware acceleration — without native Tensor Core / matrix-engine support, low precision saves memory but doesn’t run any faster.[4] The matrix below shows, for each format and each accelerator generation, when it first got native support:

[Figure: Hardware support matrix, format × generation. Legend: solid green = native Tensor Core / matrix engine · semi-opaque = supported but not primary · empty circle = unsupported. Columns: NVIDIA V100 (2017), T4 (2018), A100 (2020), H100 (2022), B200 (2024), R100 (2026) · AMD MI300 (2023), MI350 (2025) · Google TPU v4 (2021), TPU v5 (2024). Rows: FP32 (IEEE 754) · TF32 (NVIDIA) · FP16 (IEEE 754) · BF16 (Google Brain) · FP8 (E4M3 / E5M2) · FP6 (MX microscaling) · FP4 (MXFP4 / NVFP4) · INT8 (fixed-point) · INT4 (fixed-point).]
Hardware support matrix for low-precision formats. The green diagonal descends from FP16 (universally supported by 2017) down to FP4 (appears only on Blackwell in 2024), roughly one row per two years, matching the “bit width halves every two years” hardware cadence. FP32 / FP16 / INT8 form the three “common base” rows; the further down you go, the more vendor- and generation-specific support becomes. The Google TPU columns are interesting: they skip the FP8 / FP4 line entirely, holding firm on BF16 + INT8.

A few patterns worth pulling out:

Rubin — No New Formats · Just Pushing FP4 to 50 PFLOPS

NVIDIA’s next-gen architecture Rubin (announced in 2024, shipping 2026) prompts a few observations worth recording:

Precision Choices in Mainstream LLMs — Llama 4 · DeepSeek-V3 · Gemini · Claude

Putting the major LLMs’ training and inference precisions side by side:

[Figure: Training and inference precision in mainstream LLMs, 2024-2026. Closed models’ training precision is inferred from hardware generation; open models have explicit technical reports.]

| Model | Vendor · Year | Training | Server Inference | Local / Quantized |
|---|---|---|---|---|
| Llama 2 | Meta · 2023 | BF16 | BF16 / FP16 | INT4 GPTQ |
| Llama 3 / 3.1 | Meta · 2024 | BF16 · 16K H100 | BF16 / FP8 | INT4 GPTQ / AWQ |
| Llama 4 | Meta · 2025 | FP8 · 32K H100 | FP8 | FP4 / INT4 |
| DeepSeek-V2 | DeepSeek · 2024 | BF16 | BF16 / FP8 | INT4 |
| DeepSeek-V3 | DeepSeek · 2024 | FP8 fine-grained | FP8 | INT4 / FP4 |
| Qwen 2.5 | Alibaba · 2024 | BF16 | BF16 / FP8 | INT4 AWQ |
| Mixtral 8x7B | Mistral · 2024 | BF16 | BF16 / FP8 | INT4 |
| Gemini 2 | Google · 2024 | BF16 (TPU) | BF16 | N/A |
| GPT-4 | OpenAI · 2023 | BF16 (assumed) | FP8 (assumed) | N/A |
| Claude 3 / 4 | Anthropic · 2024 | BF16 (assumed) | undisclosed | N/A |
Training and inference precision in mainstream LLMs (2024-2026). Green = BF16, orange = FP8, red = INT4 / FP4. The migration path is clear — the Llama series moved from BF16 training (Llama 3) to FP8 training (Llama 4); DeepSeek-V2 was still BF16 while V3 cracked FP8 fine-grained training. Gemini stays on BF16 because of its TPU lock-in. OpenAI / Anthropic don’t disclose training precision; the assumption is still BF16-dominant.

A few observations worth unpacking:

Summary — Training BF16 → FP8 · Inference FP8 → FP4

The current LLM data-format trend in one sentence: training is on the BF16 → FP8 road; inference is on the FP8 → FP4 road; and each landing depends on a new generation of Tensor Core hardware support.

Boiling the article down:

Threads to dig further on next time — DeepSeek-V3’s FP8 fine-grained scaling engineering details; NVFP4 vs MXFP4 accuracy comparison across workloads on Blackwell; the “software / hardware co-design” philosophy behind TPU’s BF16 stance.

References — Standards · Papers · Engineering Blogs

Standards and White Papers

Papers — Training Side

Papers — Inference Quantization

Engineering Blogs and Docs