A Decade of GPU Architecture Evolution and the Parallel Expansion of the CUDA Programming Model
If you had to summarize a decade of NVIDIA data-center GPU evolution in a single number, it would be 2380× — the AI-compute uplift from P100 (2016) to Rubin R100 (2026). Over the same period, CUDA core FP32 throughput grew only 10×. The two numbers are two orders of magnitude apart, and you can almost say: what NVIDIA actually did this decade was move general-purpose compute budget into Tensor Cores, then keep pushing mixed precision down to FP4.
But this wasn’t a matter of “swap the chip and you’re done.” A Tensor Core is not a faster CUDA core; it is a completely different circuit topology. To feed it, the SM gained TMA, TMEM, and cluster-shared SMEM; and so that programmers could actually drive it, the CUDA programming model inserted Warp and Cluster as two new layers into the Thread/Block/Grid hierarchy. So while the same torch.matmul(a, b) looks identical on V100, A100, H100, and B200, NVIDIA rewrote cuBLAS underneath for every single generation. This article explains hardware evolution and programming-model expansion together, because they are two threads of the same story.
Overview — A Decade Compressed into One Chart
Side-by-side: Seven Generations of Flagships — Pascal · Volta · Turing · Ampere · Hopper · Blackwell · Rubin
Seven generations of data-center GPUs side by side — every row is the physical embodiment of an NVIDIA strategic decision:
| Year | Architecture | Flagship | Process | Transistors | FP32 (CUDA Core) | Primary Tensor Core precision | Lowest Tensor Core precision | HBM | NVLink |
|---|---|---|---|---|---|---|---|---|---|
| 2016 | Pascal | P100 | 16 nm | 15 B | 10.6 TF | — | — (no TC) | 16 GB HBM2 | 160 GB/s |
| 2017 | Volta | V100 | 12 nm | 21 B | 15.7 TF | 125 TF (FP16) | 125 TF | 32 GB HBM2 | 300 GB/s |
| 2018 | Turing | T4 | 12 nm | 14 B | 8.1 TF | 65 TF (FP16) | 130 TOPS (INT8) | 16 GB GDDR6 | — |
| 2020 | Ampere | A100 | 7 nm | 54 B | 19.5 TF | 312 TF (FP16) | 624 TOPS (INT8) | 80 GB HBM2e | 600 GB/s |
| 2022 | Hopper | H100 | 4 nm | 80 B | 67 TF | 990 TF (FP16) | 1979 TF (FP8) | 80 GB HBM3 | 900 GB/s |
| 2024 | Blackwell | B200 | 4 nm × 2 die | 208 B | 80 TF | 2250 TF (FP16) | 9000 TF (FP4) | 192 GB HBM3e | 1.8 TB/s |
| 2026 | Rubin | R100 | 3 nm | 336 B | ~100 TF | ~8000 TF (FP16) | 50,000 TF (FP4) | 288 GB HBM4 | 3.6 TB/s |
Compare the FP32 column with the Tensor Core columns and you see a widening scissor gap: CUDA Core FP32 throughput grew from 10.6 to ~100 TFLOPS — about 10×; Tensor Core lowest-precision throughput went from nonexistent to 50,000 TFLOPS. The gap between these two curves is the central narrative of this article.
Where Did the Compute Uplift Come From — Geometric Stacking · General 10× · Specialized 200×
Decomposing the 2380×, the bulk is not “general-purpose compute grew faster,” but rather “several specialized multipliers around matrix multiplication multiplied together.” The chart below places four independent multipliers on the same log axis — the longer the bar, the larger the contribution:
The next question is: how can Tensor Cores rise this fast? How are they actually different from CUDA Cores? Why “scale them” instead of “scale CUDA Cores”? That requires drilling into the SM.
Inside the SM — Shared Front-end · Five Parallel Pipes
Five Parallel Execution Units — Not “Cores + Multiple Scheduling Modes”
An SM (Streaming Multiprocessor) is the physical “workshop” of a GPU; a B200 has 148 SMs. A common misconception is to think of an SM as “one kind of core with multiple scheduling modes” — in reality, an SM contains five completely independent execution units, placed in parallel, sharing the same front-end control and the same register file and SMEM.

The most commonly misread part of this image is the parallel relationship between CUDA Cores and Tensor Cores. A common wrong model is “the Tensor Core internally coordinates multiple CUDA Cores to perform matmul” — this is wrong. The Tensor Core has its own dedicated, matmul-specialized silicon; in a B200 SM, the Tensor Core occupies more silicon area than the CUDA Cores. The straightforward counterexample: CUDA Core FP32 throughput is 80 TFLOPS, while Tensor Core FP16 throughput is 2250 TFLOPS — 28× higher. If Tensor Cores really were “implemented by orchestrating CUDA Cores,” where would this 28× come from?
How the Warp Scheduler Dispatches Instructions — Five Physically Independent Pipes · Shared Front-end
A more accurate framing is — the warp scheduler dispatches instructions to different pipes based on the instruction type: fma.f32 goes to the CUDA Core pipe, mma.sync to the Tensor Core pipe, ex2.f32 to the SFU pipe, ld.global to the LD/ST pipe, cuda::memcpy_async to the TMA pipe. The five pipes are physically completely independent, sharing only the front-end fetch, decode, and register file. This is structurally identical to how AVX units, scalar ALUs, and AES engines sit in parallel inside a CPU.
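A minimal sketch of what this dispatch looks like from source code (hedged: the kernel and all names below are illustrative, and the pipe each line feeds is determined by the instruction the compiler emits for it, not by any explicit API):

// Illustrative only — assumes blockDim.x == 256 and n a multiple of 256.
__global__ void five_pipes_demo(const float* __restrict__ in, float* __restrict__ out, int n) {
    __shared__ float staging[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = in[i];                    // ld.global        -> LD/ST pipe
    float y = fmaf(x, 2.0f, 1.0f);      // fma.f32          -> CUDA Core pipe
    float z = __expf(y);                // ex2.approx.f32   -> SFU pipe
    staging[threadIdx.x] = z;           // st.shared        -> LD/ST pipe
    __syncthreads();
    // mma.sync / wgmma (Tensor Core pipe) and TMA bulk copies are warp-level operations,
    // shown in the matmul examples later in the article.
    out[i] = staging[threadIdx.x];
}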
CUDA Core vs Tensor Core Circuit Differences — FMA Pipeline vs Systolic Array
Drill to the circuit level and the differences are not “size” — they are topology.
Scalar FMA Pipeline — CUDA Core’s Circuit
A CUDA Core is a scalar FMA unit: a 24×24-bit integer multiplier + a 24-bit adder + pipeline registers, around 20-30K transistors. Throughput is 1 FMA per cycle (= 2 floating-point ops). 128 CUDA Cores per SM are 128 independent FMA circuits placed in parallel, each working independently.
2D PE Array — Tensor Core’s Circuit
A Tensor Core is a 2D multiply-accumulator array, drawing on the core idea of systolic arrays. Data streams in by time step, each PE (Processing Element) performs a multiply-accumulate, and results either pass to neighbors or accumulate locally:
Why a Few-Hundred-× Gap — Three Physical Reasons
Why does this topology bring a few-hundred-× uplift? Three physical reasons:
- Maximize data reuse — a row of A entering the array flows through multiple PEs, reused N times, magnifying bandwidth by N. This is the physical foundation for exploiting matmul’s high arithmetic intensity.
- Eliminate control overhead — each FMA on a CUDA Core traverses warp scheduler → dispatch → register read → execute → writeback; control consumes 80%+ of the energy. The Tensor Core array is hardwired; one mma.sync instruction activates the whole array for several cycles with zero dynamic scheduling.
- Local wires instead of global wires — wires between PEs in the array are short; wires between CUDA Cores must traverse the register file and are long. Short wires mean higher clock, lower power, smaller area.
In patents and papers NVIDIA carefully avoids calling Tensor Cores textbook systolic arrays (Google openly describes the TPU as one); internally there are likely variants such as adder trees mixed in. But this doesn’t change the picture — the critical insight is “lay down a hardwired MAC array,” not “build a faster FMA.”
Circuit Difference Cheatsheet — Topology · Control · Programming Granularity
The table below nails down the circuit differences:
| Dimension | CUDA Core | Tensor Core |
|---|---|---|
| Circuit topology | scalar FMA + pipeline | 2D MAC array + hardwired data path |
| MAC count per operation | 1 | thousands to hundreds of thousands (growing each gen) |
| Control mode | dynamic scheduling (warp scheduler dispatches each instruction) | hardwired (one instruction activates the array for several cycles) |
| Programming granularity | each thread issues independently | 32-thread warp cooperative issue |
| Suited workloads | scalar / vector / control flow / activation functions | matmul / convolution |
| Silicon area share | ~30% | ~40% (B200 SM estimate) |
SFU / TMA / TMEM — Auxiliary Circuitry Built to Feed Tensor Cores
Tensor Core throughput grew so fast that it forced a sequence of supporting circuits. They don’t add FLOPS but ensure Tensor Cores “don’t starve” — and starvation is the number-one enemy of GPU optimization.
SFU — Transcendental Functions Hardwired
The SFU (Special Function Unit) handles transcendentals — sin / cos / exp / log / sqrt / rsqrt. The circuit is a small ROM (storing tens to thousands of sample points) plus a multiply-adder: the high bits of input x look up two samples, the low bits drive linear or quadratic interpolation, producing a result in a few cycles. This is tens of times faster than running a Taylor expansion on a CUDA Core. The cost is precision — __expf(x) via SFU gets about 8-9 significant bits; expf(x) in software has full 23 bits. The reason SFU count is small (4 per SM vs 128 CUDA Cores, ratio 1:32) is that it is pipelined: it can accept a new instruction every cycle, and 4 SFUs are enough to keep a warp’s transcendentals unblocked.
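The trade-off is visible directly in source code — the intrinsic maps to the SFU, the library function to a software sequence on CUDA Cores (a hedged sketch; names are illustrative, and compiling with -use_fast_math rewrites the latter into the former wholesale):

// Hedged sketch: __expf() lowers to the SFU's hardware approximation (a few cycles, reduced
// precision); expf() runs a range-reduction + polynomial routine on CUDA Cores (slower,
// near full FP32 accuracy).
__global__ void exp_fast_vs_accurate(const float* in, float* fast_out, float* acc_out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    fast_out[i] = __expf(in[i]);   // SFU pipe: hardware table + interpolation
    acc_out[i]  = expf(in[i]);     // CUDA Core pipe: software expansion
}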
TMA — A Dedicated Unit for Async Tensor Movement
The TMA (Tensor Memory Accelerator) is a “tensor transfer specialist” introduced in Hopper. It addresses the problem that using LD/ST to move large tensors is wasteful: 32 threads of a warp each computing addresses, looping, and issuing loads has huge instruction overhead. TMA abstracts this into “tensor descriptor + one instruction”: you predefine a tensor map describing the tensor to move (multi-dim shape, stride, base address), then one thread issues a single instruction; the TMA hardware generates the address sequence, batches memory requests, writes into SMEM, and signals via async barrier when done. It is a small processor with a state machine, containing an AGU (Address Generation Unit), a request scheduler, and a layout transformer. Critically, TMA is asynchronous — the issuing thread returns immediately to do other work and re-synchronizes when data arrives. This unlocks the warp specialization pattern — a producer warp continually issues TMAs while a consumer warp continually drives Tensor Cores, fully overlapping the two.
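On the host side, the “tensor descriptor” is a CUtensorMap built through the driver API. Below is a minimal, hedged sketch for a row-major FP16 matrix and a 64×64 tile — the tile size, swizzle mode, and function name are illustrative choices, not taken from the article; error handling is omitted:

#include <cuda.h>          // driver API: CUtensorMap, cuTensorMapEncodeTiled (CUDA 12+)
#include <cuda_fp16.h>

// Hedged host-side sketch: describe an M x K row-major FP16 matrix and the 64x64 tile
// that TMA will move into SMEM; the descriptor is then passed to the kernel
// (e.g. as a __grid_constant__ parameter).
CUtensorMap make_tile_descriptor(void* gmem_base, cuuint64_t M, cuuint64_t K) {
    CUtensorMap tmap;
    cuuint64_t global_dim[2]    = {K, M};                  // innermost dimension first
    cuuint64_t global_stride[1] = {K * sizeof(__half)};    // bytes between rows
    cuuint32_t box_dim[2]       = {64, 64};                // SMEM tile shape
    cuuint32_t elem_stride[2]   = {1, 1};
    cuTensorMapEncodeTiled(&tmap,
        CU_TENSOR_MAP_DATA_TYPE_FLOAT16, /*rank=*/2, gmem_base,
        global_dim, global_stride, box_dim, elem_stride,
        CU_TENSOR_MAP_INTERLEAVE_NONE, CU_TENSOR_MAP_SWIZZLE_128B,
        CU_TENSOR_MAP_L2_PROMOTION_NONE, CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
    return tmap;
}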

Left: before Hopper, a warp uses the LDGSTS instruction to move tensors; each thread in the warp must compute an address and issue a load. Right: from Hopper H100 onward, a single thread issues one TMA instruction; the entire tensor is moved asynchronously to SMEM while the other threads simultaneously drive Tensor Cores. This is the physical basis of warp specialization.
TMEM — Tensor Core Dedicated Accumulator Memory · Blackwell Onward
TMEM (Tensor Memory) appeared only in Blackwell — 256 KB of Tensor Core-dedicated SRAM bound inside the SM. It addresses a deeper bottleneck — when Tensor Core throughput hits multi-PFLOPS, even the register file can’t keep up. Before Hopper, mma operands and accumulators lived in the 32 threads’ registers; register-file port bandwidth became the bottleneck. Blackwell introduces TMEM: operands and accumulators sit in dedicated SRAM; the new tcgen05.mma instruction explicitly targets TMEM rather than registers, bypassing register ports entirely. This is why Blackwell can reach 80.7% of theoretical peak on FP64 GEMM (the equivalently sized H200 only reaches 55.6%).
| Unit | Role | Circuit essence | Generation introduced |
|---|---|---|---|
| SFU | transcendentals | ROM + interpolation multiply-adder | pre-Volta |
| TMA | async tensor transfer | AGU + state machine + layout transform | Hopper |
| TMEM | Tensor Core dedicated accumulator memory | independent SRAM + dedicated R/W ports | Blackwell |
These three together tell the same story — the stronger the Tensor Core, the more specialized auxiliary circuitry surrounds it. This is the physical embodiment of the “specialization ladder” growing narrower as you climb.
Memory Hierarchy — Explicit Management is the Essence of the GPU
Explicit vs Implicit Boundary — CPU Lets You Forget · GPU Forces You to Face
The GPU storage hierarchy is far more complex than the CPU’s, and the core difference is not “number of levels” but the boundary of “explicit” vs “implicit”. CPUs let you forget caches exist — write int x = arr[i] and hardware fetches automatically. GPUs force you to face them — write __shared__ float tile[256] to declare the buffer, then explicitly tile[tid] = gmem[idx] to move data, then __syncthreads() to wait.
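In kernel form, the three explicit steps — declare, move, synchronize — look like this (a generic staging sketch, not code from any library; all names are illustrative):

// Hedged sketch of the declare / move / synchronize pattern. Assumes blockDim.x == 256.
__global__ void stage_then_use(const float* __restrict__ gmem, float* __restrict__ out, int n) {
    __shared__ float tile[256];                       // 1. declare the buffer explicitly
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;

    tile[tid] = (idx < n) ? gmem[idx] : 0.0f;         // 2. move the data yourself
    __syncthreads();                                  // 3. wait until the whole block has written

    // Every thread can now read its neighbours' elements from SMEM instead of
    // going back to GMEM (here: a trivial 3-point average).
    float left  = (tid > 0)   ? tile[tid - 1] : tile[tid];
    float right = (tid < 255) ? tile[tid + 1] : tile[tid];
    if (idx < n) out[idx] = (left + tile[tid] + right) / 3.0f;
}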
Why Explicit Management Is Necessary — One SM Runs 2048 Threads
Why is explicit management necessary on a GPU? Because one SM concurrently runs up to 64 warps × 32 threads = 2048 threads; there is simply no way to give each thread its own cache and prefetcher. A CPU core serves one thread and can easily afford megabytes of cache; a GPU can afford only a tiny slice of hardware per thread, so threads must explicitly organize data sharing among themselves — this is the fundamental reason SMEM must exist.
Management Mode Cheatsheet — Allocation · Movement · Access · Release
| Storage | Allocation | Data movement | Access | Release |
|---|---|---|---|---|
| Register | compiler-automatic | — | implicit | automatic |
| SMEM | programmer-declared | manual by programmer | implicit | automatic (at block end) |
| TMEM | programmer-explicit alloc | manual by programmer | via tcgen05 instructions | programmer dealloc |
| Constant | programmer-declared | one-time programmer fill | implicit | automatic |
| L1 / L2 | — | hardware-automatic | implicit | — |
| GMEM | cudaMalloc | cudaMemcpy | implicit | cudaFree |
| Other GPU HBM | NCCL allocates | NCCL/NVSHMEM | via API | NCCL releases |
The essence of GPU performance optimization is — keep data in the upper half of the hierarchy (registers / SMEM / TMEM / L2) and minimize trips back to HBM. Kernel fusion, FlashAttention, CUTLASS, warp specialization — all of these techniques boil down to this one objective.
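A toy illustration of “minimize trips back to HBM”: the unfused version writes an intermediate to GMEM and reads it back, the fused version keeps it in a register (the bias + ReLU choice and all names are illustrative, not from the article):

// Unfused: two kernels, the intermediate y makes a full HBM round trip.
__global__ void add_bias(const float* x, const float* b, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] + b[i];                    // store to HBM
}
__global__ void relu(const float* y, float* z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = fmaxf(y[i], 0.0f);              // reload from HBM
}

// Fused: the intermediate never leaves a register.
__global__ void add_bias_relu(const float* x, const float* b, float* z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = fmaxf(x[i] + b[i], 0.0f);       // one load pair, one store
}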
Precision Evolution — FP4 Isn’t About Saving Money · It Treats One Chip as Four
Why Push Precision Down — More MACs Per Die
Mixed precision is the second main thread of this decade’s compute uplift — equal in stature to Tensor Core topology. Every generation introduces a lower precision, and the reason isn’t “cheaper” — it’s using the same silicon as multiple chips.

Physical Basis — Multiplier Area ∝ Width²
The underlying physical reason: multiplier area scales roughly with the square of bit width.
- FP32 multiplier: 24×24-bit integer multiplier (implicit 1 bit) → area unit 1
- FP16 multiplier: ~11×11 → area about 1/5
- FP8 multiplier: ~5×5 → area about 1/25
- FP4 multiplier: ~2×2 → area about 1/144
For the same silicon area, you can lay down roughly 100× more FP4 multipliers than FP32 multipliers. Put differently — FP4 is not “cheap FP16,” it is “4× the MAC throughput of FP16 from the same silicon.”
Below is the Tensor Core throughput comparison across precisions on B200:
| Precision | Input width | Accumulator | B200 throughput | Relative to FP32 |
|---|---|---|---|---|
| FP32 (CUDA Core) | 32-bit | 32-bit | 80 TFLOPS | 1× |
| TF32 (Tensor Core) | 19-bit effective | 32-bit | ~1,100 TFLOPS | ~14× |
| FP16 / BF16 (Tensor Core) | 16-bit | 32-bit | 2,250 TFLOPS | 28× |
| FP8 (E4M3 / E5M2) | 8-bit | 32-bit | 4,500 TFLOPS | 56× |
| FP6 (E3M2 / E2M3) | 6-bit | 32-bit | ~6,750 TFLOPS | 84× |
| FP4 (E2M1) | 4-bit | FP32 or FP16 | 9,000 TFLOPS | 112× |
| FP4 + 2:4 sparsity | 4-bit | FP32 or FP16 | 18,000 TFLOPS | 225× |
| FP64 (Tensor Core) | 64-bit | 64-bit | 40 TFLOPS | 0.5× |
Note — Tensor Cores don’t only run low precision. They also support FP32 (via TF32 mode) and FP64 (for HPC). But low-precision throughput is far higher than high-precision, by deliberate trade-off, not capability limit.
Precision Added Per Generation — One Level Down Each Generation
The precisions added by each generation, mapped to hardware generations:
| Generation | Year | New precision | Main driver |
|---|---|---|---|
| Volta | 2017 | FP16 (Tensor Core first) | training |
| Turing | 2018 | INT8 / INT4 | inference quantization |
| Ampere | 2020 | TF32 / BF16 / 2:4 sparsity | training stability + free speedup |
| Hopper | 2022 | FP8 (E4M3 + E5M2) | LLM training + inference |
| Blackwell | 2024 | FP6 + FP4 + microscaling | LLM inference peak throughput |
| Rubin | 2026 | optimized FP4 + more microscaling | agent / reasoning inference |
Every generation adds a new precision — but whether it’s usable is another matter. FP8 took two years from H100 release (2022) to production usability (2024), mainly via NVIDIA’s Transformer Engine library managing scaling factors automatically. FP4 is similar — hardware support is easy; numerical stability is the software task.
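What “managing scaling factors” means in practice can be sketched in a few lines — pick a per-tensor scale from the observed amax, multiply before casting to FP8, and fold the scales back out after the matmul. The snippet below is a simplified illustration using the cuda_fp8.h conversion type; Transformer Engine automates exactly this bookkeeping (its internals are not reproduced here):

#include <cuda_fp8.h>   // __nv_fp8_e4m3: 4 exponent bits, 3 mantissa bits, max magnitude ~448

// Hedged sketch of per-tensor FP8 quantization. Real pipelines track amax over a window
// of recent iterations; here `scale` is assumed precomputed as 448.0f / amax.
__global__ void quantize_to_fp8(const float* x, __nv_fp8_e4m3* q, float scale, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) q[i] = __nv_fp8_e4m3(x[i] * scale);   // stretch into E4M3's range, then cast
}
// After the FP8 matmul, the (FP16/FP32) output is divided by scale_A * scale_B to undo the stretch.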

TF32 and Transformer Engine — Silent Speedups Without Code Changes
TF32 is a particularly elegant design. Its programmer-facing interface is FP32 (8-bit exponent + 23-bit stored mantissa, identical to IEEE FP32), but internally it truncates the mantissa to 10 bits (matching FP16) while preserving the 8-bit exponent. So the same cublasSgemm call — with at most a math-mode flag flipped, often by the framework on the user’s behalf — uses Tensor Cores for FP32 matmul, running ~14× faster than true FP32, with precision loss almost imperceptible for deep learning. This is NVIDIA’s clever way of letting Tensor Cores “silently take over all FP32 matmul.”
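In raw cuBLAS the flag in question is the handle’s math mode; a minimal hedged sketch, with dimensions and pointers as placeholders and error checks omitted:

#include <cublas_v2.h>

// Hedged sketch: the same cublasSgemm call, with the handle switched into TF32 mode.
void sgemm_tf32(cublasHandle_t handle, int m, int n, int k,
                const float* dA, const float* dB, float* dC) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);   // allow TF32 Tensor Core paths
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, dA, m, dB, k, &beta, dC, m);
}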
Deep Dive by Generation — Pascal · Volta · Turing · Ampere · Hopper · Blackwell · Rubin

Decompose the decade generation by generation, and each maps to a turning point in AI workloads:
Pascal · Volta · Turing — 2016-2018 · Tensor Core Era Begins
Pascal (2016, P100) — year one of data-center GPUs. First time abandoning GDDR for HBM2 (720 GB/s), first introduction of NVLink (160 GB/s, ~5× PCIe), first native FP16 (on CUDA Cores, 2× FP32 throughput). No Tensor Core yet. This generation proved the commercial viability of “GPUs in the data center training deep learning.” Most GPT-2 / BERT-era training ran on P100s.
Volta (2017, V100) — Tensor Core era begins. First-gen Tensor Cores, 8 per SM, performing 4×4×4 FP16 matmul, delivering 12× training speedup. This was the fundamental shift in GPU design philosophy — from “general-purpose parallel compute chip” to “AI chip purpose-built for matmul acceleration.” From this generation onward, per-generation compute gains come mainly from Tensor Cores, not CUDA Cores. The earliest GPT-3 was trained on V100 clusters.
Turing (2018, T4 + RTX 20) — inference market diverges. T4 is low-power (70W), introduces INT8 Tensor Cores, targets inference serving; the same-generation consumer RTX 20 series introduces RT Cores (ray tracing hardware). Proved viability of the “V100 for training, T4 for inference” differentiation strategy.
Ampere · Hopper — 2020-2022 · Industrializing LLM Training
Ampere (2020, A100) — industrializing LLM training. Third-gen Tensor Cores introduced TF32 (auto-speedup for vast amounts of FP32 code with hardly any changes), BF16 (wider numerical range, more stable training), and 2:4 structured sparsity (another 2× throughput). Added MIG (slice one A100 into 7 small GPUs), third-gen NVLink (600 GB/s), and HBM2e. A100 is NVIDIA’s most durable card — still serving in large numbers across data centers in 2026.
Hopper (2022, H100) — Transformer specialization. Fourth-gen Tensor Cores + the Transformer Engine software layer = native FP8 + automatic quantization for the first time. Introduced TMA to solve “the Tensor Core is too fast for ordinary loads to feed,” and Thread Block Cluster to let multiple SMs cooperate (up to 8 blocks per cluster sharing distributed SMEM). NVLink 4 (900 GB/s) + HBM3 (3 TB/s). H100 is 6-9× faster than A100 on Transformer training. Grace-Hopper Superchip (GH200) for the first time tightly couples CPU and GPU via NVLink-C2C.
Blackwell · Rubin — 2024-2026 · Multi-die + FP4 + Platform
Blackwell (2024, B200) — multi-die + FP4 era. First dual-die design — two reticle-limit dies connected by 10 TB/s internal interconnect, presenting to software as a single GPU, totaling 208 B transistors. Fifth-gen Tensor Cores natively support FP4 / FP6, paired with second-gen Transformer Engine’s micro-tensor scaling. Introduced TMEM (256 KB Tensor Core dedicated SRAM), 2-CTA Cluster (two SMs jointly feeding one Tensor Core UMMA), and decompression engines (LZ4/Snappy/Deflate, for data analytics). NVLink 5 (1.8 TB/s) + HBM3e (8 TB/s, 192 GB). GB200 NVL72 integrates 72 B200s + 36 Grace CPUs into a single rack.
Rubin (2026, R100) — platform era. 336 B transistors, a truly multi-die design (two compute dies + two I/O dies). HBM4 (288 GB, 22 TB/s bandwidth, ~2.8× B200). 224 SMs/GPU, FP4 throughput 50 PFLOPS (sparse). NVLink 6 (3.6 TB/s/GPU). The most important shift is philosophical — NVIDIA no longer sells GPUs but an entire rack-scale AI compute platform; the Vera Rubin platform is organized around co-designing the full set of chips in the rack (R100 GPU + Vera CPU + NVLink 6 Switch + ConnectX-9 + BlueField-4 + Spectrum-6, among others).
Ten-Year Main Threads — Precision · Array · Storage · Interconnect · Business Form
In a single table, the main threads of the decade are crystal clear:
| Thread | Manifestation |
|---|---|
| Precision shrink | FP16 → INT8 → BF16 → FP8 → FP4, one added each gen |
| Tensor Core array size | 4×4×4 → 16×16×16 → 64×8×16 → 128×256×16 |
| On-die storage made explicit | SMEM expanded → cp.async → TMA → Cluster → distributed SMEM → TMEM |
| Exponential interconnect growth | NVLink 160 GB/s → 3600 GB/s · HBM 720 GB/s → 22 TB/s |
| Business form | sell GPUs → sell DGX → sell racks → sell AI factories |
CUDA Cores are nearly absent from this table — their FP32 throughput went from 10.6 to 100 TFLOPS, only 10× in a decade. General compute ceding ground to specialized compute is the most direct conclusion of the decade.
Programming Model Expansion — Thread → Warp → Block → Cluster → Grid
From Three Tiers to Five — Warp and Cluster Inserted in the Middle
As hardware complexity rose, the programmer-facing programming model expanded with it. The classic CUDA model is three tiers — Thread / Block / Grid: each thread is independent, threads inside a block share SMEM, and the Grid is all blocks. But after the Tensor Core era, Warp and Cluster were forcibly inserted in the middle, making it five:

Why Warp Must Become Explicit — mma.sync Is Physically a Warp-Level Instruction
Why did the Tensor Core era have to make warps explicit? Because the mma.sync instruction is physically warp-level — 32 threads must cooperatively issue it; elements of matrices A and B are distributed across the registers of these 32 threads in a specific layout, and the Tensor Core hardware gathers all threads’ register contents to feed the array. A 16×16×16 matmul needs 512 FP16 inputs = 1024 bytes, which doesn’t fit in a single thread’s registers — it must be spread across 32 threads. So the programmer must upgrade from “what does my thread compute” to “what does our warp jointly compute”.
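At the PTX level the contract is explicit: each thread passes only its own slice of the fragments, and all 32 threads must issue the matching instruction. A hedged sketch for one common shape (m16n8k16, FP16 inputs, FP32 accumulate); how the fragments were staged into registers — typically via ldmatrix — is omitted, and the helper name is made up:

// Hedged sketch: each of the 32 threads holds 4 registers of A, 2 of B, and 4 of C/D;
// the hardware gathers all of them to feed one 16x8x16 Tensor Core operation.
__device__ void mma_m16n8k16_f16(float d[4], const unsigned a[4],
                                 const unsigned b[2], const float c[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}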
Cluster’s Introduction Rationale — Multiple SMs Jointly Feed One mma
Cluster follows the same logic — Hopper-era Tensor Core throughput grew to need multiple SMs feeding data, so clusters were introduced so that blocks in a GPC share SMEM (distributed SMEM) and TMA. Blackwell’s 2-CTA cluster further binds two SMs into a group to jointly feed one UMMA Tensor Core instruction.
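The programmer-facing surface of a cluster is small — a cluster shape on the kernel plus cooperative-groups calls to reach a neighbouring block’s SMEM. A hedged sketch (sm_90+; the cluster size of 2, block size of 256, and all names are illustrative):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Hedged sketch of distributed SMEM: two blocks form a cluster and each also reads the
// tile its partner staged, instead of loading it again from GMEM. Assumes blockDim.x == 256.
__global__ void __cluster_dims__(2, 1, 1) cluster_demo(const float* gmem, float* out) {
    __shared__ float tile[256];
    cg::cluster_group cluster = cg::this_cluster();

    tile[threadIdx.x] = gmem[blockIdx.x * 256 + threadIdx.x];
    cluster.sync();                                   // every block in the cluster has staged

    // map_shared_rank() returns a pointer to the *other* block's copy of `tile`
    unsigned peer = cluster.block_rank() ^ 1;
    const float* peer_tile = cluster.map_shared_rank(tile, peer);
    out[blockIdx.x * 256 + threadIdx.x] = tile[threadIdx.x] + peer_tile[threadIdx.x];

    cluster.sync();                                   // keep peer SMEM alive until everyone is done
}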
Code Comparison — naive matmul → wmma → wgmma+TMA → tcgen05+TMEM
The most direct experience is to see how the same task (matmul) evolves across generations. The four code snippets below show the typical matmul shape for Pascal → Volta-Ampere → Hopper → Blackwell.
Pascal · Naive Thread Level — Each Thread Computes One Element
__global__ void matmul_naive(float* A, float* B, float* C, int N) {
    // One thread per output element of C
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;
    float sum = 0;
    for (int k = 0; k < N; k++) {
        sum += A[row * N + k] * B[k * N + col];   // every operand read straight from GMEM
    }
    C[row * N + col] = sum;
}
Thinking granularity: single thread. Each thread computes one element of C, looping N times for the inner product. Code readability is high — but performance is only 5-10% of theoretical peak (no Tensor Core, data loaded directly from GMEM).
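The host side shows the classic trio directly — pick a block shape, derive the grid from the problem size, launch. A minimal sketch, assuming device buffers dA/dB/dC are already allocated and filled:

// Hedged sketch of the matching host-side launch for matmul_naive.
void launch_matmul_naive(float* dA, float* dB, float* dC, int N) {
    dim3 block(16, 16);                               // 256 threads per block
    dim3 grid((N + block.x - 1) / block.x,            // enough blocks to cover the N x N output
              (N + block.y - 1) / block.y);
    matmul_naive<<<grid, block>>>(dA, dB, dC, N);     // Grid of Blocks of Threads
    cudaDeviceSynchronize();
}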
Volta–Ampere · wmma Fragment — Warp Cooperation · First Tensor Core
#include <mma.h>
using namespace nvcuda::wmma;
__global__ void matmul_wmma(half* A, half* B, float* C, int N) {
    __shared__ half tile_A[16][16];
    __shared__ half tile_B[16][16];
    // fragments live in the registers of all 32 threads in the warp
    fragment<matrix_a, 16, 16, 16, half, row_major> a_frag;
    fragment<matrix_b, 16, 16, 16, half, col_major> b_frag;
    fragment<accumulator, 16, 16, 16, float> c_frag;
    fill_fragment(c_frag, 0.0f);
    for (int k = 0; k < N; k += 16) {
        // Async load into SMEM (cp.async from Ampere)
        // ... loading logic omitted ...
        __syncthreads();
        load_matrix_sync(a_frag, &tile_A[0][0], 16);   // warp-cooperative
        load_matrix_sync(b_frag, &tile_B[0][0], 16);
        mma_sync(c_frag, a_frag, b_frag, c_frag);      // one instruction = 16×16×16 matmul
        __syncthreads();
    }
    store_matrix_sync(&C[...], c_frag, N, mem_row_major);
}
Thinking granularity upgrades to the whole warp. Note several key changes — every wmma API ends with _sync, meaning “the whole warp must call and execute it synchronously”; the code has no threadIdx — fragments hide the thread-level complexity; mixed precision appears for the first time (input half, accumulator float). Performance can reach 60-80% of peak.
Hopper · TMA + Warp Specialization — Producer / Consumer Pipelined Overlap
// Extremely simplified — real CUTLASS code is several times longer
__global__ void matmul_hopper(...) {
    __shared__ AsyncBarrier bar;
    extern __shared__ half smem_buffer[];
    int warpId = threadIdx.x / 32;
    if (warpId == 0) {
        // Producer warp — issues TMA loads only, no compute
        for (int k = 0; k < N; k += TILE_K) {
            cuda::memcpy_async(smem_A, gmem_A_tile, tma_desc_A, bar);
            cuda::memcpy_async(smem_B, gmem_B_tile, tma_desc_B, bar);
            bar.arrive();
        }
    } else {
        // Consumer warps — drive Tensor Cores, no loads
        for (int k = 0; k < N; k += TILE_K) {
            bar.wait();                                  // wait for data arrival
            wgmma::mma_async(c_frag, smem_A, smem_B);    // async mma!
            wgmma::commit_group();
            wgmma::wait_group<0>();
        }
    }
}
Thinking granularity upgrades again — warp specialization. Different warps do completely different work — some only move data (producer), some only drive Tensor Cores (consumer); the two pipes synchronize via async barrier. wgmma (warpgroup MMA) replaces wmma; operands can come directly from SMEM rather than first being loaded into registers. Async barriers don’t count threads — they count bytes — “I wait for 16 KB of data to land.” Performance reaches 85-95% of peak, but such code is essentially writable only by CUTLASS / FlashAttention teams.
Blackwell · tcgen05 + TMEM — 2-CTA Cluster · Operands Bypass Registers
__global__ void matmul_blackwell(...) {
    // Explicit TMEM allocation
    auto tmem_c = tcgen05::alloc<float, 128>();
    __shared__ AsyncBarrier bar;
    if (in_producer_warp) {
        // TMA multicast — one load, both SMs in the 2-CTA cluster receive
        cuda::memcpy_async_tma_multicast(smem_A, gmem_A, cluster);
        cuda::memcpy_async_tma_multicast(smem_B, gmem_B, cluster);
    } else {
        // Operands come from SMEM, result writes into TMEM (no register pressure)
        tcgen05::mma(tmem_c, smem_A, smem_B);
        tcgen05::commit();
    }
    // Read result out of TMEM
    tcgen05::load(c_reg, tmem_c);
    tcgen05::dealloc(tmem_c);
}
Another manual-management layer is added — explicit TMEM alloc / dealloc. The tcgen05.mma instruction takes operands from SMEM and stores accumulators in TMEM, fully bypassing the register file. The 2-CTA cluster lets two adjacent SMs share data via distributed SMEM + TMA multicast and jointly feed one mma. A single activation of the cluster does an enormous amount of work.
Per-Generation Complexity Comparison — Each Generation 2-3× More Code
Put the four snippets together and the evolution is plain:
| Generation | Thinking granularity | New things the programmer must manage | Code volume |
|---|---|---|---|
| Pascal (and earlier) | Thread | thread / block / grid trio | ~20 lines |
| Volta–Ampere | + Warp | fragment layout / _sync semantics / cp.async | ~100 lines |
| Hopper | + Cluster | TMA descriptors / warp specialization / async barrier | ~300 lines |
| Blackwell | (Cluster upgrade) | TMEM alloc/dealloc / tcgen05 / 2-CTA / multicast | ~500 lines |
Each generation roughly 2-3× more complexity. But this is only the bare-CUDA perspective — in the real world, 90% of users don’t write this kind of code.
Abstraction Layers Absorb the Complexity — torch.matmul Doesn’t Change · cuBLAS Is Rewritten Each Gen
Code Volumes Across Five User Tiers — From 1 Line to 1000 Lines of Assembly
This is the essence of NVIDIA’s design philosophy — hardware complexity exploded, absorbed by layered abstractions, while the regular user’s experience stayed stable or even simpler.
For the same matmul task, the code volume across user groups:
| User group | Invocation | Code volume | Performance | What you must know |
|---|---|---|---|---|
| Regular AI engineer | torch.matmul(a, b) | 1 line | near peak | almost nothing |
| Custom kernel engineer | Triton @triton.jit + tl.dot | ~50 lines Python | 80-95% peak | tile partitioning / SMEM concept |
| High-performance library developer | CUTLASS templates | ~100 lines C++ | near peak | Tensor Core / CuTe layout |
| Bare-CUDA engineer | direct mma.sync | ~300 lines C++ | depends on skill | warp / SMEM banks / mma instructions |
| Extreme optimizer | inline PTX | ~1000 lines assembly | theoretical limit | register allocation / instruction scheduling / dependency tracking |
The further down, the more control, but the work explodes. FlashAttention’s author Tri Dao works at the bare-CUDA + inline PTX level, which is how a single person can ship some kernels faster than cuBLAS. But the vast majority of AI engineers stay at the top tier.
Why Hand-Written Code Isn’t Always Faster Than cuBLAS — NVIDIA’s Moat Is More Than Hardware
An interesting reversal — 90% of the time, PyTorch + cuBLAS beats a typical engineer’s hand-written CUDA by a wide margin. Three reasons:
- cuBLAS is tuned by NVIDIA engineers per generation; they know every detail of Tensor Cores (including undocumented ones).
- cuBLAS internally is a kernel library — for the same matmul, dozens of kernel implementations exist, and at runtime the optimal one is auto-picked by shape.
- Optimizations few would think of — double buffering, warp specialization, bank-conflict avoidance, register pressure balancing.
So NVIDIA’s moat is not just the hardware, it’s this whole top-to-bottom software stack. This is also why AMD GPUs can catch up to or even surpass NVIDIA on paper specs while the ecosystem remains far behind — hardware can be matched; the libraries and compilers accumulated over the past decade-plus cannot.
Software Stack and Open-Source Landscape — Open Shell · Closed Core
CUDA Software Stack in Layers — L7 Frameworks → L1 Hardware · A Seven-Tier Slice
Draw the CUDA library ecosystem accumulated over a decade-plus, and a counterintuitive truth emerges — the most central and most used libraries are precisely the closed-source ones, while open-source ones are upper, peripheral, or customization-facing.
Open vs Closed Boundary — The Most-Used Libraries Are Precisely Closed
Many people mistakenly assume cuBLAS and cuDNN are open source — in fact they are NVIDIA’s most tightly guarded products. pip install nvidia-cublas-cu12 installs not source code but precompiled .so binaries; the license field clearly states LicenseRef-NVIDIA-Proprietary.
| Library | Open-source status | License | What you can see |
|---|---|---|---|
| CUDA Runtime / Driver | closed | NVIDIA Proprietary | only headers and .so |
| cuBLAS / cuBLASLt | closed | NVIDIA Proprietary | only headers and .so |
| cuDNN (core) | closed | NVIDIA Proprietary | only headers and .so |
| cuDNN Frontend (C++ wrapper) | open | MIT | wrapper, calls closed cuDNN |
| cuFFT / cuSPARSE / cuSOLVER | closed | NVIDIA Proprietary | only headers and .so |
| CUTLASS | fully open | BSD-3-Clause | full C++ template source |
| Triton (OpenAI) | fully open | MIT | full source |
| NCCL | fully open | BSD-3 | full source |
| NVSHMEM (post-2024) | open | BSD-3 | full source |
| Transformer Engine | fully open | Apache 2.0 | full source |
| TensorRT | partial | Apache 2.0 (plugins) | only plugins, parsers; core closed |
| TensorRT-LLM | partial | Apache 2.0 (frontend) | frontend Python, depends on closed TensorRT |
| RAPIDS (cuDF / cuML) | fully open | Apache 2.0 | full source |
| Megatron-Core / NeMo | fully open | Apache 2.0 | full source |
The Real Moat — Not a Single Library · The Decade-Plus Ecosystem
NVIDIA pursues an “open shell + closed core” product strategy: users can extend (write custom plugins, modify CUTLASS templates, customize NCCL communications) but cannot copy the core implementations. Even if AMD ROCm open-sources the equivalent libraries (rocBLAS / MIOpen are both open), performance still trails cuBLAS / cuDNN by 2-3 years — because the optimization tricks of cuBLAS / cuDNN (the latest Tensor Core instructions, warp scheduling, TMA usage) live in the binary, and AMD can’t “copy” them — only reverse-engineer or reinvent them.
This is NVIDIA’s true moat — not any single library, but dozens of libraries layered over a decade-plus.
Summary — Scale Tensor Cores · Scale Precision · Scale Abstraction Layers
Three Main Threads — Tensor Core · Precision · Abstraction
Compressing the decade into one sentence: NVIDIA did three things this decade — scale Tensor Cores, scale mixed precision, scale abstraction layers to absorb complexity.
Specifically:
- Scale Tensor Cores — essentially all compute gains come from here. On a B200, Tensor Core throughput is roughly 28× (dense FP16) to 225× (sparse FP4) the CUDA Core FP32 rate. It is not a faster CUDA Core; it is a completely different systolic-like array.
- Scale precision — push one level down each generation. FP32 → TF32 → FP16/BF16 → FP8 → FP4. Each step down roughly quadruples how many multipliers fit in the same silicon and doubles delivered matmul throughput — a near-free LLM inference speedup, as long as the numerics hold.
- Scale abstraction layers — torch.matmul(a, b) doesn’t change a single character, while cuBLAS / cuDNN gets rewritten each generation. CUTLASS / Triton / Transformer Engine absorb hardware complexity, letting users stay at the “PyTorch is enough” experience.
Why Tensor Cores — Physical · Architectural · Commercial
If you ask “why Tensor Cores instead of CUDA Cores,” the answer lives at three layers:
- Physical layer — CUDA Cores are near frequency limits; piling on SMs is constrained by power and thermals; Tensor Cores still have two open paths: precision shrink and array scale.
- Architectural layer — Tensor Cores’ hardwired control makes control overhead nearly zero; effective compute per silicon area is much higher.
- Commercial layer — Tensor Cores + closed cuBLAS / cuDNN are NVIDIA’s true moat; no matter how fast AMD hardware gets, it can’t replicate this software stack.
Five-Year Outlook — FP2 · Larger Clusters · Platformization
The trajectory continues for the next 5 years — Rubin Ultra (2027) / Feynman (2028) are already on the roadmap, FP2 / INT2 may appear in the generation after that, clusters may scale from 8-16 blocks to dozens, TMEM may further specialize into more dedicated storages. For you, once you grasp this main thread, any future NVIDIA news, architecture, or product can be located immediately: it’ll be one more notch on one of “scale Tensor Cores / scale precision / scale abstraction layers.”
References — Whitepapers · Papers · Docs
Whitepapers and Official Documentation
- NVIDIA Architecture Whitepapers — each release ships with a very detailed Architecture Whitepaper; this is the most authoritative primary source. At minimum read the Hopper and Blackwell whitepapers.
- CUDA C++ Programming Guide — official handbook; chapter 3 on async copy, chapter 5 on Tensor Cores, chapter 7 on clusters give the concrete syntax of the model’s expansion.
- CUTLASS GitHub — the latest CUTLASS 4.x + CuTe DSL is the most authoritative open-source implementation pushing every architectural feature to its limit. github.com/NVIDIA/cutlass
Papers and Engineering Blogs
- Tri Dao’s FlashAttention papers — FlashAttention (2022) / FA-2 (2023) / FA-3 (2024). Canonical example of stringing TMA + warp specialization + Tensor Cores; FA-3 exploits nearly every Hopper feature.
- Colfax Research TMA / wgmma tutorials — the clearest engineering-level walkthrough of Hopper’s TMA, wgmma, and async barriers.
- Aleksa Gordić’s “matmul anatomy” blog — step-by-step derivation from a naive matmul to a CUTLASS-style kernel; one of the best continuous reads from beginner to advanced.
Textbooks and Courses
- “Programming Massively Parallel Processors” (Kirk & Hwu) — the standard GPU parallel programming textbook covering hardware, CUDA, and distribution; recent editions add substantial Volta/Ampere-era content.
- CMU 15-418 / Stanford CS149 — lecture notes and assignments from both are public, covering parallel architectures, SIMT, and memory models more clearly than most books.