The AI Inference Chip Spectrum — Seven Gradients from General GPU to Model-Etched Silicon

If we had to summarize the 2025-2026 AI inference chip landscape in a single sentence, it’s a spectrum of seven gradients, running from NVIDIA GPUs’ “runs anything” at one end to Taalas HC1’s “runs one model only” at the other. Each step to the right brings a 3-10× speedup, but trims away a slice of model coverage.

This isn’t a story of NVIDIA-takes-all. AI inference will account for roughly two-thirds of all AI workloads in 2026, large enough to support a fleet of specialized chip companies; large-model architectures unexpectedly converged on the Transformer after 2022, giving “design a circuit for one neural-network family” commercial meaning for the first time; meanwhile, the power bottleneck in data centers has made energy efficiency more valuable than raw compute — these three forces hitting simultaneously is what pushed non-general architectures from academic papers into commercial reality.

This article doesn’t organize by vendor; it organizes by architectural principle. We’ll walk left-to-right along the spectrum, examining what physical mechanism each gradient uses to push “specialization” one step further, and what price it pays. At the end we discuss the photonic path separately — it isn’t on the main spectrum, but it has a completely different fate along the parallel branch of “optical compute vs optical interconnect.”

Overview — a spectrum compressing seven paths

A Spectrum Diagram — seven gradients from GPU to Taalas

Lay the seven gradients side-by-side on one axis and you see a clean continuum — colors transitioning linearly from cool (general) to warm (specialized), speeds racing from tens of tok/s on the left to tens of thousands on the right.

AI Inference Chip Spectrum · General → Specialized: flexibility decreases, speed increases, and valuation follows the same gradient. From left to right: NVIDIA GPU (CUDA + Tensor, ~50 tok/s, any operator), Cerebras WSE-3 (wafer-scale, ~2,000, large models · general), Google TPU (systolic array, ~150 Trillium, mainstream dense models), Groq LPU (static dataflow, ~600, compile-time fixed graph), d-Matrix Corsair (digital in-memory, ~500 at 2 ms/tok, Transformer family), Etched Sohu (Transformer ASIC, ~60,000 on Llama 70B, Transformer only), Taalas HC1 (model etched, ~17,000 on Llama 8B, single model). Speed = single-user single-stream tok/s, benchmarked on Llama 70B except where Taalas notes otherwise; flexibility = range of models supported, narrower to the right.
The seven gradients run from general parallelism (NVIDIA GPU) all the way to model-etched silicon (Taalas HC1). Each color block represents a unique way of organizing the circuit — the degree of trim/lock/burn increases, and speed and flexibility form a precise inverse gradient.

A subtle point on this spectrum: Cerebras is not on the “specialization spectrum,” it’s “a different physical implementation” of general parallelism — using an entire wafer to solve the inter-chip communication bottleneck rather than modifying the circuit topology. But because it sits roughly alongside NVIDIA GPUs on the “general” dimension, I’ve placed it at the far left, near the GPU. We’ll cover this distinction in detail later.

Why 2025-2026 — four conditions met simultaneously

This spectrum didn’t appear out of nowhere in 2026. Systolic arrays have papers dating to 1979, in-memory computing has been bouncing around academia since the 2010s, photonic computing can be traced to optical neural networks in the 1980s. The new thing isn’t the technology — it’s that they all became commercially viable at the same time. Four conditions came together: the inference market grew large enough (roughly two-thirds of all AI workloads by 2026) to amortize dedicated tape-outs; model architectures converged on the Transformer, so betting on one neural-network family stopped being reckless; low-precision inference (INT8 / FP8 / FP4) became the norm, which non-GPU datapaths can actually exploit; and data-center power became the binding constraint, making energy efficiency worth paying for in flexibility.

Remove any one of these conditions and the spectrum doesn’t hold. The 2018 compute market wasn’t big enough to support specialized ASICs; 2020’s FP32 dominance left analog computing with insufficient precision; in 2014, RNN/Transformer/CNN coexistence meant no one dared “bet on one architecture.” 2025-2026 is the window — and that window probably lasts 5-7 years, the period of Transformer dominance.

Seven-Gradient Comparison — speed vs flexibility

A summary table of the key parameters across seven gradients — this is the “cheat sheet” for everything that follows:

| Gradient | Representative Product | Architectural Core | Llama 70B Single-Stream tok/s | Flexibility | Status |
|---|---|---|---|---|---|
| 1 | NVIDIA H100 / B200 | CUDA + Tensor Core | ~50 | Any operator + training + inference + HPC | >80% training market share |
| 2 | Cerebras WSE-3 | Wafer-scale integration | ~2,000 | Any model + training + inference | 2026/5 IPO, ~$49B valuation |
| 3 | Google TPU v7 | Systolic array ASIC | ~150 (Trillium) | Mainstream dense models | Scaled, Anthropic 1M TPUs |
| 4 | Groq LPU | Static dataflow | ~600 | Any Transformer | ~$6.9B valuation, $20B NVIDIA backing |
| 5 | d-Matrix Corsair | Digital in-memory computing | ~500 (2 ms/tok) | Transformer family | Shipping, $2B valuation |
| 6 | Etched Sohu | Transformer ASIC | ~60,000 (8-card server) | Transformer only | Shipping to early customers |
| 7 | Taalas HC1 | Model etched | ~17,000 (Llama 8B, single user) | Single model | Released 2026/2, 2-month tape-out hedge |

Note that the tok/s figures for gradients 6 and 7 aren’t measured under the same benchmark — Etched’s ~60K is the single-stream figure for an 8-card Sohu server running Llama 70B (the same server’s aggregate throughput exceeds 500K tok/s), while Taalas’s 17K is single-chip, single-user throughput on Llama 8B. The two can’t be compared directly — the further right you go in specialization, the less meaningful “fair comparison” becomes, because the scope of what they can run has already diverged.

General Parallelism — NVIDIA GPU and Cerebras Wafer

The two leftmost gradients on the spectrum both belong to “general parallelism” — they can run any operator, both training and inference. But their physical implementations diverge completely: NVIDIA bets on “extreme single-card + high-speed interconnect to form clusters,” while Cerebras bets on “eliminate inter-chip communication with an entire wafer.” Neither path changed the essence of neural network computation; they just optimize the physical form of “general parallel architecture” from different angles.

NVIDIA GPU — the dual-track of CUDA + Tensor Core

The evolution of NVIDIA’s data center GPUs has been broken down in detail in GPU Architecture: Ten Years of Evolution — here we only flag the two key points relevant to spectrum positioning.

NVIDIA DGX B200 server
NVIDIA DGX B200 — 8 Blackwell GPUs interconnected via NVLink, representing the current flagship of the “extreme single-card + high-speed interconnect” path (source · NVIDIA DGX B200 product page). The “general” gradient at the far left of the spectrum — the same chip can run training, inference, HPC, graphics.

First, NVIDIA GPUs are a dual-track architecture: CUDA Cores handle general parallel computation (any instruction, any data type), while Tensor Cores handle matrix multiply acceleration. A B200’s CUDA Core throughput is only 80 TFLOPS, but its Tensor Core runs at 9,000 TFLOPS in FP4 — there are two completely independent circuits on the same chip. This means an NVIDIA GPU is actually a hybrid on the “general vs specialized” spectrum — the Tensor Core part is already quite specialized, but because the CUDA Core part is extremely general, the overall position remains “the far left.”

Second, NVIDIA GPUs’ “waste” on inference is their biggest weakness. A Llama 70B inference task is mainly matrix multiply + softmax + LayerNorm + activation functions — the general portion of the CUDA Cores sits largely idle, and the RT Cores (ray tracing) and video codec engines aren’t used at all. Most of the GPU’s silicon budget is essentially “on standby,” with only the Tensor Cores doing work. That’s why the other six gradients dare to challenge the GPU — they all attack this fundamental waste by cutting away everything that isn’t serving inference.
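
A quick sanity check on that ~50 tok/s figure, under the simplifying assumption that single-stream decoding is limited purely by how fast the weights can be reread from HBM. This is a minimal sketch: the 3.35 TB/s figure is H100 SXM’s headline HBM3 bandwidth, and real numbers shift with batching, KV-cache traffic, and kernel efficiency.

```python
# Minimal sanity check: assume single-stream decode is bound purely by rereading
# the weights from HBM once per token (ignores KV-cache traffic, batching, and
# kernel efficiency; 3.35 TB/s is H100 SXM's headline HBM3 bandwidth).
def decode_tok_per_s(params_billion: float, bytes_per_param: float, hbm_tb_per_s: float) -> float:
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return hbm_tb_per_s * 1e12 / weight_bytes

print(decode_tok_per_s(70, 2.0, 3.35))  # FP16 weights -> ~24 tok/s
print(decode_tok_per_s(70, 1.0, 3.35))  # FP8 weights  -> ~48 tok/s, in line with the ~50 tok/s above
```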

NVIDIA is well aware. The Transformer Engine starting with Hopper, Blackwell’s FP4 data path, Rubin’s continuing increase in Tensor Core share within each SM — NVIDIA is quietly dragging itself toward the right of the spectrum, keeping the CUDA Core fallback path only for ecosystem reasons.

Cerebras WSE-3 — an entire wafer eliminating inter-chip communication

Cerebras chose a deeply counterintuitive path — everyone else cuts small chips out of wafers; Cerebras uses an entire 300 mm wafer as one chip.

Cerebras WSE-3 wafer-scale chip
Cerebras Wafer-Scale Engine 3 — a single 300 mm wafer made into one chip, 46,225 square millimeters, 4 trillion transistors, 900,000 cores, 44 GB on-chip SRAM (source · cerebras.ai/chip). What it solves isn’t “faster compute” but the “inter-chip communication bottleneck” — letting an entire model live on a single “chip.”

WSE-3 occupies an entire 300 mm wafer, 46,225 square millimeters, 57× larger than an H100, with 4 trillion transistors, 900,000 cores, 44 GB of on-chip SRAM, and 21 PB/s on-chip memory bandwidth. It solves engineering problems that were considered unsolvable for decades — wafer-level yield (via redundancy + interconnect rerouting), power delivery (liquid cooling + high-density power grid), thermals (on-chip liquid cooling channels).

But Cerebras’s key insight isn’t about “faster compute”; it’s that inter-chip communication is roughly 100× worse than on-chip communication: hops over NVLink or InfiniBand cost hundreds of nanoseconds to microseconds and at best a couple of TB/s per link, while the WSE-3’s on-chip fabric moves data in nanoseconds at an aggregate 21 PB/s.

That’s why Cerebras hits ~2,000 tok/s on low-latency inference — not because per-unit compute is faster than the GPU, but because it eliminates the big block of time spent “waiting for data to arrive from a neighboring chip.”

But Cerebras is still “general parallelism” — the compute units inside the chip can run any PyTorch operator, not locked to Transformers. The OpenAI contract signed in early 2026 (over $10B, 750 megawatts) was specifically about this point — not wanting to be locked into a specific architecture, while needing ultra-low latency. This is Cerebras’s biggest difference from the Transformer-specific chips that follow.

Why Cerebras Sits on the “General” Side — process innovation vs architectural specialization

An easy trap is to classify Cerebras as a “specialized ASIC” — because it “doesn’t look like a traditional chip.” But Cerebras’s innovation dimension lies entirely off the “specialization” axis:

| Dimension | NVIDIA GPU | Cerebras WSE-3 | TPU / Groq / d-Matrix… |
|---|---|---|---|
| Architectural innovation | General parallel + heterogeneous accelerators | Wafer-scale integration (physical form) | Cut the general portion, specialize for matmul |
| What it can run | Any operator | Any operator | Matmul + mainstream NN operators |
| Problem solved | Compute uplift | Inter-chip communication bottleneck | Energy waste of general-purpose hardware |

Cerebras’s physical-form innovation is orthogonal to “specialization” — you could absolutely build a “wafer-scale + in-memory compute” chip that eliminates both inter-chip communication and the SRAM-to-compute-unit shuttling within a die. It’s just that the engineering is prohibitively difficult; no one can do two radical things at once. Cerebras picked the wafer-scale side, d-Matrix picked the in-memory side, and each is digging in on its own side first.

So the two gradients on the far left aren’t a “ranking comparison”; they’re “different bottlenecks each have their own attackers” — NVIDIA attacks “extreme single-card,” Cerebras attacks “inter-chip communication.”

Systolic Arrays — Cutting Away the General Part — Google TPU

Starting from this gradient, we enter actual “specialization” — the TPU extracts the Tensor Core from inside the GPU and scales it up, cutting away CUDA Cores and other general computation parts, with all silicon area devoted to matmul.

The Systolic Array Circuit — a “scaled-up” Tensor Core

Google TPU v7 Ironwood chip
Google TPU v7 Ironwood, on display at SC25 — the seventh-generation TPU, and the first to split into two different chips: training-specific (TPU 8t) and inference-specific (TPU 8i) (source · ServeTheHome). TPU 8i pairs 288 GB HBM with 384 MB on-chip SRAM, optimized for MoE and long-context.

The core idea of the systolic array, proposed by H.T. Kung in 1979, is: data flows through a fixed array of processing elements (PEs) like a heartbeat, with each PE doing one multiply-add and passing the result to a neighbor or accumulating locally. This means each operand is fetched from memory once and then reused as it flows across the array; there is almost no per-PE control logic; and nearly all silicon area goes into multiply-accumulate units — exactly the properties you want for dense matrix multiplication.
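
To make that “heartbeat” concrete, here is a toy, cycle-level sketch of a weight-stationary systolic array computing y = W·x. It illustrates the dataflow, not the TPU’s actual microarchitecture; the skewed injection schedule (x[j] enters column j at cycle j) is what makes the operands meet at the right PE at the right time.

```python
import numpy as np

def systolic_matvec(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Toy cycle-level model of a weight-stationary systolic array computing y = W @ x.

    PE[i][j] permanently holds W[i][j]. x[j] is injected at the top of column j at
    cycle j and marches down one row per cycle; partial sums march right one column
    per cycle, so the operands for y[i] meet at PE[i][j] exactly at cycle i + j."""
    n = len(x)
    x_reg = np.zeros((n, n))   # activation latched by each PE at the end of the previous cycle
    p_reg = np.zeros((n, n))   # partial sum latched by each PE at the end of the previous cycle
    y = np.zeros(n)
    for t in range(2 * n):
        new_x, new_p = np.zeros((n, n)), np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                act = (x[j] if t == j else 0.0) if i == 0 else x_reg[i - 1, j]  # from above / injected
                psum = 0.0 if j == 0 else p_reg[i, j - 1]                       # from the left
                new_x[i, j] = act
                new_p[i, j] = psum + W[i, j] * act                              # one MAC per PE per cycle
                if j == n - 1 and t == i + n - 1:                               # y[i] exits the right edge
                    y[i] = new_p[i, j]
        x_reg, p_reg = new_x, new_p
    return y

W = np.arange(16, dtype=float).reshape(4, 4)
x = np.array([1.0, 2.0, 3.0, 4.0])
assert np.allclose(systolic_matvec(W, x), W @ x)
```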

But the systolic array has a fatal weakness — it only suits regular matrix multiplies. As soon as computation involves irregular memory access (sparse attention, dynamic shapes, complex control flow), the systolic array runs poorly. That’s why TPUs run Transformer inference smoothly but struggle with state-space models like Mamba.

The seventh-generation Ironwood pushed this “specialization” to the point where Google itself felt the need to split — they released TPU 8t (training-specific) + TPU 8i (inference-specific), two different chips. The TPU 8i pairs 288 GB HBM with 384 MB on-chip SRAM, optimized specifically for MoE and long-context models. This means even Google has acknowledged “training and inference need different hardware” — something unimaginable in the TPU v3 era of 2018.

TPU 8t / 8i Training-Inference Split — first acknowledgment that the two ends need different hardware

Why split? Because the workload shapes for training and inference are fundamentally different: training runs huge batches, is throughput- and compute-bound, and needs backpropagation plus all-reduce-heavy cluster coordination; inference, especially single-stream decoding, runs tiny batches, is latency- and memory-bandwidth-bound, and spends its memory on KV cache for long contexts rather than optimizer state.

Doing both tasks on the same chip leads to severe waste — a training chip doing inference has most of its circuitry idle; an inference chip doing training can’t sustain backpropagation. TPU 8t / 8i is Google’s first physical split of these two workloads — the inference chip is extremely optimized for “single-stream low latency + long-context KV cache,” while the training chip is extremely optimized for “cluster coordination + large-batch throughput.”
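
The contrast can be put in one number: arithmetic intensity, the FLOPs performed per weight byte fetched. The sketch below uses assumed shapes (a batch of 4096 for training-like matmuls, batch 1 for single-stream decode), not vendor data.

```python
# Arithmetic intensity of one weight matrix (d_in x d_out) applied to a batch of B tokens:
#   FLOPs        = 2 * B * d_in * d_out
#   weight bytes = d_in * d_out * bytes_per_param
# so FLOPs per weight byte = 2 * B / bytes_per_param, independent of the layer size.
def flops_per_weight_byte(batch: int, bytes_per_param: float = 2.0) -> float:
    return 2 * batch / bytes_per_param

print(flops_per_weight_byte(4096))  # training / prefill-like batch: ~4096 FLOPs per byte -> compute-bound
print(flops_per_weight_byte(1))     # single-stream decode: ~1 FLOP per byte -> memory-bandwidth-bound
```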

This has industry-level signal value: specialization has now reached the granularity of “the same operator needs different hardware under different workloads.” We’ll see later that Groq, d-Matrix, and Etched all only do inference — training simply doesn’t show up on the specialization spectrum, because algorithms are still evolving, and no one dares to tape out for a specific training flow.

Same-Path Players — AWS Trainium · Huawei Ascend · Meta MTIA

Google isn’t alone on the systolic array path — almost every cloud provider has gone this route: AWS Trainium serves Amazon’s own-cloud training and inference; Huawei Ascend is the de facto domestic alternative inside China; Meta MTIA started with recommendation and ranking inference and is expanding toward LLM serving inside Meta’s own data centers.

Put these together and an industry-level pattern emerges — every company with “its own cloud + sufficient model scale” is doing a systolic array ASIC. The reason is simple: bypass NVIDIA’s high margins (40-70%) and internalize that profit. The moat of this path isn’t technology — systolic arrays aren’t that mysterious — it’s “own-cloud internal shipment volume sufficient to amortize tape-out cost.” That’s why independent systolic-array companies have a hard time surviving, but cloud providers doing this make money.

Compile-Time Frozen Scheduling — Groq LPU

One gradient to the right, we arrive at the Groq LPU (Language Processing Unit) — on top of the systolic array’s “cut away the general part,” it cuts away another thing: runtime scheduling.

Deterministic Dataflow — the price of no dynamic scheduling

Groq LPU card
Groq LPU (Language Processing Unit) — founder Jonathan Ross was a co-author of the original Google TPU and felt that the TPU wasn’t extreme enough. The LPU hardcodes the dataflow path entirely at compile time, with no dynamic scheduling, no cache misses, and no branch mispredictions at runtime (source · ServeTheHome).

Groq founder Jonathan Ross was a co-author of the original Google TPU. He felt that the TPU wasn’t extreme enough — although the TPU cut away CUDA Cores, it retained the traditional CPU model of “execute by instruction,” with runtime complexity from warp scheduling, cache hierarchy, branch handling, etc. Ross’s insight: for neural network inference, all of that “dynamic behavior” is waste.

The Groq LPU’s core architecture is called the “statically scheduled Tensor Streaming Processor” (TSP): the compiler decides, at compile time, which functional unit does what on every clock cycle and when every piece of data moves; there is no cache hierarchy, no branch prediction, and no runtime scheduler, just a fixed, fully deterministic schedule that the hardware plays back.

This “determinism” yields three direct benefits: latency is exactly predictable (no cache misses or scheduling jitter, hence no tail latency); the silicon a GPU spends on caches and schedulers goes into compute and SRAM instead; and because every chip’s timing is known to the cycle, hundreds of LPUs can be synchronized into one deterministic pipeline without handshaking overhead.

The price is: long compile time, expensive model switches. Changing a model is like redesigning the loom. But Groq solved this elegantly — treat compilation as a one-time investment, then expose compiled models as a “service” (the GroqCloud API). Developers pay per token and don’t have to run compilation themselves.

All-On-Chip SRAM — the no-HBM design philosophy

The Groq LPU has another counterintuitive design choice: no HBM at all.

Each LPU has 230 MB of SRAM, with no HBM and no DRAM. SRAM capacity is small, but bandwidth is several times HBM’s, and energy consumption is an order of magnitude lower. The problem is 230 MB per chip can’t hold a large model — a Llama 70B requires 140 GB (FP16). Groq’s answer: form clusters. Hundreds of LPUs use proprietary high-speed interconnect to form a cluster, each LPU holding a slice of the model.
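
A quick back-of-envelope on why it takes “hundreds” of chips, counting weight storage only; activations, KV cache, and any redundancy would push the count higher.

```python
# Lower bound on LPU count for Llama 70B from weight storage alone
# (ignores activations, KV cache and any duplication, so the real number is higher).
SRAM_PER_LPU_BYTES = 230e6
params = 70e9
for bytes_per_param, label in [(2.0, "FP16"), (1.0, "FP8/INT8")]:
    weight_bytes = params * bytes_per_param
    print(f"{label}: {weight_bytes / 1e9:.0f} GB of weights -> >= {weight_bytes / SRAM_PER_LPU_BYTES:.0f} LPUs")
# FP16: 140 GB -> >= 609 LPUs ; FP8/INT8: 70 GB -> >= 305 LPUs  (hence "hundreds of LPUs")
```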

This brings us back to the same “inter-chip communication” problem Cerebras solves — Groq’s answer is “self-developed ultra-low-latency interconnect,” not as general as NVLink, but enough to make hundreds of LPUs work as a single unit.

Not depending on HBM is a hidden advantage of Groq. Against the backdrop of HBM shortages in 2024-2026 (Samsung / SK Hynix / Micron capacity all locked up by NVIDIA), no HBM = no queue. Groq can scale capacity independently, one of the hardware reasons it could ramp quickly in 2024-2025.

2 Million Developers + NVIDIA’s $20B Endorsement — the strongest commercial validation

Groq’s commercial progress is the fastest on the spectrum: roughly 2 million developers on the GroqCloud API, paying per token with no compilation work of their own, and a valuation of about $6.9B.

But the most significant event was the NVIDIA-Groq deal in early 2026: NVIDIA struck an approximately $20B agreement with Groq, licensing Groq’s AI inference technology and absorbing several Groq executives into NVIDIA. The implication is crystal clear — NVIDIA itself wants what Groq’s path offers but doesn’t want to fight for it, so it chose “acquire the technology + recruit the people” instead of “compete head-on.”

This is the one time NVIDIA has formally acknowledged the value of a non-GPU path across the entire spectrum. It means the “compile-time frozen + all-on-chip SRAM” path that Groq represents has an advantage on low-latency inference that no GPU modification can match — a judgment NVIDIA cast its $20B vote for.

Storage and Compute Physically Fused — d-Matrix digital in-memory computing

The middle gradient of the spectrum looks “counterintuitive” — let storage units do compute themselves, or physically fuse compute units with storage units. This is the digital in-memory computing (DIMC) path that d-Matrix takes.

The Memory Wall — moving data costs 10× more than computing

To understand why anyone would do this, you need to understand a fundamental waste in traditional chips — the “memory wall” problem:

Von Neumann vs Digital In-Memory Compute (DIMC): where weights “live” determines the dominant energy cost. GPU path: fetch weights from HBM → compute → write back → reread for the next layer, with move energy ≈ 10× compute energy over centimeter-scale, off-chip distances, rereading all weights per token. DIMC path: weights resident in SRAM+MAC tiles less than a millimeter from the arithmetic, inputs broadcast, outputs read directly, data shuttling nearly eliminated.
Left · the “Von Neumann” structure of traditional GPUs — weights live in HBM, and each token goes through “fetch weights → compute → write back,” with move-energy ≈ 10× compute-energy. Right · DIMC physically fuses storage units and multiply-adders within small chiplets, weights permanently resident, with only inputs broadcast at inference time — data shuttling nearly eliminated.

A GPU running Llama 70B inference, for every token generated, in theory has to read all 70 billion weight parameters from HBM through the compute units once. The energy of shuttling that data is more than 10× the multiply-add operation itself. Most of the H100’s silicon budget is spent solving “how to move weight data to compute units faster” — HBM3, L2 cache, TMA async transfer, layer upon layer of optimization — but it’s all “mitigation,” not “cure.”

d-Matrix’s core reframe: since every inference has to read the weights, why not just compute them where they’re stored?
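
A rough way to see the reframe in numbers. Both energy coefficients below are assumed, order-of-magnitude ballpark figures for illustration, not d-Matrix or NVIDIA data.

```python
# Order-of-magnitude energy split per generated token for Llama 70B at FP8
# (both energy coefficients below are assumed ballpark figures, not vendor data).
PJ_PER_HBM_BYTE = 30.0    # assumed energy to move one byte from HBM into the compute units
PJ_PER_MAC      = 1.0     # assumed energy of one low-precision multiply-accumulate
params = 70e9

move_joules = params * 1 * PJ_PER_HBM_BYTE * 1e-12   # reread every 1-byte weight once per token
mac_joules  = params * PJ_PER_MAC * 1e-12            # one MAC per weight per token
print(f"movement ~ {move_joules:.1f} J/token, compute ~ {mac_joules:.2f} J/token, "
      f"ratio ~ {move_joules / mac_joules:.0f}x")    # movement dominates by tens of x
```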

Stuffing Multipliers Next to SRAM — a chiplet grid

d-Matrix Corsair inference accelerator card
d-Matrix Corsair — a full-length, full-height PCIe Gen5 inference accelerator card. Based on 6nm Nighthawk and Jayhawk II chiplets, with each Nighthawk integrating 4 neural cores and a RISC-V CPU via chiplet packaging (source · d-matrix.ai/product). A single card hits 30,000 tokens/s on Llama 70B with 2 ms per-token latency.

What d-Matrix concretely does is digital in-memory computing (DIMC) — rather than letting storage units do compute themselves, it places the compute unit tightly adjacent to the storage array. The two are interleaved physically within a chiplet, weights permanently resident, no repeated shuttling needed.

Versus “analog in-memory computing” (the path that Mythic and EnCharge AI take) — analog schemes attempt to encode weights directly as conductance values of devices like ReRAM, using Ohm’s law + Kirchhoff’s current law to physically perform matrix multiply as current flows through the array. Energy efficiency could theoretically be an order of magnitude higher, but the engineering difficulties are enormous (ADCs are too expensive, write precision is poor, temperature drift, finite lifespan).

d-Matrix actually tried the analog path early on (the Nighthawk concept chip in 2020), but quickly gave up — “putting an ADC on every bitline is too hard.” Ultimately they went with digital IMC (DIMC), sacrificing some of analog’s peak efficiency in exchange for engineering feasibility + controllable precision. This is a very honest call.

Engineering-wise, the d-Matrix Corsair’s concrete design: a full-length, full-height PCIe Gen5 card built from 6 nm Nighthawk and Jayhawk II chiplets, each Nighthawk packaging four neural cores and a RISC-V control CPU; the DIMC units interleave SRAM with multiply-accumulate logic so weights stay resident, backed by 2 GB of high-performance on-card memory plus 256 GB of capacity memory for layers waiting their turn.

d-Matrix’s next-generation roadmap is even more aggressive — partnering with Alchip to build the world’s first 3D-stacked DRAM solution, 3DIMC, debuting on Corsair’s successor Raptor inference accelerator, claimed to be 10× faster than HBM4 solutions. This pushes “in-memory” thinking further from SRAM to DRAM.

Intra-Layer Parallel + Inter-Layer Pipelined — fire simultaneously · advance in pipeline

DIMC and GPUs have a subtle but key difference in “how computation happens” — intra-layer fires simultaneously, inter-layer advances in pipeline:

Intra-layer parallel + inter-layer pipelined: each layer completes in roughly one clock (all PEs fire simultaneously), successive layers form a pipeline (L1 → L2 → L3 …), and steady-state throughput is one token per clock once the k-stage pipeline is full.
All PEs inside a layer fire simultaneously (weights are already resident, multiply-add fires the moment inputs are broadcast in), completing one layer in one clock. Inter-layer, data flows in a pipeline — at T1 token x is computing in L1, at T2 h₁ is computing in L2 while x’ starts in L1. In steady state, one token is produced per clock — this is the fundamental reason d-Matrix can hit “2 ms per token.”

This is fundamentally different from Google TPU’s systolic array:

The “flow” in a systolic array is fine-grained — data really is moving step by step between cells. In-memory computing is coarser-grained — data flows between layers, but within a layer it’s instantaneous. Inter-layer is like a conveyor belt, one layer after the next; intra-layer is like an explosion, the entire layer completing in an instant.
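
A toy clock-by-clock model of that pipeline is below. One caveat: the overlapping tokens should be read as independent requests; within a single autoregressive stream, token t+1 can’t enter layer 1 until token t has produced its output.

```python
# Toy clock-by-clock model of "intra-layer instantaneous, inter-layer pipelined":
# each layer is one pipeline stage; after k fill clocks, one token completes per clock.
def pipeline_trace(num_layers: int, num_tokens: int):
    stages = [None] * num_layers            # stages[i] = token currently sitting in layer i
    finished = []
    clock, next_token = 0, 0
    while len(finished) < num_tokens:
        if stages[-1] is not None:          # the last layer emits a completed token
            finished.append((clock, stages[-1]))
        for i in range(num_layers - 1, 0, -1):   # everything else moves one layer forward
            stages[i] = stages[i - 1]
        stages[0] = next_token if next_token < num_tokens else None
        next_token += 1
        clock += 1
    return finished

print(pipeline_trace(num_layers=80, num_tokens=5))
# completions at clocks 80, 81, 82, 83, 84 -> 1 token/clock once the 80-stage pipe is full
```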

Time Multiplexing When Model > Hardware — slicing the model

DIMC has an implicit limit: the chip’s hardware capacity determines how much of the model can fit. A single Corsair card has 2 GB high-performance memory + 256 GB capacity memory, which can’t hold all layers of a Llama 70B resident simultaneously.

The practical engineering approach is “time multiplexing”: load one group of layers’ weights into the resident memory, stream a whole batch of tokens through them, then swap in the next group of layers, with the capacity memory holding whatever isn’t currently resident.

This sounds like a regression back to GPU behavior — the GPU also runs infinitely many layers with finite compute units. But the key difference is shuttle frequency:

DIMC’s core advantage isn’t “completely eliminate weight shuttling,” but reduce shuttle frequency from “once per token” to “once per batch” or less. This is how d-Matrix can deliver several-to-10× GPU energy advantage while preserving the flexibility of “any Transformer.”
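
The amortization effect is easy to quantify. A minimal sketch, with the swap interval as a free parameter (the actual layer-group sizes and batch schedule are implementation details that aren’t published at this level).

```python
# Weight traffic per token as a function of swap frequency (toy model; the actual
# layer-group sizes and batch schedule are d-Matrix implementation details).
def weight_gb_moved_per_token(model_gb: float, tokens_between_swaps: int) -> float:
    # Reloading the full weight set once every N tokens amortizes to 1/N per token.
    return model_gb / tokens_between_swaps

model_gb = 70.0  # Llama 70B at one byte per parameter
print(weight_gb_moved_per_token(model_gb, 1))    # GPU-style "once per token": 70 GB/token
print(weight_gb_moved_per_token(model_gb, 256))  # swap once per 256-token batch: ~0.27 GB/token
```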

Commercially, d-Matrix is the most mature in the middle of this spectrum: Corsair is shipping, the roughly $275M Series C was oversubscribed, and the company is valued around $2B.

The Transformer Operator Graph Etched Into Silicon — Etched Sohu

Pushing one more gradient to the right of the spectrum, we arrive at an even more radical design — burn the entire Transformer operator graph into a hardwired circuit. Etched Sohu represents this path.

Etching the Operator Graph Into Dedicated Circuits — cutting away all non-Transformer hardware

Etched Sohu Transformer ASIC render
Etched Sohu render (the chip isn’t yet in volume production; only official renders exist) — TSMC 4nm, 144 GB HBM3e, an ASIC built specifically for Transformers. An 8-card Sohu server hits over 500,000 tokens/s on Llama 70B (source · Jon Peddie Research). Cannot run CNNs, RNNs, state-space models — only Transformer-class.

Etched’s core reframe differs from every previous path — since the Transformer has won, why preserve hardware for “other architectures that might appear in the future”?

Concrete approach: hardwire the Transformer operator graph (attention, MLP, LayerNorm, softmax) as fixed-function pipelines on TSMC 4nm, keep 144 GB of HBM3e for weights and KV cache, and remove everything a Transformer never calls, from general-purpose cores to graphics and video engines to the scheduling machinery that serves them.

This “trimming” is highly aggressive. By removing all hardware required by non-Transformer neural networks, Etched fits more Transformer-specific compute into the same silicon — on the same TSMC 4nm process, the Sohu’s “effective Transformer compute” is an order of magnitude higher than the H100.

500K tok/s on Llama 70B with 8 Cards — 20× an H100 server

Etched’s performance numbers are stunning: an 8-card Sohu server claims more than 500,000 tok/s on Llama 70B, roughly 20× an 8-card H100 server (~23,000 tok/s), with about 60,000 tok/s available to a single stream.

The number is large enough to invite skepticism, but the underlying logic holds up: roughly 70-80% of an H100’s silicon area is spent on things “not serving Transformer inference” (the general portion of CUDA Cores, RT Cores, graphics-related circuits, various scheduling logic). Sohu cuts all of that, lifting the share of silicon serving pure Transformer inference from ~20% to near 100% — 5× more compute is reasonable, plus the energy efficiency advantage of specialization makes a combined 10-20× not absurd.
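
Spelling out that arithmetic: the 20% / 100% area shares are the article’s own rough estimates, and the extra 2-4× specialization factor is an assumption layered on top.

```python
# The paragraph's arithmetic, spelled out. The 20% / 100% area shares are the
# article's rough estimates; the extra 2-4x specialization factor is an assumption.
h100_transformer_area_share = 0.20   # ~20% of H100 silicon serves Transformer inference
sohu_transformer_area_share = 1.00   # ~100% on a Transformer-only ASIC
area_gain = sohu_transformer_area_share / h100_transformer_area_share
for specialization_factor in (2.0, 4.0):
    print(f"{area_gain:.0f}x area x {specialization_factor:.0f}x specialization "
          f"= {area_gain * specialization_factor:.0f}x combined")
# 5x area x 2x = 10x ... 5x area x 4x = 20x -> the 10-20x claim
```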

Risk — if Transformer is replaced, the chip goes to zero

Sohu’s risk is crystal clear: if the Transformer is replaced by a fundamentally different architecture within 5-7 years, every Sohu chip instantly goes to zero.

Potential threats: state-space models like Mamba, xLSTM-style recurrent revivals, or any yet-unnamed post-Transformer architecture whose operator graph Sohu’s hardwired pipeline can’t express.

But Etched’s own bet is that the Transformer has won too completely. GPT/Claude/Gemini/Qwen/DeepSeek/Llama are all Transformer variants, hundreds of billions of training investment is all bet on this — the “transition cost” of changing architectures is too high for the industry to push it proactively. This is a high-beta bet: double the valuation if right, zero if wrong.

Commercially, Etched is valued at around $800M, shipping to early customers from 2024. Its lower valuation than d-Matrix isn’t because the tech is worse, but because deeper specialization → more risk → larger market discount. This is the general pattern we’ll return to — valuations correspond precisely to position on the spectrum.

Physically Casting Weights Into Silicon — Taalas HC1

The rightmost gradient of the spectrum is the true “nuclear option” — not only is the architecture etched in stone, the model’s weights are also physically cast into the silicon. This is the path Taalas takes.

The Structured ASIC Path — change 2 mask layers · tape out in 2 months

Taalas HC1 Hardcore Model chip
Taalas HC1 — Canadian company emerged from stealth in February 2026; the first product HC1 physically casts the weights of Llama 3.1 8B into silicon. TSMC 6nm, 815 mm², ~53 billion transistors, ~250 W power (source · taalas.com). A single chip can only run this one model; changing models requires a new tape-out.

Taalas (Canadian company, emerged from stealth in February 2026 with $169M raised), core approach: build a structured-ASIC “base platform” containing all the circuitry every model needs, then cast one specific model, architecture and weights included, into the chip by customizing only the final interconnect layers; when the model changes, only those layers are re-spun.

“Change only 2 mask layers” is Taalas’s key engineering breakthrough. A chip is typically built from dozens of mask layers, and a normal tape-out redoes all of them. Taalas builds all reusable circuits as a “base platform,” and only changes 2 interconnect mask layers for a specific model — this brings tape-out cost and time down by ~10×, going from receiving a new model to producing hardware in just 2 months.

The HC1 first-product specs: TSMC 6 nm, 815 mm², roughly 53 billion transistors, around 250 W, with the weights of Llama 3.1 8B physically cast into the die.

Single Chip, Single Model, 17,000 tok/s — 28× B200

Taalas’s performance on the “single model” benchmark is staggering: roughly 17,000 tok/s for a single user on Llama 3.1 8B from one ~250 W chip, which the company positions as about 28× a B200 on the same model.

This is the most extreme single-point performance on the entire spectrum. But the cost is clear — this single chip can only run Llama 3.1 8B, not Llama 3.2, not Qwen 2.5, and certainly not any non-8B model.

Hedging Model Iteration With “Fast Tape-Out” — 30 tape-outs supports R1-671B

Taalas’s business model is unique — using “fast tape-out” to hedge “model iteration”: when a customer’s model is updated, Taalas re-spins the two interconnect mask layers and ships fresh silicon in about two months, so the chip is treated less like a platform and more like a consumable pinned to a model version.

Taalas itself claims 30 tape-outs can support a large model like DeepSeek R1-671B (since 671B is too large, it has to be distributed across many chips, each holding a small slice of weights). This is essentially an “anti-Moore’s law” product strategy — making money not from process advancement but from the engineering capability to “quickly adapt to model changes.”

The biggest risks on Taalas’s path: models that iterate faster than even a 2-month re-spin cycle, per-model demand too small to amortize each tape-out, and the fact that a chip whose model has been retired has zero residual value.

But Taalas’s existence itself proves — the industry is willing to put physical capital into the most extreme “specialization” direction. Even if most customers ultimately don’t choose this path, just the existence of the option puts back-pressure on every more-general scheme — d-Matrix and Etched must demonstrate their flexibility premium is worth the several-times speed gap.

The Photonic Path Splits — compute paused · interconnect explodes

That covers the seven main gradients. But one parallel path deserves a dedicated discussion — photonic chips. It’s not on the main spectrum, but it has important intersections with it: photonics is blocked at “compute” by the diffraction limit, but has scaled at “interconnect.”

Photonic Computing’s Physical Advantages — multiply-add natural · energy efficient · distance independent

Photonic matmul · MZI grid: light interference = multiplication + addition, with natural parallelism. Modulators encode the inputs as light intensity, a Reck-style triangular grid of Mach-Zehnder interferometers applies the weights (an N×N unitary decomposes into N(N−1)/2 MZIs, each θ set by a thermal or electro-optic phase bias), and photodetectors sum the result; WDM carries up to 16 wavelengths in parallel, while the diffraction limit keeps each MZI at 100 μm or more.
The physical principle of photonic matmul — input signals become beams of varying intensity via modulators, then pass through a triangular MZI (Mach-Zehnder interferometer) grid, where “addition” is performed at the detectors via intensity superposition. A matmul costs almost no energy — energy is spent mostly on the electro-optic / opto-electronic conversions at both ends. Reck proved in 1994 that any N×N unitary matrix can be decomposed into N(N-1)/2 2×2 rotation matrices, corresponding exactly to MZI counts.
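
A small NumPy sketch of the idea: compose a triangular mesh of 2×2 MZI unitaries into one N×N matrix and apply it to an input vector. The layout and parameterization below are a simplified toy, not the exact Reck construction, but the MZI count and the “mesh acts as a matrix” behavior carry over.

```python
import numpy as np

def mzi(theta: float, phi: float) -> np.ndarray:
    """2x2 unitary of a single Mach-Zehnder interferometer (one common convention)."""
    return np.array([[np.exp(1j * phi) * np.cos(theta), -np.sin(theta)],
                     [np.exp(1j * phi) * np.sin(theta),  np.cos(theta)]])

def mesh_matrix(n: int, params) -> np.ndarray:
    """Compose a triangular mesh of n(n-1)/2 MZIs, each mixing adjacent modes (i, i+1),
    into one n x n unitary -- the matrix the mesh applies to the incoming light."""
    U = np.eye(n, dtype=complex)
    k = 0
    for col in range(n - 1):                  # triangular layout: n-1, n-2, ..., 1 MZIs per column
        for i in range(n - 1 - col):
            theta, phi = params[k]
            k += 1
            T = np.eye(n, dtype=complex)
            T[i:i + 2, i:i + 2] = mzi(theta, phi)
            U = T @ U
    return U

rng = np.random.default_rng(0)
n = 4
params = rng.uniform(0, 2 * np.pi, size=(n * (n - 1) // 2, 2))   # one (theta, phi) per MZI
U = mesh_matrix(n, params)
x = rng.normal(size=n)                         # input amplitudes set by the modulators
assert np.allclose(U.conj().T @ U, np.eye(n), atol=1e-12)   # the mesh is unitary by construction
print(U @ x)                                   # the photodetectors read out y = U @ x
```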

The core insight of photonic computing is simple — matrix multiplication is essentially “multiply” and “add,” and light naturally does both: a modulator multiplies (output intensity equals input intensity times transmittance), and a photodetector adds (its photocurrent is the sum of the light intensities landing on it).

This sounds wonderful. The truly attractive physical advantages of photonic computing: the multiply-add itself costs almost no energy, with power spent mainly at the electro-optic and opto-electronic conversions at the two ends; signals propagate with essentially no loss over distance; and wavelength-division multiplexing lets multiple data streams ride the same mesh in parallel.

The Diffraction Limit — why photonics can never reach nanoscale density

But photonic computing has a fundamental physical bottleneck — the diffraction limit:

Photonic device sizes cannot be smaller than half the wavelength of the light — that’s a physical law, called the diffraction limit. Mainstream data center optical communication uses 1310 nm or 1550 nm infrared light, putting the diffraction limit at roughly 650-780 nm. In practical engineering, after accounting for manufacturing tolerances, loss management, and wavelength drift, photonic devices must be much larger:

| Photonic Device | Typical Size | vs Electronic |
|---|---|---|
| Single optical waveguide (one guide line) | 500 nm wide, 1-2 μm spacing | Single transistor 20-30 nm |
| Microring modulator (MRM) | 5-10 μm diameter | - |
| Mach-Zehnder modulator (MZM) | 100 μm to several mm long | - |
| Single MZI compute unit | Tens to hundreds of μm | Single Tensor Core ~10 μm |

The density gap is ~200-500×. This produces a direct engineering reality — on an H100-sized silicon die (800 mm²), electrons can fit dozens of 1024×1024 matrix multipliers, while photons can fit a maximum matrix multiplier of about 128×128.

Worse — this gap is dictated by physics, and process improvements can’t close it. Even pushing photonic processes from today’s 45 nm / 90 nm down to 3 nm barely improves photonic device density (it isn’t process-limited, it’s wavelength-limited). In principle you could move to X-ray wavelengths of a few nanometers, but photons at those energies would destroy the silicon itself.

Photonic Computing — still in the coprocessor stage

Q.ANT Native Processing Unit
Q.ANT NPU (Native Processing Unit) second generation — German company in Stuttgart, spun out from industrial laser giant Trumpf in 2018, the world’s first commercially shipping fully photonic AI coprocessor. Operating at 30 W vs GPU 700-1000 W, deployed at Germany’s Leibniz supercomputing center and in US-EU data centers (source · Q.ANT press release).

The actual state of photonic computing today — coprocessor, not replacement. Q.ANT is the most demo-worthy representative of this path: a second-generation, fully photonic NPU that ships commercially, draws about 30 W where a GPU draws 700-1,000 W, and is already installed at the Leibniz Supercomputing Centre and in US and European data centers.

But photonic computing’s fundamental shortcomings remain unsolved: the diffraction limit caps compute density a couple of orders of magnitude below electronics; much of the energy budget goes to the electro-optic / opto-electronic conversions and ADCs at the boundary; and nonlinear operations (activations, softmax, control flow) still have to run in electronics.

So the best near-term (within 5 years) positioning for photonic computing is coprocessor — let the CPU/GPU do what it’s good at (control, scheduling, nonlinear ops), and offload matmul to photonics. This is Q.ANT’s actual deployment model.

Photonic Interconnect — GPU cluster scaling already shipping

Lightmatter Passage M1000 photonic interconnect superchip
Lightmatter Passage M1000 — a 4,000+ mm² active photonic interposer in a 3D package containing 34 integrated chiplets, 1,024 SerDes lanes, 256 fibers, delivering up to 114 Tbps total bandwidth (source · ServeTheHome · Hot Chips 2025). This isn’t using light to compute, it’s using light to replace copper for chip-to-chip interconnect.

The other fate of photonics is completely different — photonic interconnect has scaled to commercial use. It solves not the compute bottleneck but the “interconnect wall” — GPU clusters growing larger while electrical interconnect bandwidth can’t keep up.

Key advantages of photonic interconnect: 4-5 pJ/bit versus 7-15 pJ/bit for the best copper SerDes, far higher bandwidth density thanks to wavelength-division multiplexing, and almost no attenuation with distance.

Why has the AI era pushed photonic interconnect into production? The data is unambiguous — model parameters have grown 240× in 3 years, cluster scale 10×, but electrical interconnect bandwidth only 2×. The gap keeps widening, and copper at 224 Gbps per lane is near its physical limit (going faster brings severe crosstalk). Light has to step in.

Commercial progress is fast: Lightmatter’s Passage M1000 has demonstrated 114 Tbps of package-level bandwidth, co-packaged optics is appearing on switch and accelerator roadmaps across the industry, and NVIDIA has announced silicon-photonics-based networking for its own clusters.

When NVIDIA itself enters photonic interconnect, this path is no longer an “alternative” — it’s the necessary path for AI cluster scaling.

Energy Comparison — 4-5 pJ/bit vs 7-15 pJ/bit

Putting the key metrics of photonic and electrical interconnect side by side:

| Dimension | Electrical (copper SerDes) | Photonic (silicon photonics CPO) |
|---|---|---|
| Current state-of-the-art per-bit energy | 7-15 pJ/bit | 4-5 pJ/bit |
| Best lab record | 1.41 pJ/bit (224 Gb/s, 2022) | 0.7 pJ/bit (112 GBaud, 2023) |
| Bandwidth ceiling | ~224 Gbps/lane | Dozens of wavelengths multiplexed |
| Distance attenuation | Severe | Nearly none |
| Process maturity | Extremely mature | GF 45 nm / 90 nm in volume |
| Per-die integration ceiling | Tens of Tbps | >100 Tbps (per package) |

The energy advantage of photonic interconnect isn’t “crushing” (only 2-3×); its real edge is bandwidth density + distance independence.
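
What that per-bit gap means in watts, using the 114 Tbps package figure quoted above. This is illustrative only; the pJ/bit values are the table’s round numbers.

```python
# What the per-bit numbers mean in watts at package scale. Illustrative only: the
# 114 Tbps figure is the Passage M1000 number quoted above, the pJ/bit values are
# the table's round numbers.
def interconnect_watts(tbps: float, pj_per_bit: float) -> float:
    return tbps * 1e12 * pj_per_bit * 1e-12   # (bits/s) * (J/bit) = W

for label, pj in [("copper SerDes @ 10 pJ/bit", 10.0), ("silicon photonics @ 4.5 pJ/bit", 4.5)]:
    print(f"{label}: {interconnect_watts(114.0, pj):.0f} W to move 114 Tbps")
# ~1140 W vs ~513 W: a real saving, but the decisive edge is bandwidth density and reach.
```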

Synthesis — the flexibility-vs-efficiency tradeoff spectrum

Having walked the seven gradients + photonic branch, we can put all the data together for an overall synthesis.

Speed Across the Seven Gradients — measured Llama 70B / 8B data

Putting the measured throughput data for every product on the spectrum together:

| Path | Representative Product | Status | Llama 70B (8 cards) | Llama 70B (single stream) | Physical Implementation |
|---|---|---|---|---|---|
| General GPU | NVIDIA H100 | Scaled | ~23,000 tok/s | ~50 tok/s | CUDA + Tensor Core |
| General GPU | NVIDIA H200 | Scaled | ~31,712 tok/s | ~70 tok/s | Same + HBM3e |
| General GPU | NVIDIA B200 | Scaled | ~45,000 tok/s | ~120 tok/s | Blackwell |
| Wafer-scale | Cerebras WSE-3 | Commercial | - | ~2,000 tok/s | Entire wafer |
| Static dataflow | Groq LPU | Scaled | - | ~600 tok/s | Compile-time frozen |
| Digital in-memory | d-Matrix Corsair | Shipping | - | ~500 tok/s | Digital in-memory compute |
| Transformer ASIC | Etched Sohu | Early customers | >500,000 tok/s | ~60,000 tok/s | Operator graph etched |
| Model etched | Taalas HC1 | Just released | - | 17,000 tok/s (Llama 8B) | Weights cast into silicon |

Note: these numbers come from vendor public materials and third-party reports; measurement conditions aren’t fully comparable across them — but the order-of-magnitude relationships are clear. Each step right on the spectrum brings a 3-10× speed gain, with a cumulative range approaching 1000×.

Valuation Maps Precisely to Spectrum Position — more specialized = lower valuation

Stacking the valuations of every company on the spectrum produces a remarkably clean gradient:

| Company | Path | Valuation | Notes |
|---|---|---|---|
| NVIDIA | GPU (most general) | $4T+ | Far left of the spectrum |
| Cerebras | Wafer-scale | ~$49B (2026/5 IPO valuation) | Scaled |
| Groq | Static dataflow | ~$6.9B | $20B NVIDIA endorsement |
| d-Matrix | Digital in-memory | ~$2B (Series C $275M) | Series C oversubscribed |
| Etched | Transformer ASIC | ~$800M | Shipping to early customers |
| Taalas | Model etched | <$500M (estimated) | Just released, $169M raised |

This isn’t coincidence — it’s the market’s precise pricing of “specialization risk.” The more specialized the scheme, the more concern about hardware going to zero if model architectures change in the future — so the larger the discount the market applies. NVIDIA is worth $4T partly because of its “generality premium” — no matter how AI evolves, GPUs never go to zero.

Read the other way, it also makes sense: Taalas’s low valuation isn’t because the tech is bad, but because the downside risk of “betting on one model” is inherently large.

The Steady-State Landscape Over the Next 3-5 Years — training / inference / interconnect, three markets

The entire AI inference chip landscape over the next 3-5 years roughly settles like this:

Training market (70-30 split) — basically settled: NVIDIA keeps roughly 70% of training, with the remaining roughly 30% going to own-cloud systolic-array ASICs (Google TPU, AWS Trainium and their peers) used mostly for internal and anchor-customer workloads.

Inference market (splitting into several sub-tracks) — this is the truly diverse part: latency-obsessed single-stream serving goes to Cerebras and Groq, cost- and energy-sensitive Transformer serving to d-Matrix-style in-memory designs, locked-in high-volume Transformer throughput to Etched, fixed-model appliances to Taalas-style etched silicon, and everything long-tail or fast-changing stays on GPUs.

Interconnect market (newly emerging sub-category) — photonic interconnect becomes a “must”: co-packaged optics and photonic interposers (Lightmatter Passage, NVIDIA’s own silicon photonics) become standard items on the cluster bill of materials rather than an exotic option.

This is an unusual industry window — no one displaces anyone, instead different solutions eat different sub-segments. This is completely different from the past decade of “GPU rules all,” and it’s what makes this wave of AI inference chip startups genuinely interesting.

The last variable worth watching — will large-model architectures change? If the Transformer is replaced by Mamba, xLSTM, or some entirely new architecture within 5-7 years, schemes like Etched and Taalas that “bet on one architecture” go to zero; d-Matrix and Groq take a hit but can still survive; Cerebras, TPU, and NVIDIA are almost unaffected. The valuation gradient’s core logic is exactly the market’s pricing of this risk.

If you believe “the Transformer has at least 5 years left” — Etched / Taalas bets are good trades. If you think “a new architecture is inevitable within 3 years” — the money should be on the left side of the spectrum. This is the most important decision framework the spectrum offers investors.

References — company materials · industry reports · technical papers

Company Websites and Official Announcements

Industry Reports and News

Technical Papers and Blogs