Cracks in NVIDIA’s Moat
Compute is high-certainty demand with extremely clear metrics. Unlike consumer products from Apple or Tesla, where decisions involve many factors (brand, sentiment, ecosystem stickiness) and customers can be "irrational," large B2B customers are extremely rational and cold-blooded: the moment a better alternative exists, they switch without hesitation, or at minimum proactively cultivate a second supplier to maintain bargaining leverage. NVIDIA's monopoly is not indestructible; it is merely early enough, large enough, and ecosystem-deep enough. The lane is clear, challengers will keep appearing, and the moment NVIDIA slows down it will be caught. This article addresses four questions: (1) why challengers suddenly appeared → (2) where the impact will land → (3) what else accelerates the process → (4) when it happens, and how to watch it.
A Different Perspective — why NVDA isn’t invincible

Image: Wikimedia Commons / NVIDIA · Public Domain.
NVIDIA’s moat is real — CUDA, NVLink, HBM lock-up, CoWoS capacity, 25 years of ecosystem accumulation — but the real bedrock of every B2B story is customer rationality. This is the biggest difference versus Apple or Tesla. Apple sells 200M iPhones a year to consumers, where emotion and brand drive a large fraction of decisions; the same is true for Tesla. But GPUs are sold to OpenAI / Anthropic / Microsoft / Google / Meta / ByteDance, and those buyers are the world’s most rational, cold-blooded engineering teams: they look at tokens/dollar, tokens/watt, and total cost per training run, not the logo and not the brand.
This means NVIDIA’s monopoly has a built-in instability: the moment it slows down, large customers will immediately cultivate a second supplier. Look at what’s already happened — Google’s in-house TPU (now sold externally), Amazon’s in-house Trainium, Microsoft’s in-house Maia, Meta’s in-house MTIA, and OpenAI committing $20B to Cerebras + co-developing an inference chip with Broadcom. Every one of them is actively “growing a second supplier.” This isn’t conspiracy; it’s the ABCs of B2B procurement.
| Metric | Value | Note |
|---|---|---|
| FY26 data-center revenue | $215.9B | +89% YoY · Q4 $62.3B · ~89% of total revenue · four CSPs combined 61% of quarterly revenue (NVIDIA Newsroom) |
| Non-GAAP gross margin | ~75% | Q4 75% · guidance mid-range 75% · B200 per-card margin ~84% (ASP $35–40K · BOM cost ~$6.4K) |
| AI accelerator market share | 80–90% | Significant variance across sources · 2024 peak 87% · 2026 expected ~75% |
| Cost per training run | $50M–$500M | Order of magnitude for very large models · error cost is extremely high → no one wants to bet on immature hardware · this is why the training market is a fortress |
NVIDIA’s monopoly is not invincible: the moat is real, but it is just the sum of three things — “early enough + large enough + far-reaching ecosystem” — not a structurally insurmountable position. Once it slows down, challengers will swarm in; customer rationality reverse-accelerates this process.
— The question here is not “whether,” but “when” and “how to watch”
Architectural Convergence — why challengers suddenly appeared

Figure · architecture timeline: 1997 LSTM (RNN / sequence) · 2012 AlexNet (CNN starting point) · 2014 GAN (generative adversarial) · 2015 ResNet (residual connections) · 2015 U-Net (image segmentation) · 2018 BERT (encoder-only)
Variants are only in attention grouping: MHA → MQA → GQA → MLA
Architecture diagrams from Wikimedia Commons (AlexNet · ResNet residual block · BERT · LSTM · U-Net · GAN · Full GPT Architecture), CC BY-SA 4.0 / CC0 Public Domain.
Challengers didn’t appear from nowhere; there is a frequently overlooked structural change behind them: model architecture is converging. The CNN era was a profusion of competing designs — AlexNet / VGG / ResNet / Inception / MobileNet / EfficientNet — with each task having its own “best architecture”; hardware didn’t know which to optimize for, and general-purpose GPUs swept the field. The Transformer era is the opposite: from the original 2017 paper to 2026 frontier models, the backbone has barely changed — Pre-LayerNorm, RoPE, RMSNorm, SwiGLU, residual connections, and the KV cache are all techniques that are years old.
Today’s “innovation” is essentially limited to a few attention-grouping variants (MHA → MQA → GQA → MLA), with GQA now the de facto default for open-source large models; Mamba / SSM and other challengers have ultimately appeared in “hybrid architecture” form rather than as replacements for Transformer — a NeurIPS 2025 paper even proves that Mamba and Transformer belong to the same complexity class and are essentially equivalent. This is unprecedented stability in ML history. Its significance parallels x86’s win — once architecture converges, the target of hardware optimization becomes certain for the first time.
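The practical payoff of the attention-grouping variants is KV-cache size: fewer KV heads means a linearly smaller cache. A minimal sketch of that arithmetic (the 70B-class shapes below are illustrative assumptions, not official model specs):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Per-request KV-cache footprint: a K and a V tensor for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative 70B-class shapes: 80 layers, head_dim 128, 32K context, fp16.
cfg = dict(n_layers=80, head_dim=128, seq_len=32_768)
for name, kv_heads in (("MHA", 64), ("GQA", 8), ("MQA", 1)):
    gib = kv_cache_bytes(n_kv_heads=kv_heads, **cfg) / 2**30
    print(f"{name}: {kv_heads:2d} KV heads -> {gib:6.2f} GiB per 32K-token request")
# MHA: 80.00 GiB · GQA: 10.00 GiB · MQA: 1.25 GiB
```

Cutting 64 KV heads to 8 (GQA) shrinks the cache 8×, and crucially the result is a fixed, predictable number: exactly the kind of target a challenger can size on-chip memory around.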
Hardware Design Freedom Unlocked by Convergence
- Operators can be fully fixed. Cutting from hundreds of operators down to a few core ones (MatMul, Softmax, RoPE, LayerNorm) is enough; all non-essential circuitry can be removed. The cost of GPU generality is many transistors wasted on paths that will never be invoked by LLMs.
- Dataflow can be planned at compile time. Transformer compute graphs are statically regular, with no need for the GPU’s complex dynamic scheduling — GPUs waste many transistors on warp schedulers, memory coalescing, and other circuitry that adapts to “unknown workloads.”
- Memory hierarchy can be tailored. Parameter size, KV-cache size, and activation size are all known in advance. A challenger can design SRAM/HBM/DRAM hierarchies directly for “7B model + 32K context” rather than reserving buffers for “arbitrary workloads” like GPUs do.
- Numeric precision can be aggressive. int4 / int8 is already inference norm; FP4 / FP6 is hot research. GPUs still have to retain FP32 / FP64 circuits for the scientific computing market — entirely redundant for LLMs.
- Extreme: bake weights into silicon. Taalas’s approach pushes “architectural convergence” to the logical extreme — etching the 32 layers of Llama 3.1 8B weights as physical transistors directly on the chip. The chip is the model itself, can’t be changed, but perf/watt is claimed to be 1000× better than a GPU cluster. See the “Training-Inference Split” section.
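The memory-hierarchy and precision bullets compound: with a fixed model and context, the whole weight budget is computable at design time. A toy sizing pass for the “7B model” example in the list above (numbers are illustrative assumptions):

```python
def weight_bytes(n_params: int, bits: int) -> int:
    """Weight storage for a model at a given numeric precision."""
    return n_params * bits // 8

PARAMS_7B = 7_000_000_000
for bits, label in ((16, "fp16"), (8, "int8"), (4, "int4")):
    gib = weight_bytes(PARAMS_7B, bits) / 2**30
    print(f"{label}: {gib:5.2f} GiB of weights")
```

At int4 the entire 7B model fits in roughly 3.3 GiB, which is why aggressive precision and a tailored SRAM/HBM hierarchy reinforce each other: a chip designer knows, before tape-out, exactly how much fast memory the target workload needs.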
The Moat Wasn’t Breached — It Was Bypassed
CUDA’s thickness advantage was premised on “needing to support hundreds of operators” — once models converge, that premise disappears.
Analogy: in the past Office’s moat was thousands of feature components; today people use Notion to write docs, and fewer than 50 of those thousands of features are touched — the moat wasn’t breached, it was bypassed.
This is why so many AI chip startups suddenly appeared in 2024–2026: it isn’t that capital suddenly grew, it’s that the design target finally stabilized, fundamentally raising the win rate of the bet. Layered on top of that, with AI coding tools like Claude Code / Codex becoming pervasive, the cost structure of software migration is being rewritten — CUDA’s moat is being doubly eroded.
Training-Inference Split — two markets, two logics

Image: Wikimedia Commons / Google · CC BY 4.0.
Mixing training and inference together leads to severe misjudgment of NVIDIA’s true position. Compute allocation is flipping dramatically: in 2023 training:inference ≈ 2:1, in 2025 it is 1:1, in 2026 it will be 1:2, and by 2029 close to 1:4. Lenovo executives at CES 2026 said flatly that the future is “20% training, 80% inference”; Jensen himself acknowledges the inference market will eventually be “about a billion times” larger than training. Over a single AI system’s lifecycle, inference accounts for 80–90% of total cost, training only 10–20%.
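The 80–90% lifecycle figure falls out of simple arithmetic: training is a one-time cost while serving accrues daily. A sketch with hypothetical dollar figures (the $100M training run and $500K/day serving bill are assumptions for illustration, not reported numbers):

```python
def inference_share(train_cost: float, daily_serving_cost: float, days: int) -> float:
    """Fraction of lifecycle cost spent on inference after `days` of deployment."""
    inference = daily_serving_cost * days
    return inference / (train_cost + inference)

# Hypothetical: $100M training run, $500K/day serving bill.
for days in (90, 365, 730):
    share = inference_share(100e6, 0.5e6, days)
    print(f"{days:4d} days deployed -> inference = {share:.0%} of lifecycle cost")
```

The longer a model stays in production, the more its economics look like the inference market; the scissors chart in the next block is this arithmetic extended across the whole industry.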
This means: NVIDIA’s true money-printing market (training) is becoming the relatively smaller one; the truly massive market (inference) is precisely where its moat is thinnest.
Training vs Inference · Scissors of Compute Share
2025 is the watershed — inference exceeds training for the first time. By 2029, training is only ~20%, inference ~80%. NVIDIA’s dominance in training is unassailable in the short term, but every inference workload characteristic works against it.
Training Market: Fortress, NVIDIA Unassailable Short Term
- Compute extremely concentrated. Tens of thousands of cards collaborating in a single training run, with extreme demands on NVLink / InfiniBand; non-GPU paths lack this complete ecosystem layer.
- Customers extremely concentrated. Only about ten players globally do frontier pre-training — OpenAI, Anthropic, xAI, Google, Meta, Microsoft, ByteDance, Mistral, DeepSeek, Cohere — so decision paths are short, and each of them speaks to Jensen directly.
- Error cost extremely high. One training run costs hundreds of millions of dollars and takes half a year. No one will gamble on immature hardware — “nobody got fired for buying NVIDIA” is the 2026 version of “nobody got fired for buying IBM.”
- Heaviest software stack. CUDA + NCCL + cuDNN + Megatron + DeepSpeed + various fused kernels — full-stack migration cost is enormous. Anthropic took over a year to migrate training workloads to TPU.
- Model architecture still evolves on the training side. MoE, long context, test-time scaling, sparse activation, mixed-precision training — training still needs hardware flexibility, which is the last comfort zone for GPU generality.
- Conclusion. In the training market, NVIDIA likely retains 80%+ share through 2028. Google TPU is the only competitor that has demonstrated training capability, but it’s only opened up to a handful of customers (Anthropic + Apple AFM training), not market competition.
Inference Market: Soft Spot, Every Characteristic Works Against NVIDIA
- Can be deployed in a distributed way. Single card or small clusters suffice; NVLink advantage is completely irrelevant. Cerebras / Groq / various in-house ASICs all hold their ground here.
- Compute pattern is highly regular. The same attention + FFN invoked millions of times — perfect for an ASIC. GPU generality is pure waste here.
- Low precision suffices. int8 / int4 will run; FP4 is experimentally usable. The FP32 / FP64 circuits retained on GPUs are silicon wasted for nothing.
- Memory wall matters more than compute wall. This is the natural advantage of alternative architectures (wafer-scale Cerebras, in-memory computing D-Matrix, near-memory Etched / MatX) — they are all designed for “memory bandwidth equals performance”; GPUs are not.
- Customers are fragmented and cost-sensitive. They look at tokens / dollar and tokens / watt, not peak compute. Google TPU delivers 4.7× price-performance and 67% lower energy on inference — Anthropic, Meta, and Midjourney have already migrated parts of their inference workloads.
- Model has converged. Hardware can fully optimize for the now-fixed Transformer (see “Architectural Convergence” section).
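The two metrics inference buyers actually compare can be written down directly. The accelerator profiles below are hypothetical placeholders for illustration, not vendor benchmarks:

```python
def tokens_per_dollar(tok_per_sec: float, hourly_cost_usd: float) -> float:
    """Throughput normalized by rental cost: the buyer's first question."""
    return tok_per_sec * 3600 / hourly_cost_usd

def tokens_per_watt(tok_per_sec: float, watts: float) -> float:
    """Throughput normalized by power draw: the data-center constraint."""
    return tok_per_sec / watts

# Hypothetical accelerator profiles, for illustration only.
profiles = {
    "general GPU":    dict(tok_per_sec=1500, hourly=4.00, watts=700),
    "inference ASIC": dict(tok_per_sec=3000, hourly=2.50, watts=300),
}
for name, p in profiles.items():
    print(f"{name:14s} {tokens_per_dollar(p['tok_per_sec'], p['hourly']):>12,.0f} tok/$"
          f"  {tokens_per_watt(p['tok_per_sec'], p['watts']):6.2f} tok/W")
```

A buyer running this comparison switches the moment the ASIC column wins, regardless of peak FLOPS: that is the “rational customer” mechanism from the opening section reduced to two functions.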
“The perception that GPUs were the only answer was NVIDIA’s biggest moat. Now that perception is collapsing.”
— Andrew Feldman, Cerebras founder · 2026
Extreme Path · Taalas: Model Vendors Could Directly Become Hardware Players
- Technical principle · weights = physical transistors. The 32 layers of Llama 3.1 are etched as physical transistors on the chip. No HBM, no loading weights from memory. An input vector enters, the electrical signal flows through the transistors corresponding to each layer, then to the next, until a token is produced. The whole chip = the model itself, unchangeable.
- Performance · 17,000 tok/s · 1000× perf/watt vs GPU cluster. 17,000 tokens/sec on Llama 3.1 8B; a single 250W air-cooled card matches an entire GPU cluster. Taalas uses an automated design flow, compressing the weights-to-silicon cycle to 2 months, only swapping the top metal mask — the architecture has been stable for 7–8 years without major change, substantially lowering depreciation risk.
- If it works · model vendors → hardware players. OpenAI sells GPT-Edition inference cards, Anthropic sells Claude-Edition inference boxes, bypassing both cloud vendors and NVIDIA. Enterprise deployment shifts from “buy GPU + download weights + configure CUDA” to “buy a box, plug it in.” Privacy and IP are solved simultaneously — weights physically reside in the customer’s data center, and extracting them is equivalent to reverse-engineering a chip. OpenAI’s $20B commitment to Cerebras procurement + cooperation with Broadcom on in-house inference chips is an early signal of exactly this logic.
Challenger Landscape (2026)
Funding / customers / approach basis · data sources Fortune, CNBC, Digitimes, TrendForce (2026 Q1).
| Category | Player | Approach / Advantage | Customers / Status | Key numbers |
|---|---|---|---|---|
| Hyperscaler in-house | Google · TPU | ASIC · train + inference dual stack | Internal + Anthropic + Apple AFM training | v4/v5p · inference 4.7× price-performance |
| Hyperscaler in-house | Amazon · Trainium / Inferentia | ASIC · inference focus | Anthropic · AWS internal | Trainium2 ramping |
| Hyperscaler in-house | Microsoft · Maia | ASIC · in-house cloud inference | Azure internal + OpenAI partial | Mass production since 2024 |
| Hyperscaler in-house | Meta · MTIA | ASIC · recs + LLM inference | Meta internal | MTIA v2 deployed |
| In-house vs GPU scissors | TrendForce 2026 shipment growth | ASIC +44.6% vs GPU +16.1% | Structural · not cyclical | 2027 ASIC = 15%+ of data center |
| Independent vendor | Cerebras · WSE | Wafer-scale · in-package | OpenAI reportedly procured $20B · IPO in progress | Mid-May IPO valuation $30B+ |
| Independent vendor | Groq · LPU | Low-latency token-batch | Acquired by NVIDIA for $20B (IP / team · 2025-12) | The acquisition itself proves the threat |
| Independent vendor | D-Matrix | in-memory compute | Microsoft-backed | 2026 $5B-class funding |
| Independent vendor | SambaNova | reconfigurable dataflow | Intel reportedly signed acquisition term sheet | Acquisition in progress |
| Independent vendor | Etched / MatX / Ayar Labs | Near-memory / optical interconnect | 2026 each took $5B-class funding | Silicon photonics trend |
| Independent vendor | Furiosa / Positron / Taalas | Korean + US newcomers · model-in-silicon | Taalas: Llama 3.1 8B @ 17K tok/s | 1000× perf/watt |
| Geopolitical · China | Huawei Ascend · Alibaba PPU · Cambricon · Biren | Export controls force maturation | China AI-chip market ~$50B/yr (Jensen estimate) | 2025 domestic substitution acceleration |
| Geopolitical · Europe | Fractile · Axelera · Euclyd · Optalysys | Photonics / analog compute | Early stage | Longer horizon |
| Startup total funding | 2026 global AI-chip startups | Capital has bet “NVIDIA can’t win forever” | 2026 cumulative funding | $8.3B |
Data sources: Fortune (2026-01), CNBC (2026-04-17), Digitimes, TrendForce 2026 Q1. Cerebras mid-May IPO timing is publicly disclosed; OpenAI-Cerebras $20B procurement intent is via media reports. NVIDIA itself has invested $4B as a hedge (photonic computing, analog / physical computing).
Political Risk — independent variable

Image: Wikimedia Commons / Public Domain.
NVIDIA’s political risk is widely underestimated. Jensen, on Dwarkesh Patel’s podcast on 2026-04-15, repeatedly made “pledge-of-loyalty”-style statements and got into a dispute with the host: when asked about the national-security risks of selling advanced chips to China, he lost his composure and replied “You’re not talking to somebody who woke up a loser.” Critics argue he repeatedly dodged direct answers on national-security questions, only emphasizing “selling more US technology is a good thing,” and was seen as overly utilitarian. Earlier on the Bg2 podcast, he called the “China hawk” label “a badge of shame,” prompting commentary that he was “auditioning for a Global Times editorial.”
Fundamental Differences vs Apple · Tesla
- Apple · supply-chain exposure · contributing to China. Apple’s China exposure is manufacturing supply chain + partial sales — India / Vietnam are diversifying. Money flow: China manufacturing → Apple procurement → supply-chain workers benefit. The narrative can be framed as “contributing to China.” Apple comes with built-in universal values, with a friendly posture in Congressional questioning.
- Tesla · market + factory · contributing industrial capability. The Tesla Shanghai factory is a template for the Chinese EV supply chain; Cybertruck / Model Y are sold to Chinese consumers. Musk comes with built-in MAGA credentials — at most subject to internal Democratic / Republican controversy, but no external “pro-CCP” treason-class risk.
- NVIDIA · selling to China · strategic materials · suspicion of treason. NVIDIA is selling *to* China, and what it sells is officially classified as “militarily critical technology” — a qualitative difference, not a quantitative one. Layered on top are Jensen’s ethnic Chinese identity and statements repeatedly criticized as “overly utilitarian”; US grassroots voters harbor natural distrust toward an ethnic-Chinese tech giant perceived as pro-CCP, and members of Congress will inevitably cater to this voter market.
Decoupling Script · Passive Rather Than Active
A signature event of “publicly endorsing Taiwan independence + denouncing the CCP” has low probability; the more likely path is passive ejection.
Jensen’s instinct is merchant-style evasion rather than political stance-taking: after calling Taiwan a “country” in 2024, he proactively clarified the next day that “this is not a geopolitical statement,” smoothing things over. So the more likely decoupling script is:
- US Congress pushes tighter export controls — the generation after H20 continues to get blocked; Sovereign AI legislation lands
- CFIUS intervenes on NVIDIA’s China cooperation / customer structure
- Board comes under pressure to revise China-related business disclosures; initiates strategic review
- Forced exit from the China market — the 25% data-center revenue figure goes to zero
- The result is similar but the posture is very different — not a heroic breakup, but a regulatory push.
So NVIDIA will inevitably be politically handled before Apple / Tesla. This is a second acceleration curve independent of the challenger structure.
Timing and Signals — when & how to watch

Image: Wikimedia Commons / 2017 · CC BY-SA 4.0.
This is the easiest place in the entire analysis to misjudge. NVIDIA’s decline is almost certain, but between “almost certain” and “starting next year” there is still a long road. Supply-chain lock-in (HBM, CoWoS, ODM full-rack capacity pre-booked years out) + training-side software inertia + large customers unwilling to switch all at once — these three buffer layers are very thick. The hard part for shorts isn’t direction, it’s timing.
The pie is growing: four major cloud vendors combined 2026 capex approaches $600B; Goldman Sachs estimates 2025–2027 total capex of $1.15T. Even with all challengers in position, NVIDIA AI data-center share in 2028 likely remains above 60% — share down but absolute revenue could still grow.
Three-Phase Rhythm
- Phase 1 · 2026–2027 · Apparently still winning. Total revenue still grows, gross margin holds at 70–75%. Hyperscaler in-house ASICs capture 10–15% share, but training market expansion offsets inference decline. On the surface NVIDIA appears unscathed — this is the phase most easily fooled by appearances.
- Phase 2 · 2027–2028 · The real inflection window. Inference market absolute size exceeds training; the dollar amount of inference outflow starts to suppress training growth, overall share visibly declines, gross margin slides toward 65–70%. This is the real inflection window — non-GAAP gross margin breaking 70% is a structural signal.
- Phase 3 · post-2028 · Fortress shaken. Google TPU, the AMD MI series, and possibly OpenAI’s in-house silicon make substantive breakthroughs on the training side; if the “model-in-silicon” path succeeds, the entire inference hardware market is reshuffled, and gross margin reverts toward below 60%. GPUs go from “allocation goods” back to “commodities.”
Non-GAAP Gross Margin · Three-Phase Trajectory
Reference: in 2022 the gaming-GPU cycle saw gross margin fall from 64% to 56% (over one year) — a reference for the speed of cyclical supply-demand reversal. The structural inflection will be slower but harder to recover from. The real alarm is non-GAAP gross margin breaking 70% and failing to recover.
Primary Alarms · Tracking Three Non-GAAP Gross Margin Thresholds
- Tier-1 alarm · gross margin < 70% · challengers gain meaningful pricing leverage. This is a structural signal, not cyclical fluctuation. It means major customers are negotiating with alternatives in hand, and Blackwell Ultra’s 35% price premium becomes hard to sustain.
- Tier-2 alarm · gross margin < 65% · entering a pricing-power-loss spiral. Even scale effects cannot rescue margins; 2–3 consecutive quarter-on-quarter declines confirm the trend.
- Tier-3 alarm · B-series / Rubin price cuts for inventory clearance · GPUs back to commodities. From “allocation goods” back to “commodities.” Once price cuts for inventory clearance appear, it means the structural supply-demand reversal is complete.
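The first two thresholds are mechanical enough to encode. A minimal tracker for the tiers above (thresholds come from this article’s framework; Tier-3 is a qualitative event, pricing behavior rather than a number, and is left out):

```python
def margin_alarm_tier(non_gaap_gross_margin_pct: float) -> int:
    """0 = structure intact, 1 = Tier-1 alarm (<70%), 2 = Tier-2 alarm (<65%)."""
    if non_gaap_gross_margin_pct >= 70:
        return 0
    if non_gaap_gross_margin_pct >= 65:
        return 1
    return 2

def tier2_confirmed(quarterly_margins: list[float]) -> bool:
    """Tier-2 confirmation: every quarter declines (2+ declines) and the
    latest reading sits below the 65% threshold."""
    declines = [b < a for a, b in zip(quarterly_margins, quarterly_margins[1:])]
    all_declining = len(declines) >= 2 and all(declines)
    return all_declining and quarterly_margins[-1] < 65

print(margin_alarm_tier(75), margin_alarm_tier(68), margin_alarm_tier(63))  # 0 1 2
print(tier2_confirmed([68.0, 66.5, 64.0]))  # True: two declines, ends below 65
```

The point of writing it down is that these are leading structural signals, not judgment calls: each quarterly report either trips a tier or it does not.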
Secondary Signals
- Hyperscaler in-house ASIC share. Whether “in-house ASIC share” in capex breaks 25%. Currently Google + Amazon combined ~15%; watch 2027 H2 data.
- Inference product-line pricing pressure. Pricing pressure on inference-optimized SKUs like B200 NVL will loosen earlier than training flagships.
- Model vendors’ in-house / procured non-GPU inference hardware actual shipments. Deals like OpenAI-Cerebras $20B should be judged by actual delivery, not announcement; the tape-out pace of OpenAI’s in-house silicon (with Broadcom) is a key node.
- Cerebras post-IPO valuation performance. Market pricing of alternative paths. If post-IPO valuation holds at $30B+, the public market endorses the challenger narrative; if it breaks below $15B, the market still believes the NVIDIA monopoly.
- China market actual revenue magnitude. Lagging indicator of geopolitical progress. If export controls tighten further and revenue share falls from 25% to below 5% — it’s the marker of the political script materializing.
- Sovereign AI-class legislation. Any legislation at the US Congress level that requires “pre-approval for sensitive compute shipments” is a signal that the political script is accelerating.
Synthesis — the inflection signal

Image: Wikimedia Commons / CC BY-SA 4.0.
The full chain of reasoning closes.
Six Structural Observations
- Architectural convergence unlocks hardware design degrees of freedom. Transformer is the new x86; CUDA wasn’t breached, it was bypassed.
- Challenger landscape now formed. Hyperscaler ASIC + tier-1 startups + China domestic substitution + model vendors going to silicon; 2026 global startups have raised $8.3B.
- Training fortress holds but inference soft spot exposed. 2025 is the watershed, 2029 train:inference at 1:4. Every inference workload characteristic works against NVIDIA.
- Model vendors may step into the hardware layer. OpenAI investing $20B in Cerebras + Broadcom in-house inference chip + Taalas model-in-silicon — the entire AI value chain may be re-sliced.
- Political risk accelerates this process. NVDA’s China exposure differs fundamentally from AAPL / TSLA’s — “selling strategic materials” vs “contributing to the supply chain”; Jensen’s ethnic Chinese background plus statements repeatedly criticized as “overly utilitarian” mean that, in the Congressional voter market, he will logically be targeted before Apple / Tesla.
- Time window longer than intuition suggests. HBM / CoWoS / ODM lock-in + training software inertia + large customers’ reluctance to switch wholesale — these three buffer layers are thick. The hard part for shorts isn’t direction, it’s timing.
Bottom Line
Dominance remains, and remains strong. But beyond the moat, challengers have moved from “catching up” to “bypassing.”
What really needs to be watched isn’t “who’s catching up” but “whether NVIDIA’s own inflection signal has appeared” — non-GAAP gross margin is that inflection signal.
Tracking cadence:
- Gross margin 75% → 70% (Tier-1 alarm)
- Gross margin 70% → 65% (Tier-2 alarm)
- B / Rubin price cuts for inventory clearance (Tier-3 alarm)
- Hyperscaler in-house ASIC share breaks 25%
- OpenAI-Cerebras $20B actual delivery
- Sovereign AI / export controls escalation
As long as gross margin stays above 70%, the monopoly structure stays stable; once it breaks 70% and cannot recover, the inflection signal has been triggered.
Apple’s real advantage in the AI era isn’t “I can also build models,” it’s “I don’t need to build models” (see /zh/signals/apple-ai). NVIDIA’s real danger isn’t “someone is chasing” — it’s “can’t slow down” — the rationality of B2B customers will reverse-accelerate the whole process. These two are two sides of the same structural observation.
— Contrarian research · 2026-04-26
References — official · papers · industry research · media
Official and First-hand
- NVIDIA Newsroom — product launches and quarterly updates. nvidianews.nvidia.com
- Wikimedia Commons — public images and primary source material. commons.wikimedia.org
Papers and Challengers
- arXiv 2510.05364 · NeurIPS 2025 candidate paper — research on GPU alternative architectures. arxiv.org/abs/2510.05364
- Taalas — “AI ASIC” startup baking models into silicon. taalas.com
- Cerebras Systems — Wafer-Scale Engine (WSE) approach. cerebras.net
Industry Data / Research Institutions
- Digitimes — Taiwan supply-chain weekly tracking. digitimes.com
- TrendForce — HBM / DRAM / server market share. trendforce.com
- Goldman Sachs Research — sell-side AI capex and compute demand models. goldmansachs.com
Media and Podcasts
- Computerworld · Lenovo executives at CES 2026 on GPU replacement. computerworld.com
- Fortune — industry profile interviews and strategy coverage. fortune.com
- CNBC — financial news and major NVDA event tracking. cnbc.com
- Dwarkesh Patel · Jensen Huang interview podcast. dwarkesh.com