
AI Hardware — March 27, 2026
70 Million TB/s: The Three-Lever Mechanism Driving AI Memory Bandwidth Growth
Epoch AI calculated 70 million terabytes per second of cumulative AI chip memory bandwidth as of Q4 2025, growing 4.1x per year. Here is the three-lever mechanism behind that rate and why HBM4’s logic base die changes inference capacity in 2027.
Sources: Epoch AI compute tracker Q4 2025; JEDEC HBM4 specifications; NVIDIA H100/H200 memory bandwidth specs; SK Hynix HBM4 roadmap; March 2026.
NVIDIA’s B200 GPU delivers 8 TB/s of HBM3e memory bandwidth per chip. A DGX B200 system with eight GPUs delivers 64 TB/s. A rack of four DGX systems totals 256 TB/s. Summed across every AI chip in Epoch AI’s tracker, the equivalent of nearly nine million B200s, aggregate memory bandwidth reaches 70 million TB/s. That number sounds abstract until you understand what it means for AI model training and inference: memory bandwidth, not compute FLOPS, is the bottleneck that determines how fast frontier AI models can run. The AI hardware race in 2026 is not about who has the most transistors. It is about who can move data to those transistors fastest.
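The scaling arithmetic here is easy to check. The sketch below uses only the figures quoted in this article (8 TB/s per B200, eight GPUs per DGX, four DGX systems per rack); the function and variable names are illustrative. It also shows how many B200-equivalents the 70 million TB/s total corresponds to, which is why that number is best read as a sum over a worldwide installed base rather than a single cluster.

```python
# Figures come from the article; names are illustrative, not a vendor API.
B200_TBPS = 8.0      # HBM3e bandwidth per B200 GPU, in TB/s
GPUS_PER_DGX = 8
DGX_PER_RACK = 4
TOTAL_TBPS = 70e6    # Epoch AI cumulative figure quoted above


def aggregate_tbps(num_gpus: int, per_gpu_tbps: float = B200_TBPS) -> float:
    """Sum of per-chip memory bandwidth across num_gpus chips."""
    return num_gpus * per_gpu_tbps


def chips_needed(target_tbps: float, per_gpu_tbps: float = B200_TBPS) -> float:
    """Number of chips whose combined bandwidth equals target_tbps."""
    return target_tbps / per_gpu_tbps


dgx_tbps = aggregate_tbps(GPUS_PER_DGX)                  # 64 TB/s
rack_tbps = aggregate_tbps(GPUS_PER_DGX * DGX_PER_RACK)  # 256 TB/s
b200_equivalents = chips_needed(TOTAL_TBPS)              # 8.75 million chips
rack_equivalents = TOTAL_TBPS / rack_tbps                # about 273,000 racks
```

Aggregate bandwidth is a simple sum, so it says nothing about whether those chips can cooperate on one workload. That depends on the interconnect.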
The three levers of AI hardware performance are compute (measured in FLOPS, how many operations per second), memory bandwidth (measured in TB/s, how fast data can be fed to the compute units), and interconnect (measured in GB/s per link, how fast GPUs can communicate with each other during distributed training). Every AI hardware generation improves all three. But the relative importance of each lever has shifted. In 2020, compute was the binding constraint: models needed more FLOPS than hardware could provide. By 2026, compute has scaled faster than memory bandwidth, creating a new bottleneck.
Why Memory Bandwidth Is the Bottleneck
A modern frontier model (GPT-5 class, 1 trillion+ parameters) stores its parameters in GPU memory (HBM). During inference, every token generated requires reading a significant portion of those parameters from memory and feeding them to the compute units. The compute units can process the data faster than the memory system can deliver it, so the GPU’s arithmetic logic units sit idle, waiting for data. This is the “memory wall” problem, and it sets the maximum tokens-per-second throughput for inference workloads.
The math is straightforward. A 1 trillion parameter model stored in FP16 (2 bytes per parameter) requires 2 TB of memory. Generating one token requires reading a fraction of those parameters (determined by the model architecture and batch size). At 8 TB/s of memory bandwidth, a B200 can read its entire local memory (about 192 GB of HBM3e) in roughly 24 milliseconds; reading the full 2 TB of weights at that rate would take 250 milliseconds. For models that exceed single-GPU memory capacity (which all frontier models do), the parameters are split across multiple GPUs, and the interconnect bandwidth determines how fast the split model can synchronize. The entire pipeline, from “user sends a query” to “model generates a response,” is gated by how fast data moves through memory and across interconnects, not by how fast the compute units can process it.
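A minimal sketch of that arithmetic follows. The 192 GB local-memory capacity is an assumed B200 figure, and the 16-GPU example assumes a dense model that reads every weight once per token at batch size 1 with perfect parallelism and no interconnect cost. All function names are illustrative.

```python
TB = 1e12  # bytes in a decimal terabyte
GB = 1e9   # bytes in a decimal gigabyte


def model_bytes(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights."""
    return num_params * bytes_per_param


def read_time_s(num_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to stream num_bytes once at the given bandwidth."""
    return num_bytes / bandwidth_bytes_per_s


def decode_ceiling_tokens_per_s(bytes_read_per_token: float,
                                bandwidth_bytes_per_s: float) -> float:
    """Upper bound on tokens/s when every token must read that many bytes."""
    return bandwidth_bytes_per_s / bytes_read_per_token


weights = model_bytes(1e12, 2)                 # FP16: 2 bytes/param, 2 TB total
full_read_s = read_time_s(weights, 8 * TB)     # 0.25 s at 8 TB/s
local_read_s = read_time_s(192 * GB, 8 * TB)   # about 0.024 s (192 GB is assumed)

# Hypothetical: weights sharded over 16 GPUs, each contributing 8 TB/s.
ceiling = decode_ceiling_tokens_per_s(weights, 16 * 8 * TB)  # 64 tokens/s
```

Real deployments land below this ceiling because synchronization over the interconnect and KV-cache reads add to the per-token traffic.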
The Three-Lever Mechanism
What This Means for AI Cost Structure
The memory bandwidth bottleneck directly affects AI inference economics. Inference cost is determined by how many tokens per second a GPU can generate, which is limited by memory bandwidth, not compute. A GPU with 2x the compute but the same memory bandwidth generates tokens at roughly the same speed for memory-bound workloads. This is why NVIDIA’s Blackwell generation focused on HBM3e memory (8 TB/s on the B200 vs. 3.35 TB/s on the H100) rather than dramatically increasing compute FLOPS. The compute improvement matters for training. The memory improvement matters for inference. And inference is 85% of enterprise AI spending in 2026.
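The claim that doubling compute does not help a memory-bound workload follows from the standard roofline model, sketched below. The peak-FLOPS value is a placeholder chosen for illustration, not a vendor specification. The arithmetic intensity of 1 FLOP per byte assumes FP16 decoding at batch size 1 (about 2 FLOPs per 2-byte parameter).

```python
def attainable_flops(peak_flops: float,
                     bandwidth_bytes_per_s: float,
                     intensity_flops_per_byte: float) -> float:
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak_flops, bandwidth_bytes_per_s * intensity_flops_per_byte)


PEAK = 2e15        # placeholder peak FLOP/s, illustrative only
BW = 8e12          # 8 TB/s, the B200 figure from the article
INTENSITY = 1.0    # FLOPs per byte, assumed for batch-1 FP16 decoding

baseline = attainable_flops(PEAK, BW, INTENSITY)
double_compute = attainable_flops(2 * PEAK, BW, INTENSITY)
double_bandwidth = attainable_flops(PEAK, 2 * BW, INTENSITY)

# In this regime extra compute changes nothing; extra bandwidth doubles throughput.
compute_gain = double_compute / baseline
bandwidth_gain = double_bandwidth / baseline
```

Under these assumptions the workload sits far to the left of the roofline's ridge point, so only the bandwidth term moves the result.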
Google’s TurboQuant, credited with 6x throughput improvements on Gemini models, works by reducing the precision of model weights, which reduces the amount of data that needs to be read from memory per token. Quantization (reducing weights from FP16 to INT4 or lower) is an algorithmic solution to a hardware problem: if you cannot increase memory bandwidth, reduce the amount of data that needs to flow through it. Every major inference optimization technique in 2026 (quantization, speculative decoding, KV-cache compression, mixture-of-experts routing) is fundamentally a technique for reducing memory bandwidth requirements.
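To make the bytes-per-token effect concrete, here is a generic sketch of symmetric per-tensor quantization together with the resulting reduction in weight traffic. It is a textbook scheme chosen for illustration, not TurboQuant’s method, and it ignores the small overhead of storing scale factors. All names are hypothetical.

```python
BITS = {"fp16": 16, "int8": 8, "int4": 4}


def weight_bytes(num_params: float, fmt: str) -> float:
    """Bytes needed to store num_params weights in the given format."""
    return num_params * BITS[fmt] / 8


def traffic_reduction(from_fmt: str, to_fmt: str) -> float:
    """Factor by which weight bytes read per token shrink (an upper bound)."""
    return BITS[from_fmt] / BITS[to_fmt]


def quantize_symmetric(weights: list[float], bits: int = 4):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:
        scale = 1.0                                  # all-zero input
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]


codes, scale = quantize_symmetric([0.7, -0.2, 0.0, 0.31])
restored = dequantize(codes, scale)
```

Going from FP16 to INT4 cuts weight traffic by at most 4x. The price is rounding error of up to half a scale step per weight, which is why production schemes use finer-grained scales or more elaborate codebooks.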
The HBM Supply Chain
HBM (High Bandwidth Memory) is manufactured by three companies: SK Hynix (South Korea), Samsung (South Korea), and Micron (United States). SK Hynix holds approximately 50% market share for HBM3e, the current generation. Samsung and Micron split the remainder. HBM production requires advanced packaging technology (stacking multiple DRAM dies with through-silicon vias) that is capacity-constrained. The demand for HBM from AI GPU manufacturers exceeds current production capacity, which is why GPU delivery timelines extend 6 to 12 months and why GPU prices remain elevated despite increasing production volumes.
The HBM supply constraint is the hidden bottleneck in the AI hardware supply chain. NVIDIA can design faster GPUs. TSMC can fabricate the GPU chips. But the complete GPU cannot ship without HBM, and HBM production scales more slowly than GPU demand. This constraint explains why NVIDIA’s data center revenue growth (while massive) is supply-constrained rather than demand-constrained. The company sells every GPU it can produce. The limit is how many GPUs it can produce, which is partially determined by HBM availability.
What Comes After the Memory Wall
The industry’s response to the memory wall operates on three timescales. In the near term (2026 to 2027), algorithmic optimizations (quantization, sparsity, KV-cache optimization) reduce memory bandwidth requirements without changing hardware. In the medium term (2027 to 2029), next-generation memory technologies (HBM4, with projected 2x bandwidth improvement over HBM3e) and compute-near-memory architectures (placing processing elements directly in the memory stack) attack the problem at the hardware level. In the long term (2029+), fundamentally new computing architectures (optical interconnects, photonic computing, neuromorphic chips) may eliminate the memory wall entirely by changing how compute and memory interact.
For AI builders in 2026, the memory bandwidth constraint has immediate practical implications. Inference cost per token is determined by memory bandwidth utilization, not compute utilization. Optimizing inference means optimizing memory access patterns. The cheapest way to reduce inference costs is not to buy more GPUs. It is to reduce the memory bandwidth each inference request consumes through quantization, batching, and caching. The companies that understand this are the ones running inference profitably. The companies that throw compute at a memory-bound problem are the ones burning money on GPUs whose arithmetic units sit idle waiting for data.
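Batching is the least obvious of these levers, so a rough model may help. The sketch assumes each decode step reads all weights once regardless of batch size, ignores KV-cache traffic, and uses placeholder hardware and model numbers (a hypothetical 8 billion parameter FP16 model and an illustrative peak-FLOPS value). None of these are measured figures.

```python
def step_time_s(batch: int, weight_bytes: float, flops_per_token: float,
                bandwidth_bytes_per_s: float, peak_flops: float) -> float:
    """One decode step takes the slower of the weight read and the math."""
    memory_time = weight_bytes / bandwidth_bytes_per_s
    compute_time = batch * flops_per_token / peak_flops
    return max(memory_time, compute_time)


def tokens_per_s(batch: int, weight_bytes: float, flops_per_token: float,
                 bandwidth_bytes_per_s: float, peak_flops: float) -> float:
    """Each step emits one token per sequence in the batch."""
    return batch / step_time_s(batch, weight_bytes, flops_per_token,
                               bandwidth_bytes_per_s, peak_flops)


# Placeholder numbers for illustration only.
PARAMS = 8e9
WEIGHT_BYTES = PARAMS * 2        # FP16
FLOPS_PER_TOKEN = 2 * PARAMS     # about 2 FLOPs per parameter per token
BW = 8e12                        # 8 TB/s
PEAK = 2e15                      # illustrative peak FLOP/s

args = (WEIGHT_BYTES, FLOPS_PER_TOKEN, BW, PEAK)
single = tokens_per_s(1, *args)      # memory-bound
batch_8 = tokens_per_s(8, *args)     # still memory-bound, 8x the throughput
saturated = tokens_per_s(500, *args) # compute-bound, throughput flattens
```

With these numbers throughput scales almost linearly with batch size until the compute time catches up with the weight-read time, after which larger batches add latency without adding tokens per second.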
Sources: NVIDIA Blackwell architecture white paper (B200 specifications); NVIDIA DGX B200 system specifications; AMD MI300X technical specifications; Google TPU v5e documentation; SK Hynix HBM3e production data; Samsung/Micron HBM market share (TrendForce); Google TurboQuant technical blog; AnalyticsWeek inference economics analysis; NVIDIA GTC 2026 presentations.