Tag: Inference Optimization

  • ASML Is the Only Company That Can Make AI Chips Possible. Its Next Machine Costs $400 Million.

    Semiconductor Hardware — March 2026

    High-NA EUV Shrinks Features From 13nm to 8nm.
    ASML Ships One Machine Per Month.

    ASML’s High-NA EUV lithography tools enable 8nm features vs 13nm on standard EUV. Intel takes the first shipments. The bottleneck for every AI chip generation is now a single Dutch factory shipping 12-15 units per year.

    0.55 NA aperture · 8nm feature size · €350M+ per machine · 1 supplier globally

    Sources: ASML annual report 2025; Intel investor day; ASML High-NA EUV technical specifications; SEMI equipment market data.

    ASML’s High-NA EUV lithography system (the EXE:5000 series) shipped its first units to Intel in 2025 and entered broader early adoption in 2026. The machine uses a numerical aperture of 0.55, up from 0.33 in standard EUV, which reduces the minimum resolvable feature size from approximately 13 nanometers to 8 nanometers. Every next-generation AI accelerator that requires denser transistor packing depends on either this machine or a future generation of it.

    ASML produces 12 to 15 High-NA EUV tools per year at its Veldhoven facility. That production rate, multiplied by the number of chipmakers who need the tool to stay competitive, defines the entire pace at which AI hardware can advance. ASML is a harder bottleneck for AI scaling than GPU availability, model architecture, or training data.

    How High-NA Changes the Physics

    Standard EUV (0.33 NA) achieves approximately 13nm half-pitch resolution and is used for TSMC N3 and Samsung 3nm nodes, with about 100 units shipped annually. High-NA EUV (0.55 NA) achieves approximately 8nm half-pitch resolution, replaces multipatterning with single-pass exposure, and targets Intel 14A and future TSMC N2P+ nodes. ASML ships 12 to 15 High-NA units per year as of 2026.

    The Rayleigh criterion defines the relationship: resolution = k1 × λ / NA. Higher NA means smaller minimum features at the same 13.5nm EUV wavelength. The shift from 0.33 to 0.55 NA also eliminates several multipatterning steps, improving yield and reducing defect density.
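    As a sanity check on those numbers, here is the Rayleigh arithmetic in a few lines of Python. The k1 factor is not an ASML specification; it is inferred from the 13nm / 0.33 NA pairing this section already cites.

    ```python
    # Rayleigh criterion: resolution = k1 * wavelength / NA.
    # k1 is backed out of the quoted standard-EUV numbers, not taken from a spec sheet.
    WAVELENGTH_NM = 13.5

    def resolution_nm(k1: float, na: float) -> float:
        return k1 * WAVELENGTH_NM / na

    k1 = 13.0 * 0.33 / WAVELENGTH_NM            # ~0.318, implied by 13nm at 0.33 NA
    print(f"standard EUV (0.33 NA): {resolution_nm(k1, 0.33):.1f} nm")   # ~13.0 nm
    print(f"High-NA EUV (0.55 NA):  {resolution_nm(k1, 0.55):.1f} nm")   # ~7.8 nm
    ```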

    Why This Is the Actual AI Chip Bottleneck

    NVIDIA’s Blackwell architecture and every planned successor require advancing process nodes to maintain the performance-per-watt improvements that make training and inference economically viable. Those process node advances require EUV, and the leading edge of EUV is High-NA. The supply chain runs: ASML ships 12 machines per year, fabs use them to produce next-generation wafers, chip designers tape out AI accelerators on those wafers, hyperscalers buy the chips. Constrain any step and the entire chain compresses.

    Export controls have already demonstrated this constraint. At U.S. request, Dutch export rules bar ASML from shipping EUV machines to Chinese chipmakers, with the restrictions tightened in 2023. China’s most advanced domestic chips are stuck at approximately 7nm nodes achievable with DUV (deep ultraviolet) lithography, several generations behind TSMC’s current production. High-NA EUV, which ASML cannot ship to China under current controls, represents a two-generation gap that cannot be closed by domestic Chinese tool development within the current decade.

    Limitations and What the Roadmap Does Not Tell You

    Production ramp is extremely slow: 12-15 units per year means each major fab gets 2-4 machines annually. Yield learning, tool calibration, and process development at the fab level take 12-18 months after installation before volume production begins.

    The pellicle problem: High-NA EUV requires new pellicle technology (thin membranes that protect the mask from particles during exposure). Pellicle production for High-NA is not yet at volume, constraining throughput.

    Throughput vs. standard EUV: High-NA tools currently achieve lower wafers-per-hour throughput than mature standard EUV. The economics only favor High-NA when the feature density gain outweighs the throughput penalty, which depends on the specific chip design.
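    To see where that trade-off flips, here is a deliberately simplified break-even sketch. Only the standard-EUV cost and throughput figures ($200M, 170 wafers per hour) come from later in this piece; the High-NA throughput of 120 wafers per hour and the tool lifetime are placeholder assumptions, not ASML or fab figures, and the model ignores multipatterning savings and every other cost component.

    ```python
    # Hypothetical break-even for High-NA vs. standard EUV on litho-tool cost alone.
    # Standard-EUV inputs ($200M, 170 wph) are quoted in this article; the High-NA
    # throughput (120 wph) and the 5-year, 80%-utilization life are assumptions.
    def litho_cost_per_wafer(tool_cost: float, wafers_per_hour: float,
                             life_hours: float = 5 * 365 * 24 * 0.8) -> float:
        return tool_cost / (wafers_per_hour * life_hours)

    std = litho_cost_per_wafer(200e6, 170)    # standard EUV
    hna = litho_cost_per_wafer(400e6, 120)    # High-NA EUV (throughput assumed)

    # High-NA wins on cost per transistor only if its density gain exceeds this ratio:
    print(f"required transistor-density gain > {hna / std:.2f}x")   # ~2.8x under these assumptions
    ```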

    ASML will produce more High-NA units as it scales Veldhoven capacity. The 12-15 per year figure is 2026 early production, not the steady-state. But every node transition in semiconductor history has taken longer than the announced roadmap. The AI chip supply chain is more dependent on ASML’s production ramp executing on schedule than on any single AI model architecture decision.

    How EUV Lithography Actually Works

    Extreme ultraviolet lithography prints circuit patterns using light with a 13.5-nanometer wavelength, roughly 14 times shorter than the deep ultraviolet (193nm) light used by the previous generation. Shorter wavelength means smaller features: EUV can print transistor features below 7 nanometers, enabling the chip densities that modern AI accelerators require. The physics is straightforward. The engineering is not.

    Generating EUV light requires hitting tiny droplets of molten tin with a high-powered laser 50,000 times per second. Each droplet explodes into a plasma that emits EUV photons. The photons are collected by a multilayer mirror with 70% reflectivity (compared to 99%+ for DUV lenses), bounced through a series of precision mirrors, and projected through a mask onto a silicon wafer coated with photoresist. The entire process happens inside a vacuum chamber because air absorbs EUV light. Every component operates at tolerances measured in picometers.

    ASML’s current EUV machines (the NXE series) cost approximately $200 million each and weigh 180 tons. They require their own building wing with dedicated power, cooling, and vibration isolation. A single machine can process 170 wafers per hour. TSMC, Samsung, and Intel operate these machines around the clock. The machines are so complex that ASML maintains permanent engineering teams at each customer site. No other company has successfully commercialized EUV lithography. Canon and Nikon never made the transition from DUV to EUV.

    Why High-NA Changes the Math

    The next generation, High-NA EUV (the EXE:5000 series), increases the numerical aperture from 0.33 to 0.55. It can print features 1.7 times smaller than current EUV, enabling sub-2nm chip geometries. The cost: $400 million per machine. The weight: over 250 tons. The precision requirements: mirror surfaces accurate to within 0.02 nanometers, less than the width of a single atom.

    ASML has delivered High-NA tools to Intel and TSMC for qualification testing. Volume production deployment is expected in 2026 to 2027. The transition timeline matters for AI chips because NVIDIA’s next-generation GPU architectures (post-Blackwell) will require High-NA EUV to achieve the transistor densities in their designs. If ASML’s production ramp delays, NVIDIA’s chip roadmap delays. If NVIDIA’s chip roadmap delays, the entire AI hardware supply chain delays.

    The concentration risk is absolute. ASML has zero competitors in EUV. If ASML’s single factory in Veldhoven, Netherlands, experienced a disruption, there is no alternative source of EUV lithography systems anywhere in the world. The entire semiconductor industry’s ability to manufacture advanced chips depends on one company, in one city, in one country. That is not a supply chain. That is a single point of failure.

    The geopolitical dimension adds another layer. ASML operates under Dutch export controls that, since October 2023, prohibit the sale of advanced lithography equipment to China. These restrictions were implemented at U.S. request and have effectively frozen Chinese semiconductor manufacturers at the DUV generation. China’s domestic alternatives (Shanghai Micro Electronics Equipment, SMEE) produce lithography systems roughly two generations behind ASML’s current EUV tools. The export controls mean ASML’s technology is not just commercially dominant. It is geopolitically contested, which adds regulatory and political risk to an already concentrated supply chain.

    Sources: ASML annual report 2025; ASML EXE:5000 product specifications; Intel investor day 2025; SEMI global equipment market data; Nature Electronics lithography review.

  • 70 Million TB/s: The Three-Lever Mechanism Driving AI’s Memory Bandwidth Growth

    AI Hardware — March 27, 2026

    70 Million TB/s: The Three-Lever Mechanism
    Driving AI Memory Bandwidth Growth.

    Epoch AI calculated 70 million terabytes per second of cumulative AI chip memory bandwidth as of Q4 2025, growing 4.1x per year. Here is the three-lever mechanism behind that rate and why HBM4’s logic base die changes inference capacity in 2027.

    70M TB/s cumulative: total AI chip memory bandwidth as of Q4 2025 (Epoch AI measurement), growing 4.1x annually.
    4.1x annual growth rate: faster than Moore’s Law for memory, with three independent levers driving compounding gains.
    HBM4, the 2027 step change: logic base die integration adds compute alongside memory, a fundamentally different architecture.
    Mix uncertainty: the unreported H100/H200 deployment mix introduces real uncertainty into Epoch AI’s estimates.

    Sources: Epoch AI compute tracker Q4 2025; JEDEC HBM4 specifications; NVIDIA H100/H200 memory bandwidth specs; SK Hynix HBM4 roadmap; March 2026.

    NVIDIA’s B200 GPU delivers 8 TB/s of HBM3e memory bandwidth per chip. A DGX B200 system with eight GPUs delivers 64 TB/s. A rack of four DGX systems reaches 256 TB/s. Summed over every AI accelerator deployed worldwide, Epoch AI puts cumulative memory bandwidth at roughly 70 million TB/s as of Q4 2025. That number sounds abstract until you understand what it means for AI model training and inference: memory bandwidth, not compute FLOPS, is the bottleneck that determines how fast frontier AI models can run. The AI hardware race in 2026 is not about who has the most transistors. It is about who can move data to those transistors fastest.

    The three levers of AI hardware performance are compute (measured in FLOPS, how many operations per second), memory bandwidth (measured in TB/s, how fast data can be fed to the compute units), and interconnect (measured in GB/s per link, how fast GPUs can communicate with each other during distributed training). Every AI hardware generation improves all three. But the relative importance of each lever has shifted. In 2020, compute was the binding constraint: models needed more FLOPS than hardware could provide. By 2026, compute has scaled faster than memory bandwidth, creating a new bottleneck.

    Why Memory Bandwidth Is the Bottleneck

    A modern frontier model (GPT-5 class, 1 trillion+ parameters) stores its parameters in GPU memory (HBM). During inference, every token generated requires reading a significant portion of those parameters from memory and feeding them to the compute units. The compute units can process the data faster than the memory system can deliver it. The GPU’s arithmetic logic units are idle, waiting for data. This is the “memory wall” problem, and it determines the maximum tokens-per-second throughput for inference workloads.

    The math is straightforward. A 1 trillion parameter model stored in FP16 requires 2 TB of memory. Generating one token requires reading a fraction of those parameters (determined by the model architecture and batch size). At 8 TB/s of memory bandwidth (B200), streaming the full 2 TB of parameters would take roughly 250 milliseconds even if they fit on a single GPU. For models that exceed single-GPU memory capacity (which all frontier models do), the parameters are split across multiple GPUs, and the interconnect bandwidth determines how fast the split model can synchronize. The entire pipeline, from “user sends a query” to “model generates a response,” is gated by how fast data moves through memory and across interconnects, not by how fast the compute units can process it.
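    A minimal sketch of that bandwidth arithmetic, assuming a dense model in which every parameter is streamed from memory once per generated token (mixture-of-experts routing and batching change the constants, not the structure):

    ```python
    def max_tokens_per_second(params: float, bytes_per_param: float,
                              tb_per_s_per_gpu: float, num_gpus: int = 1) -> float:
        """Bandwidth ceiling on single-stream decode: every parameter is read once per token."""
        bytes_per_token = params * bytes_per_param
        bytes_per_second = tb_per_s_per_gpu * 1e12 * num_gpus
        return bytes_per_second / bytes_per_token

    # 1T-parameter dense model in FP16, sharded across eight B200s at 8 TB/s each:
    print(max_tokens_per_second(1e12, 2.0, 8, num_gpus=8))   # ~32 tokens/s ceiling
    # The same model with 4-bit weights needs a quarter of the memory traffic:
    print(max_tokens_per_second(1e12, 0.5, 8, num_gpus=8))   # ~128 tokens/s ceiling
    ```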

    The Three-Lever Mechanism

    AI Hardware Scaling: The Three Levers
    Compute (FLOPS): NVIDIA B200 delivers 9 petaFLOPS of FP4 throughput. AMD MI300X delivers 5.3 petaFLOPS. Google TPU v5e delivers approximately 400 teraFLOPS per chip (but deployed in pods of thousands). Compute has scaled roughly 1000x since 2016 (Pascal to Blackwell). It is no longer the primary bottleneck for most workloads.
    Memory bandwidth (TB/s): B200 delivers 8 TB/s via HBM3e. The previous generation H100 delivered 3.35 TB/s via HBM3. That is a 2.4x improvement in one generation. Memory bandwidth has scaled roughly 10x since 2016, far slower than compute. This differential is the memory wall: compute improves faster than memory can feed it.
    Interconnect (NVLink, InfiniBand): NVIDIA’s NVLink 5.0 (Blackwell generation) delivers 1.8 TB/s bidirectional bandwidth between GPUs. The previous generation NVLink 4.0 delivered 900 GB/s. InfiniBand NDR delivers 400 Gb/s per port for inter-node communication. Interconnect determines how large a model can be distributed across GPUs without communication overhead dominating compute time.
    The imbalance: Compute has scaled 1000x. Memory bandwidth has scaled 10x. Interconnect has scaled approximately 20x. The gap between compute scaling and memory/interconnect scaling is the fundamental tension in AI hardware design. Every hardware generation since 2020 has been an attempt to close this gap.
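    A rough roofline view of that imbalance, using the figures above. This is a sketch only: it compares the FP4 peak against an FP16 weight-streaming decode workload, so treat it as an order-of-magnitude illustration rather than a like-for-like comparison.

    ```python
    # Machine balance vs. workload arithmetic intensity, using this section's numbers.
    peak_flops = 9e15             # B200 FP4 peak, as quoted above
    hbm_bytes_per_second = 8e12   # B200 HBM3e bandwidth

    machine_balance = peak_flops / hbm_bytes_per_second   # FLOPs the chip can do per byte loaded
    # Weight-streaming decode on a dense FP16 model: ~2 FLOPs per parameter, 2 bytes per parameter.
    decode_intensity = 2 / 2                              # ~1 FLOP actually needed per byte

    print(f"machine balance:  ~{machine_balance:.0f} FLOPs/byte")   # ~1125
    print(f"decode intensity: ~{decode_intensity:.0f} FLOP/byte")   # memory-bound by ~3 orders of magnitude
    ```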

    What This Means for AI Cost Structure

    The memory bandwidth bottleneck directly affects AI inference economics. Inference cost is determined by how many tokens per second a GPU can generate, which is limited by memory bandwidth, not compute. A GPU with 2x the compute but the same memory bandwidth generates tokens at roughly the same speed for memory-bound workloads. This is why NVIDIA’s Blackwell generation focused on HBM3e memory (8 TB/s vs 3.35 TB/s) rather than dramatically increasing compute FLOPS. The compute improvement matters for training. The memory improvement matters for inference. And inference is 85% of enterprise AI spending in 2026.

    Google’s TurboQuant optimization, which compresses the KV cache roughly 6x and speeds up attention computation by as much as 8x, works by reducing the precision of the values the model keeps in working memory during generation, which reduces the amount of data that must be read from memory per token. Quantization (reducing weights or cached activations from FP16 to INT4 or lower) is an algorithmic solution to a hardware problem: if you cannot increase memory bandwidth, reduce the amount of data that needs to flow through it. Every major inference optimization technique in 2026 (quantization, speculative decoding, KV-cache compression, mixture-of-experts routing) is fundamentally a technique for reducing memory bandwidth requirements.
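    A generic illustration of the mechanism: plain symmetric 4-bit group quantization, shown here only to make the byte accounting concrete. It is not TurboQuant’s algorithm (that is covered in the next piece), and the group size and scale format are arbitrary choices for the sketch.

    ```python
    import numpy as np

    def quantize_int4(w: np.ndarray, group_size: int = 32):
        """Symmetric 4-bit group quantization: one FP16 scale per group of 32 weights."""
        groups = w.reshape(-1, group_size)
        scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0    # int4 range -7..7
        q = np.clip(np.round(groups / scale), -7, 7).astype(np.int8)
        return q, scale.astype(np.float16)

    def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
        return (q * scale).reshape(-1)

    w = np.random.default_rng(0).standard_normal(4096).astype(np.float32)
    q, s = quantize_int4(w)
    fp16_bytes = w.size * 2
    int4_bytes = w.size // 2 + s.size * 2          # packed 4-bit codes plus per-group scales
    print(f"{fp16_bytes} bytes -> {int4_bytes} bytes ({fp16_bytes / int4_bytes:.1f}x less memory traffic)")
    print(f"max abs error: {np.abs(w - dequantize(q, s)).max():.3f}")
    ```

    Note how the per-group scales eat into the headline compression ratio; that overhead is exactly what more sophisticated schemes try to eliminate.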

    The HBM Supply Chain

    HBM (High Bandwidth Memory) is manufactured by three companies: SK Hynix (South Korea), Samsung (South Korea), and Micron (United States). SK Hynix holds approximately 50% market share for HBM3e, the current generation. Samsung and Micron split the remainder. HBM production requires advanced packaging technology (stacking multiple DRAM dies with through-silicon vias) that is capacity-constrained. The demand for HBM from AI GPU manufacturers exceeds current production capacity, which is why GPU delivery timelines extend 6 to 12 months and why GPU prices remain elevated despite increasing production volumes.

    The HBM supply constraint is the hidden bottleneck in the AI hardware supply chain. NVIDIA can design faster GPUs. TSMC can fabricate the GPU chips. But the complete GPU cannot ship without HBM, and HBM production scales more slowly than GPU demand. This constraint explains why NVIDIA’s data center revenue growth (while massive) is supply-constrained rather than demand-constrained. The company sells every GPU it can produce. The limit is how many GPUs it can produce, which is partially determined by HBM availability.

    What Comes After the Memory Wall

    The industry’s response to the memory wall operates on three timescales. In the near term (2026 to 2027), algorithmic optimizations (quantization, sparsity, KV-cache optimization) reduce memory bandwidth requirements without changing hardware. In the medium term (2027 to 2029), next-generation memory technologies (HBM4, with projected 2x bandwidth improvement over HBM3e) and compute-near-memory architectures (placing processing elements directly in the memory stack) attack the problem at the hardware level. In the long term (2029+), fundamentally new computing architectures (optical interconnects, photonic computing, neuromorphic chips) may eliminate the memory wall entirely by changing how compute and memory interact.

    For AI builders in 2026, the memory bandwidth constraint has immediate practical implications. Inference cost per token is determined by memory bandwidth utilization, not compute utilization. Optimizing inference means optimizing memory access patterns. The cheapest way to reduce inference costs is not to buy more GPUs. It is to reduce the memory bandwidth each inference request consumes through quantization, batching, and caching. The companies that understand this are the ones running inference profitably. The companies that throw compute at a memory-bound problem are the ones burning money on GPUs whose arithmetic units sit idle waiting for data.

    Sources: NVIDIA Blackwell architecture white paper (B200 specifications); NVIDIA DGX B200 system specifications; AMD MI300X technical specifications; Google TPU v5e documentation; SK Hynix HBM3e production data; Samsung/Micron HBM market share (TrendForce); Google TurboQuant technical blog; AnalyticsWeek inference economics analysis; NVIDIA GTC 2026 presentations.

  • How Google TurboQuant Compresses LLM Memory by 6x With Zero Accuracy Loss

    AI Research — March 26, 2026

    Google TurboQuant Compresses
    LLM Memory 6x. Zero Accuracy Loss.

    Google Research published TurboQuant: a KV-cache quantization algorithm that hits 3-bit compression with no measurable accuracy degradation on MMLU, GSM8K, and HumanEval. Here is the math and what it means for inference costs.

    6x memory reduction: the KV cache is compressed from 16-bit to 3-bit values, a 6x reduction in memory footprint.
    3-bit target precision: the previous state of the art was 4-bit with accuracy loss; TurboQuant reaches 3-bit with zero loss.
    0% accuracy loss: verified on MMLU, GSM8K, and HumanEval, with no measurable degradation at 3-bit.
    KV cache target: the key-value cache is the memory bottleneck for long-context inference, which makes it the right target.

    Sources: Google Research TurboQuant paper (arXiv); MMLU, GSM8K, HumanEval benchmark results; March 2026.

    Google Research published TurboQuant on March 25, 2026, a compression algorithm that reduces the key-value cache memory footprint of large language models by at least 6x while achieving zero measurable accuracy loss. The algorithm compresses KV cache values to 3 bits (down from the standard 16 bits), delivers up to 8x speedup on attention computation on NVIDIA H100 GPUs, and requires no training, fine-tuning, or calibration data. TurboQuant will be presented at ICLR 2026 in Rio de Janeiro alongside its two foundational methods: PolarQuant (AISTATS 2026) and QJL (AAAI 2025). The internet immediately called it Pied Piper.

    Memory chip stocks fell on the announcement. SK Hynix, Samsung, and Micron all dropped as investors calculated what happens to HBM demand if AI inference requires 6x less memory through software alone. Cloudflare CEO Matthew Prince called it “Google’s DeepSeek moment.” Whether the comparison holds depends on how fast TurboQuant moves from lab paper to production deployment.

    The Problem TurboQuant Solves

    When an LLM processes a conversation, it stores a running record of key-value pairs for every token in the context. This KV cache is the model’s working memory. For a 70-billion-parameter model serving 512 concurrent users, the KV cache alone can consume 512 GB of GPU memory, nearly four times the memory needed for the model weights. The KV cache grows linearly with context length. Every byte allocated to one user’s cache is a byte unavailable for another concurrent user. At 32K context, a single user’s cache approaches the size of the model itself. Double the context, halve your concurrent users.

    This is the binding economic constraint of LLM serving. It determines how many users a single GPU can handle, which determines revenue per GPU, which determines whether inference is profitable. Every architecture that shrinks the KV cache is directly attacking the most expensive bottleneck in AI deployment. TurboQuant attacks it with pure mathematics.
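    A back-of-envelope version of that working-memory arithmetic follows. The layer and head counts are assumptions for a generic 70B-class dense model with full multi-head attention, not any specific model’s published configuration; grouped-query attention or a different context length shifts the constants, not the conclusion.

    ```python
    def kv_cache_bytes(layers: int, heads: int, head_dim: int,
                       context_len: int, bytes_per_value: float = 2.0) -> float:
        """Per-user KV cache: a key and a value for every layer, head, and token."""
        return 2 * layers * heads * head_dim * context_len * bytes_per_value

    # Assumed 70B-class shape: 80 layers, 64 heads, head_dim 128, FP16 cache.
    for ctx in (4_096, 32_768):
        full = kv_cache_bytes(80, 64, 128, ctx)
        print(f"{ctx:>6} tokens: {full / 2**30:5.1f} GiB per user, "
              f"{full / 6 / 2**30:5.1f} GiB after a 6x compression")
    # ~80 GiB at 32K context, in the same ballpark as ~140 GB of FP16 weights,
    # which is the "cache approaches the size of the model" point made above.
    ```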

    How TurboQuant Works (The Two-Stage Method)

    TurboQuant uses a two-stage compression process that eliminates the overhead that makes most quantization techniques less effective than their headline numbers suggest. Traditional quantization compresses data vectors but must store additional normalization constants (one or two extra bits per number) that partially undo the compression gains.

    Stage 1 (PolarQuant) converts data vectors from Cartesian coordinates into polar coordinates, separating each vector into a magnitude and a set of angles. This geometric transformation makes the data more compressible because the angles have known statistical properties. PolarQuant then applies near-optimal quantization to the angular components, achieving high compression with minimal distortion. Stage 2 (QJL) applies the Johnson-Lindenstrauss Transform to the tiny residual error left from Stage 1. QJL reduces each residual to a single sign bit (+1 or -1), using just 1 bit of compression budget to eliminate the remaining bias in inner product estimates. The result: unbiased attention scores at 3 bits per value, with MSE distortion provably within a factor of approximately 2.7 of the information-theoretic lower bound.
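    To build intuition for the polar step, here is a toy sketch that quantizes consecutive coordinate pairs of a key vector in polar form: the magnitude is kept and the angle is stored in 3 bits. This is a simplification for intuition only, not the paper’s algorithm; TurboQuant’s actual pipeline uses near-optimal angular codebooks (PolarQuant) plus the 1-bit Johnson-Lindenstrauss residual stage (QJL) described above.

    ```python
    import numpy as np

    def polar_quantize_pairs(x: np.ndarray, angle_bits: int = 3):
        """Toy pairwise polar quantization: keep each pair's magnitude, quantize its angle."""
        pairs = x.reshape(-1, 2)
        radii = np.linalg.norm(pairs, axis=1)            # kept at full precision in this toy
        angles = np.arctan2(pairs[:, 1], pairs[:, 0])    # in (-pi, pi]
        step = 2 * np.pi / 2 ** angle_bits
        codes = np.round((angles + np.pi) / step).astype(int) % 2 ** angle_bits
        deq = codes * step - np.pi
        recon = np.stack([radii * np.cos(deq), radii * np.sin(deq)], axis=1)
        return recon.reshape(x.shape), codes.astype(np.uint8)

    key = np.random.default_rng(0).standard_normal(128).astype(np.float32)   # one attention key
    recon, codes = polar_quantize_pairs(key)
    print("relative L2 error:", float(np.linalg.norm(key - recon) / np.linalg.norm(key)))
    ```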

    What the Benchmarks Show

    Google tested TurboQuant across LongBench, Needle in a Haystack, ZeroSCROLLS, RULER, and L-Eval using Llama-3.1-8B-Instruct, Gemma, and Mistral models. At 3.5 bits per channel, TurboQuant achieved 100% recall on the Needle-in-a-Haystack benchmark up to 104K tokens, matching full-precision performance. Across all benchmarks, the compressed models scored identically to uncompressed baselines. The 4-bit mode achieves up to 8x speedup on H100 attention logit computation. TurboQuant consistently outperformed the existing KIVI baseline and all standard product quantization methods.

    Beyond LLM inference, TurboQuant improved vector search performance. Tested against RaBitQ and standard Product Quantization on the GloVe benchmark dataset, TurboQuant achieved superior recall ratios with virtually zero indexing time (0.0013 seconds for 1536-dimensional vectors). This matters because vector search underpins Google Search, YouTube recommendations, and advertising targeting.

    Why the Stock Market Reacted

    Honest Assessment of the Market Impact
    The fear: If AI inference requires 6x less memory through software, demand for HBM chips from SK Hynix, Samsung, and Micron drops proportionally. AI infrastructure spending ($490B projected for 2026) includes a significant memory component. A 6x compression could reduce the memory portion substantially.
    The reality check: TurboQuant has only been tested on models up to 8B parameters. It compresses KV cache (inference memory), not training memory. It does not reduce the memory needed for model weights, only for the working memory during generation. And Jevons’ Paradox applies: cheaper inference enables longer contexts and more concurrent users, which increases total memory demand.
    No production code yet: Google has not released official code or a library. Independent developers built implementations from the paper in PyTorch, MLX (Apple Silicon), and llama.cpp. Official open-source release is expected Q2 2026. The gap between lab paper and production deployment at data center scale is 6 to 18 months, not weeks.
    The real significance: TurboQuant approaches the information-theoretic limit for KV cache compression. There is not much room left to improve beyond this. The next efficiency gains will need to come from architectural changes (removing attention entirely, as Mamba-style models do), not from better compression of the existing KV cache.

    What This Changes for Edge AI

    A 6x reduction in inference memory means models that currently require an 80GB A100 for long-context inference could fit on a 16GB consumer GPU. Models that require a consumer GPU could fit on a laptop NPU. The Pied Piper comparison is appropriate in one specific way: TurboQuant could be the compression breakthrough that makes running capable LLMs on personal hardware practical. Independent developers built a working MLX implementation (for Apple Silicon) in 25 minutes using GPT-5.4. The Hugging Face community is already adapting it for llama.cpp, the most popular local inference framework.

    Google’s commercial motivation is clear. TurboQuant reduces the cost of running Gemini inference at scale. It also improves vector search performance, which directly affects Search, YouTube, and advertising revenue. Google did not publish this research for altruistic reasons. It published it because cheaper inference at higher quality is worth billions in reduced infrastructure costs annually. The algorithm is the plumbing for Google’s agentic AI era, where agents running multi-step workflows over long contexts need efficient memory to remain economically viable.

    Sources: Google Research blog, March 25, 2026; TechCrunch; VentureBeat; The Next Web; MarkTechPost; ICLR 2026 accepted paper; arXiv preprint (April 2025 original, March 2026 update).

    The Compression Ceiling

    TurboQuant’s MSE distortion is within a factor of 2.7 of the absolute theoretical limit (Shannon’s rate-distortion bound) across all bit-widths. At 1-bit compression, it is within a factor of 1.45 of optimal. This proximity to the information-theoretic boundary means there is very little room left for future compression improvements on the KV cache specifically. The next generation of inference efficiency will need to come from fundamentally different architectures: state-space models (Mamba), linear attention, or hybrid approaches that eliminate the KV cache bottleneck by design rather than by compression.
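    For a concrete reference point on those factors, the distortion-rate function of a unit-variance Gaussian source, D(R) = 2^(-2R) per dimension, is the classic stand-in for the information-theoretic floor. The Gaussian is an assumption for this sketch, not necessarily the paper’s exact source model; the gap factors are the ones quoted in this article.

    ```python
    # Gaussian distortion-rate floor D(R) = 2**(-2R), with the gap factors quoted above.
    for bits, factor in ((1, 1.45), (3, 2.7)):
        d_opt = 2 ** (-2 * bits)
        print(f"{bits}-bit: floor MSE ~{d_opt:.4f}, within {factor}x -> ~{factor * d_opt:.4f}")
    ```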

    That is the understated conclusion of the TurboQuant paper. It does not just solve the KV cache compression problem. It shows that the problem is nearly solved, period. Anyone hoping for another 6x improvement through better compression math will hit Shannon’s wall. The path forward runs through new architectures, not better codebooks. TurboQuant is likely the last major compression breakthrough for the attention mechanism as we know it. What replaces attention will determine whether the 6x improvement is the beginning of a new era or the final optimization of the current one.