How Google TurboQuant Compresses LLM Memory by 6x With Zero Accuracy Loss

AI Research — March 26, 2026

Google Research published TurboQuant: a KV-cache quantization algorithm that hits 3-bit compression with no measurable accuracy degradation on MMLU, GSM8K, and HumanEval. Here is the math and what it means for inference costs.

Memory Reduction: 6x. KV cache compressed from 16-bit to 3-bit.

Target Precision: 3-bit. Previous SOTA: 4-bit with accuracy loss. TurboQuant achieves 3-bit with zero loss.

Accuracy Loss: zero. Verified on MMLU, GSM8K, HumanEval. No measurable degradation at 3-bit.

Cache Target: KV cache. The key-value cache is the memory bottleneck for long-context inference. This is the right target.

Sources: Google Research TurboQuant paper (arXiv); MMLU, GSM8K, HumanEval benchmark results; March 2026.

Update, May 2, 2026: Six teams later proved QJL fails for KV cache because softmax amplifies its variance, and three new approaches replaced it. Read the post-mortem and the May 2026 successor analysis.

Google Research published TurboQuant on March 25, 2026, a compression algorithm that reduces the key-value cache memory footprint of large language models by at least 6x while achieving zero measurable accuracy loss. The algorithm compresses KV cache values to 3 bits (down from the standard 16 bits), delivers up to 8x speedup on attention computation on NVIDIA H100 GPUs, and requires no training, fine-tuning, or calibration data. TurboQuant will be presented at ICLR 2026 in Rio de Janeiro alongside its two foundational methods: PolarQuant (AISTATS 2026) and QJL (AAAI 2025). The internet immediately called it Pied Piper.

Memory chip stocks fell on the announcement. SK Hynix, Samsung, and Micron all dropped as investors calculated what happens to HBM demand if AI inference requires 6x less memory through software alone. Cloudflare CEO Matthew Prince called it “Google’s DeepSeek moment.” Whether the comparison holds depends on how fast TurboQuant moves from lab paper to production deployment.

The Problem TurboQuant Solves

When an LLM processes a conversation, it stores a running record of key-value pairs for every token in the context. This KV cache is the model’s working memory. For a 70-billion-parameter model serving 512 concurrent users, the KV cache alone can consume 512 GB of GPU memory, nearly four times the memory needed for the model weights. The KV cache grows linearly with context length. Every byte allocated to one user’s cache is a byte unavailable for another concurrent user. At 32K context, a single user’s cache approaches the size of the model itself. Double the context, halve your concurrent users.
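To make the scaling concrete, here is a rough sizing sketch. The layer count, KV head count, and head dimension below are hypothetical values for a large grouped-query-attention model. They are not taken from the TurboQuant paper or from any specific deployment, and real footprints vary a lot with architecture.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, users=1, bits=16):
    """Bytes needed to store keys and values for `users` sequences of `seq_len` tokens."""
    # One key vector and one value vector per layer, per KV head, per token.
    elements_per_token = 2 * layers * kv_heads * head_dim
    return elements_per_token * seq_len * users * bits / 8


def users_that_fit(budget_bytes, layers, kv_heads, head_dim, seq_len, bits=16):
    """How many concurrent sequences fit in a fixed cache budget."""
    per_user = kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits=bits)
    return int(budget_bytes // per_user)


GIB = 2 ** 30
# Hypothetical model shape, chosen only for illustration.
cfg = dict(layers=80, kv_heads=8, head_dim=128)

print(kv_cache_bytes(seq_len=32_768, **cfg) / GIB)          # 10.0 (16-bit cache)
print(kv_cache_bytes(seq_len=32_768, bits=3, **cfg) / GIB)  # 1.875 (3-bit cache)

budget = 40 * GIB  # assumed memory left over for the cache on one GPU
print(users_that_fit(budget, seq_len=32_768, **cfg))          # 4
print(users_that_fit(budget, seq_len=65_536, **cfg))          # 2
print(users_that_fit(budget, seq_len=32_768, bits=3, **cfg))  # 21
```

The second and third results show the trade-off described above: doubling the context halves the number of users, while lowering the bits per value raises it. The sketch counts only raw payload bits and ignores any metadata a real cache format stores.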

This is the binding economic constraint of LLM serving. It determines how many users a single GPU can handle, which determines revenue per GPU, which determines whether inference is profitable. Every architecture that shrinks the KV cache is directly attacking the most expensive bottleneck in AI deployment. TurboQuant attacks it with pure mathematics.

How TurboQuant Works (The Two-Stage Method)

TurboQuant uses a two-stage compression process that eliminates the overhead that makes most quantization techniques less effective than their headline numbers suggest. Traditional quantization compresses data vectors but must store additional normalization constants (one or two extra bits per number) that partially undo the compression gains.
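The overhead is easy to quantify. The sketch below assumes a conventional group-wise scheme in which every group of values shares one 16-bit scale and one 16-bit zero point. The group sizes and constant widths are illustrative assumptions, not figures from the paper.

```python
def effective_bits(payload_bits, group_size, scale_bits=16, zero_bits=16):
    """Average bits per number once the per-group constants are counted."""
    return payload_bits + (scale_bits + zero_bits) / group_size


print(effective_bits(3, group_size=32))               # 4.0
print(effective_bits(3, group_size=16))               # 5.0
print(effective_bits(3, group_size=32, zero_bits=0))  # 3.5
```

Under these assumptions a nominal 3-bit format costs 4 to 5 bits per number in practice, which is the "one or two extra bits" described above.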

Stage 1 (PolarQuant) converts data vectors from Cartesian coordinates into polar coordinates, separating each vector into a magnitude and a set of angles. This geometric transformation makes the data more compressible because the angles have known statistical properties. PolarQuant then applies near-optimal quantization to the angular components, achieving high compression with minimal distortion.

Stage 2 (QJL) applies the Johnson-Lindenstrauss Transform to the tiny residual error left from Stage 1. QJL reduces each residual to a single sign bit (+1 or -1), using just 1 bit of compression budget to eliminate the remaining bias in inner product estimates.

The result: unbiased attention scores at 3 bits per value, with MSE distortion provably within a factor of approximately 2.7 of the information-theoretic lower bound.
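The two-stage idea can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not Google's implementation. Stage 1 here is a simple uniform scalar quantizer standing in for PolarQuant, the projection is a dense Gaussian matrix, and every function name is hypothetical. The number of sign bits `m` is also left as a free parameter so the convergence of the estimate is easy to see.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def uniform_quantize(vec, bits):
    """Stage-1 stand-in (NOT PolarQuant): uniform scalar quantizer on [-peak, peak]."""
    levels = 2 ** bits
    peak = max(abs(x) for x in vec) or 1.0
    step = 2 * peak / levels
    out = []
    for x in vec:
        idx = min(int((x + peak) / step), levels - 1)
        out.append(-peak + (idx + 0.5) * step)  # midpoint of the chosen bin
    return out


def make_projection(m, d, rng):
    """m x d matrix of i.i.d. standard normal entries (assumed JL projection)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]


def qjl_encode(residual, proj):
    """Keep one sign bit per projected coordinate, plus the residual's norm."""
    signs = [1 if dot(row, residual) >= 0 else -1 for row in proj]
    return signs, math.sqrt(dot(residual, residual))


def qjl_inner_product(query, signs, norm, proj):
    """Unbiased estimate of <query, residual> from sign bits and the norm."""
    acc = sum(dot(row, query) * s for row, s in zip(proj, signs))
    return math.sqrt(math.pi / 2) * norm * acc / len(proj)


def estimate_score(query, key, proj, bits=2):
    """Attention logit estimate: coarse stage plus sign-bit residual correction."""
    coarse = uniform_quantize(key, bits)
    residual = [k - c for k, c in zip(key, coarse)]
    signs, norm = qjl_encode(residual, proj)
    return dot(query, coarse) + qjl_inner_product(query, signs, norm, proj)


if __name__ == "__main__":
    rng = random.Random(0)
    d = 16
    key = [rng.gauss(0, 1) for _ in range(d)]
    query = [rng.gauss(0, 1) for _ in range(d)]
    proj = make_projection(2048, d, rng)
    print("exact   :", dot(query, key))
    print("coarse  :", dot(query, uniform_quantize(key, 2)))
    print("2-stage :", estimate_score(query, key, proj))
```

The correction term is unbiased because, for a Gaussian row s, the expectation of (s·q)·sign(s·r) equals sqrt(2/π)·⟨q, r⟩/‖r‖, so rescaling by sqrt(π/2)·‖r‖ recovers ⟨q, r⟩ on average. The estimate still carries variance, which shrinks as more sign bits are used.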

What the Benchmarks Show

Google tested TurboQuant across LongBench, Needle in a Haystack, ZeroSCROLLS, RULER, and L-Eval using Llama-3.1-8B-Instruct, Gemma, and Mistral models. At 3.5 bits per channel, TurboQuant achieved 100% recall on the Needle-in-a-Haystack benchmark up to 104K tokens, matching full-precision performance. Across all benchmarks, the compressed models scored identically to uncompressed baselines. The 4-bit mode achieves up to 8x speedup on H100 attention logit computation. TurboQuant consistently outperformed the existing KIVI baseline and all standard product quantization methods.

Beyond LLM inference, TurboQuant improved vector search performance. Tested against RaBitQ and standard Product Quantization on the GloVe benchmark dataset, TurboQuant achieved superior recall ratios with virtually zero indexing time (0.0013 seconds for 1536-dimensional vectors). This matters because vector search underpins Google Search, YouTube recommendations, and advertising targeting.

Why the Stock Market Reacted

Honest Assessment of the Market Impact

The fear: If AI inference requires 6x less memory through software, demand for HBM chips from SK Hynix, Samsung, and Micron drops proportionally. AI infrastructure spending ($490B projected for 2026) includes a significant memory component. A 6x compression could reduce the memory portion substantially.

The reality check: TurboQuant has only been tested on models up to 8B parameters. It compresses KV cache (inference memory), not training memory. It does not reduce the memory needed for model weights, only for the working memory during generation. And Jevons’ Paradox applies: cheaper inference enables longer contexts and more concurrent users, which increases total memory demand.

No production code yet: Google has not released official code or a library. Independent developers built implementations from the paper in PyTorch, MLX (Apple Silicon), and llama.cpp. Official open-source release is expected Q2 2026. The gap between lab paper and production deployment at data center scale is 6 to 18 months, not weeks.

The real significance: TurboQuant approaches the information-theoretic limit for KV cache compression. There is not much room left to improve beyond this. The next efficiency gains will need to come from architectural changes (removing attention entirely, as Mamba-style models do), not from better compression of the existing KV cache.

What This Changes for Edge AI

A 6x reduction in KV cache memory means long-context workloads that currently require an 80GB A100, because the cache rather than the weights dominates their footprint, could fit on a 16GB consumer GPU. Workloads that require a consumer GPU could fit on a laptop NPU. The Pied Piper comparison is appropriate in one specific way: TurboQuant could be the compression breakthrough that makes running capable LLMs on personal hardware practical. Independent developers built a working MLX implementation (for Apple Silicon) in 25 minutes using GPT-5.4. The Hugging Face community is already adapting it for llama.cpp, the most popular local inference framework.
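Whether a workload actually fits depends on how much of its footprint is cache, since the weights are not compressed. The back-of-the-envelope check below uses hypothetical sizes for a small model. None of the figures come from the paper.

```python
def total_after_compression(weights_gb, kv_gb, kv_ratio=6.0):
    """Memory footprint when only the KV cache shrinks by `kv_ratio`."""
    return weights_gb + kv_gb / kv_ratio


def fits(weights_gb, kv_gb, device_gb, kv_ratio=6.0):
    return total_after_compression(weights_gb, kv_gb, kv_ratio) <= device_gb


# Hypothetical long-context workload with a 60 GB cache.
print(total_after_compression(4, 60), fits(4, 60, device_gb=16))    # 14.0 True
print(total_after_compression(16, 60), fits(16, 60, device_gb=16))  # 26.0 False
```

In this example, the same 60 GB cache fits on a 16GB device only when the weights are already small (4 GB here, as with an aggressively quantized model). With 16 GB of weights the total is still 26 GB after compressing the cache.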

Google’s commercial motivation is clear. TurboQuant reduces the cost of running Gemini inference at scale. It also improves vector search performance, which directly affects Search, YouTube, and advertising revenue. Google did not publish this research for altruistic reasons. It published it because cheaper inference at higher quality is worth billions in reduced infrastructure costs annually. The algorithm is the plumbing for Google’s agentic AI era, where agents running multi-step workflows over long contexts need efficient memory to remain economically viable.

Sources: Google Research blog, March 25, 2026; TechCrunch; VentureBeat; The Next Web; MarkTechPost; ICLR 2026 accepted paper; arXiv preprint (April 2025 original, March 2026 update).

The Compression Ceiling

TurboQuant’s MSE distortion is within a factor of 2.7 of the absolute theoretical limit (Shannon’s rate-distortion bound) across all bit-widths. At 1-bit compression, it is within a factor of 1.45 of optimal. This proximity to the information-theoretic boundary means there is very little room left for future compression improvements on the KV cache specifically. The next generation of inference efficiency will need to come from fundamentally different architectures: state-space models (Mamba), linear attention, or hybrid approaches that eliminate the KV cache bottleneck by design rather than by compression.
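One way to read the 2.7 factor is in bits. The sketch below assumes the bounds take the form commonly quoted for unit-norm vectors: an upper bound of (√3·π/2)·4^(-b) on TurboQuant's MSE against a lower bound of 4^(-b), where b is the bits per coordinate. These constants are an assumption here, since the article quotes only the rounded factor.

```python
import math


def mse_upper_bound(bits):
    """Assumed TurboQuant MSE bound for unit-norm vectors."""
    return (math.sqrt(3) * math.pi / 2) * 4.0 ** (-bits)


def mse_lower_bound(bits):
    """Assumed information-theoretic lower bound for any quantizer."""
    return 4.0 ** (-bits)


def gap_factor(bits):
    return mse_upper_bound(bits) / mse_lower_bound(bits)


def gap_in_bits(bits):
    # Each extra bit divides the bound by 4, so a factor g is worth log4(g) bits.
    return math.log(gap_factor(bits), 4)


for b in (2, 3, 4):
    print(b, round(gap_factor(b), 2), round(gap_in_bits(b), 2))
```

Under these assumptions the gap is about 2.72 at every bit-width, which is worth roughly 0.72 bits per coordinate. Even a perfect quantizer would save less than one bit per value relative to TurboQuant, which is why the remaining headroom is so small.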

That is the understated conclusion of the TurboQuant paper. It does not just solve the KV cache compression problem. It shows that the problem is nearly solved, period. Anyone hoping for another 6x improvement through better compression math will hit Shannon’s wall. The path forward runs through new architectures, not better codebooks. TurboQuant is likely the last major compression breakthrough for the attention mechanism as we know it. What replaces attention will determine whether the 6x improvement is the beginning of a new era or the final optimization of the current one.
