
On March 25, 2026, Google Research published TurboQuant, a quantization algorithm that compresses the key-value (KV) cache in large language models to 3 bits per value while maintaining accuracy on every benchmark the team tested. The KV cache is the model's working memory: it stores one key-value pair for every token the model has processed, and as context windows grow, that memory cost comes to dominate the system budget. TurboQuant reduces it by at least 6x and delivers up to an 8x speedup on attention computation on NVIDIA H100 GPUs. The paper will be presented at ICLR 2026 in Rio de Janeiro.
The result matters because KV cache memory is the binding constraint for most production LLM deployments. It determines how many concurrent users a single GPU can serve, how long a context window can extend before running out of memory, and how much it costs to run inference at scale. TurboQuant addresses that constraint through software alone, without requiring model retraining, fine-tuning, or access to the training data.
The Memory Tax on Every LLM Deployment
Decoder-only transformers, the architecture behind GPT, Claude, Gemini, and Llama, store a key and a value vector for every token in the context window, at every layer. A model processing 100,000 tokens accumulates 100,000 key-value pairs per layer, each stored at full floating-point precision. For models with dozens of layers and attention heads, this memory footprint grows linearly with sequence length and can easily consume more GPU memory than the model weights themselves.
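A back-of-the-envelope calculation shows the scale involved. The layer count, head count, and head dimension below are hypothetical (chosen to resemble an 8B-class model), not taken from any specific model card:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bits_per_value):
    """Total KV cache size: one key and one value vector per token per layer."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bits_per_value // 8

# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 100k tokens
fp16 = kv_cache_bytes(100_000, 32, 8, 128, 16)
q3 = kv_cache_bytes(100_000, 32, 8, 128, 3)
print(f"fp16 KV cache:  {fp16 / 2**30:.1f} GiB")  # roughly 12 GiB
print(f"3-bit KV cache: {q3 / 2**30:.1f} GiB")    # roughly 2.3 GiB
```

Even under these modest assumptions, the full-precision cache for a single long-context request exceeds the memory needed for many 8B-parameter models' weights.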
The standard approach to reducing this cost is quantization: replacing high-precision floating-point values with lower-precision representations. Existing KV cache quantizers such as KIVI can compress to 4 or 8 bits, but they introduce measurable accuracy degradation, especially on tasks that require finding specific information buried in long contexts. The core problem is that conventional quantizers must store normalization constants for every small block of data, adding 1 to 2 extra bits of overhead per number. That overhead erodes the compression gains.
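The arithmetic behind that overhead is simple. Assuming a conventional quantizer that stores an fp16 scale, and optionally an fp16 zero point, for each block of values (block sizes here are illustrative):

```python
def normalization_overhead(block_size, constants_per_block, bits_per_constant=16):
    """Extra bits per quantized value spent on per-block normalization constants."""
    return constants_per_block * bits_per_constant / block_size

# One fp16 scale per 16-value block: 1 extra bit per value
print(normalization_overhead(16, 1))  # 1.0
# fp16 scale plus fp16 zero point per 16-value block: 2 extra bits per value
print(normalization_overhead(16, 2))  # 2.0
```

At a nominal 3 bits per value, 1 to 2 bits of constants is a 33 to 67 percent tax on the budget, which is why eliminating the constants matters so much at low bit widths.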
This is not a theoretical concern. In production, KV cache memory limits determine whether you can serve a 128,000-token context at all, how many requests you can batch on a single GPU, and how much latency each request adds. For companies running inference at scale, every bit of KV cache precision translates directly into hardware cost.
How TurboQuant Eliminates the Overhead
TurboQuant combines two new techniques developed by a team including Amir Zandieh, Vahab Mirrokni, Praneeth Kacham, Majid Hadian, Insu Han, Majid Daliri, Lars Gottesbüren, and Rajesh Jayaram, spanning Google Research, Google DeepMind, KAIST, and NYU.
The first technique, called PolarQuant, transforms the data from standard Cartesian coordinates into polar coordinates before quantizing. Conventional quantizers process each dimension independently, which requires per-block normalization to handle variation in scale. PolarQuant maps pairs of coordinates to a radius and angle. Because the angular distribution after a random rotation is concentrated and predictable, PolarQuant eliminates the normalization step entirely. No normalization means no overhead bits.
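A toy sketch of the idea, assuming a uniform grid over angles; the paper's actual codebook construction may differ, and the radius is left unquantized here to keep the example short:

```python
import numpy as np

def polar_quantize_pairs(x, angle_bits=3):
    """Quantize consecutive coordinate pairs as (radius, angle); only the
    angle is quantized, to a uniform grid, with no per-block scale stored."""
    pairs = x.reshape(-1, 2)
    radius = np.linalg.norm(pairs, axis=1)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])  # angle in [-pi, pi]
    levels = 2 ** angle_bits
    # Map each angle to its bin index; the modulo folds theta == pi into bin 0
    code = np.floor((theta + np.pi) / (2 * np.pi) * levels).astype(np.uint8) % levels
    # Reconstruct from bin centers, so the worst-case angle error is pi/levels
    theta_hat = (code + 0.5) / levels * 2 * np.pi - np.pi
    recon = np.stack([radius * np.cos(theta_hat),
                      radius * np.sin(theta_hat)], axis=1)
    return code, recon.reshape(x.shape)

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
code, x_hat = polar_quantize_pairs(x, angle_bits=6)
```

Note what is absent: no per-block minimum, maximum, or scale is stored anywhere, because the angle grid is the same for every pair.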
The random rotation is the critical trick. Before converting to polar form, TurboQuant applies a random orthogonal matrix (computed via QR decomposition of a random Gaussian matrix) to the data vectors. This “random preconditioning” step ensures that the data distribution becomes uniform enough for the polar quantizer to work optimally, regardless of the original data structure. It is training-free and data-oblivious: you do not need to see the data in advance to apply it.
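That preconditioner is easy to sketch with standard linear algebra. A minimal version (not the paper's optimized implementation, which would matter at production dimensions):

```python
import numpy as np

def random_rotation(dim, seed=0):
    """Random orthogonal matrix via QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    gaussian = rng.standard_normal((dim, dim))
    q, r = np.linalg.qr(gaussian)
    # Sign correction removes the bias of the QR sign convention, making the
    # rotation uniformly (Haar) distributed over orthogonal matrices.
    return q * np.sign(np.diag(r))

Q = random_rotation(128)
x = np.random.default_rng(1).standard_normal(128)
# Orthogonal matrices preserve norms exactly, so the rotation loses nothing
assert np.allclose(np.linalg.norm(Q @ x), np.linalg.norm(x))
```

Because the rotation is generated from a seed rather than learned from data, any party holding the same seed can reproduce it, which is what makes the scheme training-free.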
The second technique, called QJL (Quantized Johnson-Lindenstrauss), handles the residual error left over after PolarQuant compression. QJL uses a Johnson-Lindenstrauss transform to project the residual into a lower-dimensional space and reduces each projected value to a single sign bit, +1 or -1. This adds no normalization overhead and eliminates the systematic bias that would otherwise accumulate in the attention-score computation. The combination of PolarQuant for primary compression and QJL for error correction is what gets TurboQuant to 3 bits per value with no measurable accuracy loss.
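A minimal sketch in the spirit of QJL (the projection size and constants below are illustrative; the paper's exact estimator may differ): keys are reduced to sign bits, yet inner products against an unquantized query remain approximately unbiased, via the identity E[sign(<s,k>) * <s,q>] = sqrt(2/pi) * <q,k> / ||k|| for Gaussian s.

```python
import numpy as np

def qjl_encode(k, S):
    """Project k with a Gaussian JL matrix S and keep only the sign bits."""
    return np.sign(S @ k)

def qjl_inner_product(bits, q, S, k_norm):
    """Unbiased estimate of <q, k> from k's sign bits and the projected query."""
    m = S.shape[0]
    return np.sqrt(np.pi / 2) / m * k_norm * (bits @ (S @ q))

rng = np.random.default_rng(0)
d, m = 64, 8192                      # m projection rows, 1 stored bit each
S = rng.standard_normal((m, d))
k = rng.standard_normal(d)           # stand-in for a residual key vector
q = rng.standard_normal(d)           # stand-in for a query vector
est = qjl_inner_product(qjl_encode(k, S), q, S, np.linalg.norm(k))
# est approximates q @ k, with error shrinking as O(1/sqrt(m))
```

The key property is the absence of systematic bias: individual estimates are noisy, but the noise averages out across the attention computation rather than pushing scores consistently in one direction.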
Benchmark Results Across Five Evaluations
Google evaluated TurboQuant on five standard long-context benchmarks: LongBench, Needle-in-a-Haystack, ZeroSCROLLS, RULER, and L-Eval. The test models were Llama-3.1-8B-Instruct and Mistral-7B, both openly available. On the Needle-in-a-Haystack evaluation, which requires locating a single target sentence hidden in 100,000 words of irrelevant text, TurboQuant achieved perfect recall, matching the scores of uncompressed 16-bit models.
On NVIDIA H100 hardware, 4-bit TurboQuant produced up to an 8x speedup in computing attention logits compared to uncompressed 32-bit keys. Memory reduction reached at least 6x relative to standard KV cache storage. Across all five benchmarks and both models, TurboQuant matched or exceeded the performance of every existing compression baseline the team tested, including KIVI and RaBitQ.
The algorithm also performs well on vector search tasks beyond LLM inference. In evaluations against Product Quantization and RaBitQ on the GloVe dataset (200 dimensions), TurboQuant achieved higher recall across top-k retrieval settings while requiring virtually no indexing time. For teams running retrieval-augmented generation pipelines or semantic search at scale, this dual applicability to both KV cache compression and vector similarity search is significant.
What Developers Should Know About Limitations
TurboQuant is not without caveats. The researchers acknowledge that poor random-seed handling in the rotation step could introduce small bias, though they argue the effect is negligible in high dimensions. The benchmarks used open-source models at the 7B to 8B parameter scale; behavior on much larger models (70B+) or on proprietary architectures with different attention implementations has not been publicly evaluated.
Compression breakthroughs have a long history of performing better in controlled benchmarks than in messy production environments. Whether TurboQuant maintains its quality neutrality across adversarial edge cases, extremely long multi-turn conversations, code generation tasks, and multilingual inputs remains to be demonstrated at scale.
Open-source code is widely expected around Q2 2026. Within 24 hours of publication, according to VentureBeat, community developers had begun porting TurboQuant to MLX (Apple Silicon) and llama.cpp. Integration into production serving frameworks like vLLM and Hugging Face text-generation-inference will likely follow once the reference implementation ships.
The Jevons Paradox and What Comes Next
The market reacted to TurboQuant in two phases. Memory chip stocks initially dipped on the logic that 6x compression means less memory demand. Morgan Stanley then published a note arguing the opposite: cheaper inference per query will drive higher query volumes, increasing total compute demand rather than decreasing it. This is the Jevons paradox applied to AI infrastructure, the same dynamic that played out when DeepSeek demonstrated that more efficient training does not reduce aggregate training compute.
Cloudflare CEO Matthew Prince publicly compared TurboQuant to a “DeepSeek moment” for inference, a breakthrough driven by algorithmic elegance rather than hardware brute force. The comparison is apt in structure if not yet proven in scale. DeepSeek changed how the industry thought about training costs. TurboQuant may do the same for serving costs, but only if the benchmarks hold up across the full diversity of production workloads.
Google’s timing is also strategic. Both TurboQuant and its component techniques (PolarQuant at ICLR 2026, QJL at AISTATS 2026) are being presented at top-tier venues in Q2. By releasing the research under an open framework, Google provides the foundational plumbing for an era of long-context, memory-efficient AI agents that can run on hardware organizations already own.
For developers and infrastructure teams, the practical takeaway is concrete. If you are running LLM inference and your KV cache memory is the bottleneck limiting context length or concurrent users, TurboQuant is the paper to read this month. It is training-free, data-oblivious, and the math is elegant enough that community ports are already appearing. The question is not whether the algorithm works on benchmarks. The question is how quickly it gets integrated into the serving stacks that production systems actually use.
Sources: Google Research, VentureBeat, TechCrunch, Help Net Security, SiliconANGLE, TurboQuant.net