
Google Research published TurboQuant on March 24, 2026, claiming 6x compression of the KV cache with zero accuracy loss. Memory chip stocks dropped. The AI community called it Google’s DeepSeek moment. Then independent developers actually implemented it and discovered something the paper doesn’t tell you: the algorithm’s key innovation, a component called QJL, makes KV cache performance worse in practice. Six independent teams across Python, C, Rust, and Triton confirmed the same finding within a week. The part that works is the simpler first stage. The part the paper emphasizes as novel doesn’t.
TurboQuant targets the single largest memory bottleneck in running large language models: the key-value cache. Every time a transformer generates a token, it stores key and value vectors for every previous token at every layer so it doesn’t recompute them. Llama 3 70B at 128K tokens burns 40 GB on the KV cache alone. That is more than most GPUs have. The cache grows linearly with context length, which means longer conversations and larger documents require proportionally more memory. Compressing the KV cache from 16-bit to 3 or 4 bits would let the same hardware handle dramatically longer contexts, serve more concurrent users, or run larger models.
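The 40 GB figure is easy to verify from Llama 3 70B's published architecture (80 layers, 8 KV heads under grouped-query attention, head dimension 128). A back-of-envelope calculation:

```python
# KV cache size for Llama 3 70B: 80 layers, 8 KV heads
# (grouped-query attention), head dimension 128, FP16 storage.
layers, kv_heads, head_dim = 80, 8, 128
bytes_per_elem = 2                 # FP16
context = 128 * 1024               # 128K tokens

# Each token stores one K and one V vector per layer.
per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
total_gb = per_token * context / 1024**3
print(f"{per_token // 1024} KB/token, {total_gb:.0f} GB at 128K context")

# The same cache at 3 bits per element instead of 16:
print(f"at 3 bits: {total_gb * 3 / 16:.1f} GB")
```

That is 320 KB of cache per token; dropping from 16 bits to 3 shrinks the 40 GB to 7.5 GB, which fits comfortably alongside quantized weights on commodity hardware.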
How TurboQuant Actually Works
The algorithm has two stages. The first stage, PolarQuant, applies a random orthogonal rotation to each KV vector before quantizing it. This rotation spreads the energy of the vector uniformly across all coordinates. Without rotation, some coordinates carry 1,000x more energy than others, which makes uniform quantization wasteful. After rotation, every coordinate follows a predictable Beta distribution, which means you can precompute mathematically optimal quantization buckets using the Lloyd-Max algorithm once, ahead of time, with no calibration data and no model-specific tuning. Point it at any transformer and it works.
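A minimal sketch of that first stage, built only from the description above: rotate, then quantize every coordinate against one precomputed codebook. The QR-based rotation and the tiny Lloyd-Max fitter are illustrative stand-ins; real implementations use fast Hadamard-style rotations and the paper's closed-form buckets for the post-rotation distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
d, bits = 64, 3
levels = 2 ** bits                      # 8 codebook entries at 3 bits

# Random orthogonal rotation via QR (stand-in for a fast Hadamard rotation).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

def lloyd_max(samples, k, iters=30):
    """1-D Lloyd-Max: alternate nearest-centroid assignment and centroid
    re-estimation, converging to the MSE-optimal scalar codebook."""
    c = np.quantile(samples, np.linspace(0.05, 0.95, k))
    for _ in range(iters):
        idx = np.abs(samples[:, None] - c[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(idx == j):
                c[j] = samples[idx == j].mean()
    return np.sort(c)

# One offline fit on synthetic rotated unit vectors: after rotation every
# coordinate follows the same predictable distribution, so a single
# codebook serves any model, with no calibration data.
u = rng.standard_normal((2048, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)
codebook = lloyd_max((u @ Q.T).ravel(), levels)

def quantize(v):
    r = Q @ (v / np.linalg.norm(v))                       # rotate, normalize
    return np.abs(r[:, None] - codebook).argmin(axis=1)   # 3-bit codes

def dequantize(codes, norm):
    return norm * (Q.T @ codebook[codes])                 # unrotate, rescale

v = rng.standard_normal(d)
v_hat = dequantize(quantize(v), np.linalg.norm(v))
rel_err = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
```

Decode is a table lookup followed by the inverse rotation, which is why the stage maps so cleanly onto existing inference kernels.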
The second stage, Quantized Johnson-Lindenstrauss (QJL), allocates one bit per coordinate to correct for the bias that PolarQuant introduces. PolarQuant’s quantization systematically underestimates inner products. QJL projects the quantization residual through a random Gaussian matrix and keeps only the sign bits, producing an unbiased estimator of the true inner product. The combined system uses (b-1) bits for PolarQuant and 1 bit for QJL at any given bit budget b. The paper claims this two-stage design achieves near-optimal distortion, within 2.7x of the information-theoretic lower bound across all bit widths.
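The sign-bit estimator follows from one Gaussian identity: for a Gaussian row s, E[sign(⟨s,k⟩)·⟨s,q⟩] = √(2/π)·⟨q,k⟩/‖k‖, so storing sign(Sk) plus ‖k‖ is enough to estimate ⟨q,k⟩ without bias. A sketch with illustrative dimensions (not code from the paper, which released none):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 64, 256                      # head dim, sign bits per key

def qjl_encode(k, S):
    """Keep only the sign bits of the Gaussian projection, plus ||k||."""
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_inner(q, signs, k_norm, S, m):
    # E[sign(<s,k>) * <s,q>] = sqrt(2/pi) * <q,k> / ||k|| for Gaussian s,
    # so this scaled average is an unbiased estimate of <q, k>.
    return np.sqrt(np.pi / 2) * k_norm * (signs @ (S @ q)) / m

q, k = rng.standard_normal(d), rng.standard_normal(d)
true = q @ k

# Any single estimate is noisy; averaging over fresh projections shows
# the bias really is zero -- the variance is what softmax later amplifies.
ests = []
for _ in range(400):
    S = rng.standard_normal((m, d))
    signs, k_norm = qjl_encode(k, S)
    ests.append(qjl_inner(q, signs, k_norm, S, m))
bias = np.mean(ests) - true
spread = np.std(ests)
```

The estimator is unbiased by construction, but each individual estimate carries substantial variance, which is exactly the property at issue in the next section.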
The benchmarks support the headline: on LongBench, Needle-in-Haystack, ZeroSCROLLS, and RULER tasks, TurboQuant at 3 bits matched FP16 quality on Gemma and Mistral models up to roughly 8 billion parameters. Attention computation ran up to 8x faster on H100 GPUs. No retraining, no fine-tuning, no calibration. These numbers are real. The problem is what happens when you try to use both stages together in a real inference pipeline.
Why the Key Innovation Doesn’t Work for KV Cache
Six independent implementations, built in Python, C, Rust, and Triton by teams with no coordination, converged on the same finding: removing QJL and allocating all bits to PolarQuant’s Lloyd-Max centroids produces better results than the two-stage design.
The mechanism is straightforward. QJL eliminates bias but introduces variance. For raw inner products, that tradeoff is favorable. But transformer attention runs inner products through softmax, and softmax exponentially amplifies variance. A small amount of random noise in every dot product gets magnified into large swings in the attention distribution. The scos-lab implementation measured 300% error with QJL enabled versus 7.6% without on GPT-2. The tonbistudio PyTorch implementation found that 0 out of 27 generation tests passed with QJL (V2), while 18 out of 18 passed without it (V3). Multiple llama.cpp contributors independently dropped QJL from their implementations after observing the same degradation.
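The amplification is easy to reproduce in isolation. A toy two-token example (mine, not from any of the implementations above) shows that zero-mean noise on attention logits comes out of softmax both spread out and biased:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# True logits: attend ~88% to the first of two tokens.
logits = np.array([2.0, 0.0])
clean = softmax(logits)                   # ~[0.881, 0.119]

# Simulate an unbiased but noisy logit estimator (QJL-style): the noise
# has zero mean, so the *logits* are estimated correctly on average.
noisy = softmax(logits + rng.standard_normal((20000, 2)))

# After softmax the attention weights are both noisy and biased:
# E[softmax(x + n)] != softmax(x) because softmax is nonlinear.
mean_attn = noisy.mean(axis=0)
spread = noisy[:, 0].std()
```

The unbiasedness QJL guarantees holds for the dot products, not for what the model actually uses, the post-softmax attention weights, and the per-token swings compound across layers and decode steps.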
The paper’s theoretical analysis is correct: QJL does produce unbiased inner product estimates. But the paper benchmarks against aggregate quality metrics like perplexity and task scores, not against per-token generation fidelity. When you run the full autoregressive decode loop, the variance from QJL accumulates across layers and tokens, producing visible degradation that summary metrics can mask.
There is a caveat. QJL works when you control the entire attention kernel and can feed the two-part representation (PolarQuant centroids plus QJL sign bits) directly into the dot-product computation. Through a standard attention path, where the vector must be reconstructed before attention is computed, the reconstruction noise dominates. For most real deployments, PolarQuant alone, the stage the paper treats as less interesting, is the pragmatic choice. QJL also holds up in vector search, its other advertised use case, because there is no softmax to amplify the variance.
An update in late March 2026 added nuance: one implementation found that using independent sign patterns for the PolarQuant rotation (Walsh-Hadamard Transform) and the QJL projection (Subsampled Randomized Hadamard Transform) actually improved perplexity. The story is still evolving. But the initial consensus among implementers holds: at 3+ bits, all bits to Lloyd-Max centroids outperforms the two-stage design.
What the Paper Doesn’t Benchmark
TurboQuant was tested on models up to roughly 8 billion parameters. The paper does not evaluate 70B or 405B scale models, which is exactly where KV cache compression matters most because the cache sizes become prohibitive. Community implementations have tested on larger models (Qwen3.5-35B-A3B showed 6.20 perplexity versus 6.19 baseline), but these are not from the paper authors.
The paper also does not address key-value asymmetry. In practice, key vectors and value vectors have different sensitivity to quantization. Keys determine which tokens the model attends to, requiring precision. Values are the content that gets averaged together, where errors cancel more naturally. Community benchmarks found that allocating 4 bits to keys and 2 bits to values (average 3 bits) dramatically outperforms uniform 3-bit allocation at the same bit budget. Some models exhibit extreme K/V norm ratios: Qwen models show key norms of 172 to 778 versus value norms of 2 to 4. For these architectures, a single compression scheme is insufficient.
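A toy harness makes the budget arithmetic concrete. This uses synthetic Gaussian data and per-row uniform quantization as a crude stand-in for the Lloyd-Max codebooks; it illustrates the allocation trade, not the community benchmark itself.

```python
import numpy as np

rng = np.random.default_rng(3)

def quantize(x, bits):
    """Per-row uniform quantization (crude stand-in for Lloyd-Max)."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    step = (hi - lo) / (2**bits - 1)
    return np.round((x - lo) / step) * step + lo

def rel_err(x, bits):
    return np.linalg.norm(x - quantize(x, bits)) / np.linalg.norm(x)

K = rng.standard_normal((4096, 128))   # synthetic keys
V = rng.standard_normal((4096, 128))   # synthetic values

# Same 3-bit average budget, two allocations.
k3, v3 = rel_err(K, 3), rel_err(V, 3)  # uniform 3 / 3
k4, v2 = rel_err(K, 4), rel_err(V, 2)  # asymmetric 4 / 2
```

The 4/2 split roughly halves key error in exchange for a larger value error. Per the community finding above, that trade wins because key error is softmax-amplified while value error enters linearly and averages out under the attention weights.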
A separate attribution controversy adds context. Researchers behind RaBitQ at ETH Zurich publicly raised concerns on Zhihu and OpenReview about structural similarities between TurboQuant and their prior work, specifically the core mechanism of random rotation followed by quantization. RaBitQ targeted vector databases at 1 bit per dimension and was published at SIGMOD 2025. TurboQuant targets KV caches at 3-4 bits. The underlying technique overlaps. The paper’s characterization of the relationship was called insufficient by the RaBitQ authors.
NVIDIA’s Competing Approach Does 20x
TurboQuant is not the only KV cache compression method at ICLR 2026. NVIDIA’s KVTC (KV Cache Transform Coding) achieves 20x compression with less than one percentage point of accuracy loss, tested on models from 1.5B to 70B parameters, a significantly wider range than TurboQuant’s benchmarks. KVTC uses PCA-based decorrelation and entropy coding borrowed from JPEG compression. Unlike TurboQuant’s data-oblivious design, KVTC requires a one-time calibration step per model to compute a PCA alignment matrix offline.
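The calibrated half of that pipeline can be sketched generically. This is ordinary PCA truncation plus coarse quantization on synthetic low-rank data, not NVIDIA's code, and KVTC's entropy-coding stage is omitted:

```python
import numpy as np

rng = np.random.default_rng(4)
dim, keep, bits = 128, 32, 4

# Offline calibration: fit a PCA basis once per model on cached KV vectors.
# Synthetic stand-in: approximately low-rank data, as KV activations tend to be.
M = rng.standard_normal((16, dim))
calib = rng.standard_normal((8192, 16)) @ M
calib += 0.05 * rng.standard_normal(calib.shape)
mean = calib.mean(axis=0)
_, _, Vt = np.linalg.svd(calib - mean, full_matrices=False)

def encode(x):
    """Decorrelate with the calibrated basis, keep the leading components,
    quantize them coarsely. (KVTC then entropy-codes the result.)"""
    z = (x - mean) @ Vt[:keep].T
    step = np.abs(z).max() / (2**(bits - 1) - 1)
    return np.round(z / step).astype(int), step

def decode(codes, step):
    return (codes * step) @ Vt[:keep] + mean

x = rng.standard_normal(16) @ M        # a fresh vector from the same model
codes, step = encode(x)
rel_err = np.linalg.norm(x - decode(codes, step)) / np.linalg.norm(x)
ratio = (dim * 16) / (keep * bits)     # FP16 baseline vs stored codes
```

Because the PCA basis is fit to one specific model's activations, the scheme squeezes harder than a data-oblivious rotation can, which is the source of both its higher compression and its calibration requirement.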
The tradeoff is architectural. TurboQuant works out of the box on any transformer with no preprocessing. KVTC delivers 3x more compression but needs calibration data and integrates into NVIDIA’s Dynamo inference framework. For cloud providers running a fixed set of models at massive scale, KVTC’s approach is likely superior. For developers running local inference on varied models, TurboQuant’s zero-configuration design is more practical. NVIDIA researcher Adrian Lancucki predicted the emergence of a dedicated, standardized compression layer, given structural similarities across model architectures.
What Actually Matters
Google released no code. Every working implementation was built by the community from the paper. As of early April 2026, no major inference framework has merged TurboQuant. Open pull requests exist in vLLM (three competing PRs), SGLang, llama.cpp, and MLX. The llama.cpp discussion thread alone has generated over 100 comments and spawned at least eight independent forks. This is unusual momentum for a research method.
The practical takeaway for anyone deploying LLMs: 4-bit KV cache compression is the current sweet spot. At 4 bits, quality is indistinguishable from FP16 on 3B+ parameter models. At 3 bits, quality degrades on models smaller than 8B. The rotation step (PolarQuant) is the real contribution. It transforms the quantization problem from intractable (outlier-dominated distributions) to tractable (uniform distributions with known optimal codebooks). QJL is an elegant theoretical addition that doesn’t survive contact with softmax.
The inference cost equation changes when KV cache drops to 3-4 bits. A model that hits out-of-memory at 16K context on a 16 GB GPU can push past that boundary without new hardware. For agentic workflows running through MCP, where context windows accumulate tool calls and intermediate results, compressed KV caches could be the difference between a viable local deployment and a cloud dependency. The algorithm that does this is simpler than the paper suggests. It is a rotation and a table lookup. The hard part was proving it was optimal.
Sources: Google Research blog (March 24, 2026). TurboQuant paper (arXiv:2504.19874, ICLR 2026). llama.cpp Discussion #20969. tonbistudio/turboquant-pytorch. scos-lab/turboquant. TechCrunch. NVIDIA KVTC (ICLR 2026). DEV Community implementation guide.