
Three weeks ago I covered why six independent teams concluded that TurboQuant’s QJL stage fails for KV cache compression. The mechanism was clean: softmax exponentially amplifies variance, and QJL’s unbiased one-bit residual correction is a variance source that gets eaten alive in the autoregressive decode loop. PolarQuant rotation survived. QJL did not.
What replaced it is more interesting than what failed. In April 2026, three approaches moved into the slot QJL was supposed to occupy, and none of them do quantization. TriAttention from MIT, NVIDIA, and Zhejiang University compresses by selection. LRKV from fin.ai compresses by architecture. Adaptive per-token bit-width controllers compress by allocation. They are orthogonal to each other, orthogonal to PolarQuant, and they stack.
The headline number worth tracking is no longer 6x. With the right combination, the long-context KV footprint is now reducible by an order of magnitude beyond what TurboQuant claimed, and unlike the original two-stage paper, none of the survivors need to pretend their key innovation works.
Where post-QJL KV compression actually lives
The starting point for any 2026 deployment is a fact simpler than the paper suggests: PolarQuant alone is the entire useful contribution of TurboQuant. The random rotation transforms a non-uniform distribution with heavy outlier tails into a known Beta distribution, where Lloyd-Max scalar quantization lands at near-optimal bits-per-coordinate without any per-group metadata. For 4-bit KV on H100-class hardware, this is the floor. Everything else stacks on top.
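To make the mechanism concrete, here is a minimal sketch of the rotate-then-quantize idea. The uniform quantizer stands in for the Lloyd-Max codebook, and the dimensions, bit width, and synthetic data are illustrative; nothing below is the TurboQuant reference code.

```python
import numpy as np

def random_rotation(d, seed=0):
    """Draw a Haar-uniform orthogonal matrix via QR of a Gaussian."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def quantize_kv(x, rotation, bits=4):
    """Rotate, then scalar-quantize every coordinate to `bits` bits.

    Uniform quantization is a stand-in here; the actual PolarQuant stage
    fits a Lloyd-Max codebook to the post-rotation distribution.
    """
    z = x @ rotation                   # outlier mass gets spread across coordinates
    lo, hi = z.min(), z.max()          # one global scale: no per-group metadata
    codes = np.round((z - lo) / (hi - lo) * (2**bits - 1)).astype(np.uint8)
    return codes, (lo, hi)

def dequantize_kv(codes, scale, rotation, bits=4):
    lo, hi = scale
    z = codes.astype(np.float32) / (2**bits - 1) * (hi - lo) + lo
    return z @ rotation.T              # undo the rotation

# Heavy-tailed synthetic keys, the regime where unrotated quantizers fail
keys = np.random.default_rng(1).standard_t(df=3, size=(1024, 128))
R = random_rotation(128)
codes, scale = quantize_kv(keys, R)
recon = dequantize_kv(codes, scale, R)
print("relative error:", np.linalg.norm(recon - keys) / np.linalg.norm(keys))
```

The point of the rotation is visible in the single global (lo, hi) pair: because outlier mass is smeared across coordinates, one scale serves the whole tensor, which is exactly the no-per-group-metadata property the method trades on.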
The question that drove April’s papers is what to add. Quantization gets you about 4x before quality degrades. The remaining compression has to come from somewhere else. Three places, specifically: dropping tokens that do not matter (selection), reducing the per-head dimensionality of the cache (architecture), or spending bits where they help most and skipping them where they do not (allocation). Each of the three approaches that landed in April attacks one of those axes.
TriAttention: selection without query-side guesswork
The dominant existing approach to KV selection is to estimate which tokens future queries will attend to and evict the rest. SnapKV, H2O, and R-KV all run this play. They look at attention scores from recent post-RoPE queries, take the top-k by accumulated attention, and drop the others. The math is simple and the implementations are mature. The accuracy on long reasoning is also bad. R-KV scores 17.5% on AIME25 under an aggressive cache budget; Full Attention, with no eviction at all, scores 32.9%.
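Stripped of each paper's windowing and pooling details, the play looks like this; a hedged composite of the SnapKV/H2O family, not any one implementation:

```python
import numpy as np

def topk_eviction(attn_weights, keep):
    """Keep the `keep` keys with the highest attention accumulated over a
    window of recent queries; evict the rest. The common skeleton behind
    SnapKV/H2O-style selection (each method differs in window and pooling).

    attn_weights: (num_recent_queries, cache_len), post-softmax.
    """
    importance = attn_weights.sum(axis=0)            # accumulate over the window
    return np.sort(np.argsort(importance)[-keep:])   # preserve positional order

rng = np.random.default_rng(0)
weights = rng.dirichlet(np.ones(4096), size=32)      # 32 recent queries, 4K cache
kept = topk_eviction(weights, keep=1024)             # 4x token-level compression
```

The flaw sits in the input: `attn_weights` come from post-RoPE queries, so the importance estimate inherits whatever rotation phase the sampled window happens to be in.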
The failure mode is structural, not implementational. Rotary Position Embedding rotates Q and K vectors with token position. When you sample recent queries to estimate which keys are important, the queries you sampled have rotated to a specific phase, and they are not representative of all the queries that will eventually attend to a key. Importance estimation built on a moving reference frame is unstable.
The TriAttention authors took a geometric step backward. Before RoPE rotates anything, Q and K vectors in long-reasoning models concentrate around fixed non-zero centers. The concentration is empirical and reproducible across models. Once you observe that the distribution has a center, the rest follows analytically. The dot product between a query at the center and a key at the center decomposes into a trigonometric series indexed by their relative position. The series determines which distances each query prefers, with the centers fixing the parameters. You can score every key in the cache by this trigonometric quantity without ever sampling a representative query, because the geometry of the pre-RoPE space already tells you which keys at which positions a query is likely to want.
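The derivation can be sketched in a few lines. RoPE rotates each 2D coordinate pair by a position-scaled angle, so the dot product between the two rotated centers reduces to a cosine/sine series in the relative position. The centers below are random placeholders; in practice they would be estimated per head from a calibration pass, and the exact scoring and retention rules in the released code may differ from this reading of the mechanism.

```python
import numpy as np

def center_score(q_center, k_center, deltas, base=10000.0):
    """Attention logit between the pre-RoPE Q and K centers as a function of
    relative position delta = m - n, using interleaved RoPE pairing.

    In each 2D plane with frequency theta_i, rotating both centers and taking
    the dot product yields a_i*cos(theta_i*delta) + b_i*sin(theta_i*delta);
    the coefficients are fixed entirely by the two centers.
    """
    d = len(q_center)
    theta = base ** (-np.arange(0, d, 2) / d)       # standard RoPE frequencies
    q1, q2 = q_center[0::2], q_center[1::2]
    k1, k2 = k_center[0::2], k_center[1::2]
    a = q1 * k1 + q2 * k2                           # cosine coefficients
    b = q1 * k2 - q2 * k1                           # sine coefficients
    angles = np.outer(deltas, theta)                # (len(deltas), d // 2)
    return np.cos(angles) @ a + np.sin(angles) @ b

rng = np.random.default_rng(0)
q_bar = rng.normal(0.5, 0.1, size=128)              # placeholder fixed centers
k_bar = rng.normal(0.3, 0.1, size=128)

query_pos, cache_len = 32_768, 32_768
scores = center_score(q_bar, k_bar, query_pos - np.arange(cache_len))
kept = np.sort(np.argsort(scores)[-1024:])          # retain 1,024 of 32,768 keys
```

No query is sampled anywhere in that scoring pass, which is the whole point: the reference frame is the fixed pre-RoPE geometry, not whatever phase the last few queries happened to land on.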
The benchmarks land where the theory predicts. On AIME25 with 32K-token generation budgets, TriAttention matches Full Attention accuracy at 2.5x throughput or 10.7x KV memory reduction. On MATH 500, with only 1,024 tokens kept out of a 32,768-token cache, the model scores 68.4% versus Full Attention’s 69.6%. The gap to existing baselines is wide: TriAttention’s 32.9% against R-KV’s 17.5% on AIME25, at the same cache budget, is a 15.4-point swing.
Code is at github.com/WeianMao/triattention under Apache 2.0, with an MLX port for Apple Silicon already shipping.
LRKV: cutting architectural redundancy nobody had named
The second place compression hides is across attention heads. In standard multi-head attention, every head holds its own full-rank key and value projection. The redundancy is well known. MQA shares K and V across all heads. GQA groups heads and shares within groups. Multi-Latent Attention compresses everything into a single per-token latent and reconstructs heads on the fly. Each of these is a coarse partition of the design space: complete sharing or complete independence at architecture-design time.
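The cache-size arithmetic makes the partition concrete. The dimensions below describe an illustrative 32-layer, 32-head model with head dimension 128 at FP16, not any specific release:

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of K plus V cached per token across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

mha = kv_bytes_per_token(32, 32, 128)   # 524,288 B: every head caches its own K/V
gqa = kv_bytes_per_token(32, 8, 128)    # 131,072 B: 8 KV groups serve 32 heads
mqa = kv_bytes_per_token(32, 1, 128)    #  16,384 B: one K/V serves all heads
```

Between those endpoints the sharing degree is baked in at architecture-design time, which is exactly the gap a continuous parameter could fill.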
Low-Rank Key-Value attention takes the continuous version of that tradeoff. Each layer maintains a shared full-rank KV projection that acts as a global basis. On top of that, each head learns a low-rank residual specific to itself. The cache stores the shared keys and values once per token per layer, plus each head’s low-rank residual coefficients, instead of full-rank keys and values for every head. The continuous parameter is the rank of the residual: rank zero collapses to MQA, full rank recovers full MHA, and the interesting territory is in between.
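A minimal sketch of one such projection, shown for keys (values are analogous). The class name, the factorization of the residual, and the initialization are illustrative assumptions, not the fin.ai code:

```python
import torch
import torch.nn as nn

class LRKVProjection(nn.Module):
    """Shared full-rank basis plus a low-rank residual per head.

    rank=0 collapses to MQA-style full sharing (all heads see only `shared`);
    rank=head_dim recovers the expressivity of independent per-head projections.
    """
    def __init__(self, d_model, n_heads, head_dim, rank):
        super().__init__()
        self.shared = nn.Linear(d_model, head_dim, bias=False)   # cached once per token
        self.down = nn.Linear(d_model, n_heads * rank, bias=False)
        self.up = nn.Parameter(torch.randn(n_heads, rank, head_dim) * 0.02)
        self.n_heads, self.rank = n_heads, rank

    def forward(self, x):                        # x: (batch, seq, d_model)
        b, s, _ = x.shape
        base = self.shared(x)                    # (b, s, head_dim), shared by all heads
        r = self.down(x).view(b, s, self.n_heads, self.rank)
        residual = torch.einsum("bshr,hrd->bshd", r, self.up)
        return base.unsqueeze(2) + residual      # (b, s, n_heads, head_dim)

# Cache cost per token: head_dim shared floats + n_heads * rank residual
# coefficients, versus n_heads * head_dim for full MHA.
proj = LRKVProjection(d_model=2048, n_heads=16, head_dim=128, rank=16)
keys = proj(torch.randn(1, 10, 2048))
print(keys.shape)  # torch.Size([1, 10, 16, 128])
```

At this (arbitrary) rank-16 choice, the per-token key footprint is 128 + 256 = 384 floats against MHA's 2,048, about 19 percent; the paper's 45-to-53-percent figure presumably reflects its own, less aggressive rank choices.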
The empirical results are unusually clean. Across pretrained models from 128M to 6.3B parameters, LRKV achieves the lowest test loss among MHA, MQA, GQA, and MLA, while using only 45 to 53 percent of MHA’s KV cache. It reaches baseline-equivalent quality in 18 to 25 percent fewer training steps. After supervised midtraining, it leads on ARC-Easy, ARC-Challenge, MMLU, GSM8K, and HumanEval. The combination of better quality and smaller cache is the rare result in this space, and it lands because the per-head residuals are doing real work that pure sharing schemes cannot.
The hard limit is that LRKV is an architectural change applied at training time. Existing models trained with standard MHA cannot be retrofitted to LRKV without either training from scratch or a substantial midtraining run. For new model releases, this is the path. For everyone running Llama 3.1 or Qwen 3.5 in production today, LRKV does not help directly. DeepSeek V4’s Hybrid Attention, released April 24, is the highest-profile confirmation that this architectural bet pays off at frontier scale: its Compressed Sparse Attention and Heavily Compressed Attention layers achieve a vendor-reported 10x KV cache reduction versus V3.2 at 1M context, trained from scratch as a 1.6T-parameter MoE. The full mechanism, how CSA and HCA differ from LRKV, and where the architecture degrades past 256K tokens are covered in full elsewhere.
Adaptive bit-width: the allocation axis
The third compression vector is letting the bits-per-coordinate vary by token. Most quantization schemes pick a uniform precision for the whole cache. Adaptive KV-Quant, released in early April, trains a small controller that decides how many bits to allocate to each token based on activation statistics observed at inference time. Tokens with high attention concentration get more bits. Tokens that are unlikely to be attended to get fewer. The total cache budget is fixed. The per-token allocation is learned.
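The controller itself is a learned network, but the allocation problem it solves looks like the greedy sketch below. The importance scores and the bit menu are stand-ins; the real method derives its decisions from activation statistics at inference time, so this is only the shape of the allocation, not the published algorithm.

```python
import numpy as np

def allocate_bits(importance, budget_bits, choices=(2, 4, 8)):
    """Greedy per-token bit allocation under a fixed total budget.

    Every token starts at the lowest precision; the most important tokens
    are then upgraded as far as the remaining budget allows.
    """
    bits = np.full(len(importance), min(choices))
    spent = bits.sum()
    for idx in np.argsort(importance)[::-1]:        # most important tokens first
        for b in sorted(choices, reverse=True):
            if b > bits[idx] and spent + (b - bits[idx]) <= budget_bits:
                spent += b - bits[idx]
                bits[idx] = b
                break
    return bits

rng = np.random.default_rng(0)
importance = rng.pareto(2.0, size=4096)             # a few tokens dominate attention
bits = allocate_bits(importance, budget_bits=4096 * 4)  # same total as uniform 4-bit
print({b: int((bits == b).sum()) for b in (2, 4, 8)})
```

The budget equals a uniform 4-bit cache, but the high-importance tokens end up at 8 bits and the long tail at 2, which is the whole trade.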
The pattern matters more than the headline. On-device LLMs are the natural home for this approach because the device cannot afford to over-provision precision for every token, but it can afford a few tens of kilobytes of controller weights. The controller wraps an underlying quantization backend such as PolarQuant, which means adaptive bit-width is not a competitor to TurboQuant’s surviving stage but a layer that uses it.
The harder question is whether adaptive controllers trained on one model generalize to others. Early benchmarks suggest yes within a model family, but the cross-architecture story is not yet validated.
The hardware path catching up
Quantization, selection, and architecture are software answers. The hardware answer is native low-precision support in the GPU itself. NVIDIA’s Blackwell SM100 and SM120 chips ship with native FP4 multiply-accumulate instructions. SGLang merged a strategy abstraction in early April that lets KV cache live in NVFP4 on those chips, eliminating the dequantize step entirely for attention computation. The implementations are still moving, but the directional bet is that the next-generation cache lives in FP4 with hardware-level support, and the software-side schemes have to adapt to that floor.
This is also where NVIDIA’s KVTC method becomes structurally interesting. As I covered in the original TurboQuant analysis, KVTC achieves 20x compression with a one-time PCA calibration per model, tested across 1.5B to 70B parameters, and integrates into NVIDIA’s Dynamo inference framework. KVTC is not portable in the way TurboQuant tried to be, but for cloud providers running a fixed set of models at scale on NVIDIA hardware, the calibration cost is amortized over millions of inference calls. The combination of NVFP4 hardware support and KVTC’s calibration-based decorrelation is the path of least resistance for the largest deployments.
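The structural shape of KVTC, one-time calibration followed by decorrelate, truncate, and quantize at inference, can be sketched as follows. The PCA fit, the dimension counts, and the symmetric quantizer are illustrative assumptions, not NVIDIA's implementation:

```python
import numpy as np

def calibrate(kv_samples, keep_dims):
    """One-time per-model step: fit a PCA basis on sampled KV activations."""
    mean = kv_samples.mean(axis=0)
    _, _, vt = np.linalg.svd(kv_samples - mean, full_matrices=False)
    return mean, vt[:keep_dims]                        # top principal directions

def compress(kv, mean, basis, bits=4):
    z = (kv - mean) @ basis.T                          # decorrelate and truncate
    scale = max(float(np.abs(z).max()), 1e-8)
    codes = np.round(z / scale * (2 ** (bits - 1) - 1)).astype(np.int8)
    return codes, scale

def decompress(codes, scale, mean, basis, bits=4):
    z = codes.astype(np.float32) * scale / (2 ** (bits - 1) - 1)
    return z @ basis + mean

rng = np.random.default_rng(0)
calib_set = rng.standard_normal((10_000, 128)) @ rng.standard_normal((128, 128))
mean, basis = calibrate(calib_set, keep_dims=32)       # 4x from truncation alone
codes, scale = compress(calib_set[:256], mean, basis)  # 4x more vs FP16 storage
```

The 20x headline comes from pushing both knobs harder than this sketch does, and the calibration pass is exactly the portability cost: the basis is fit per model, which is fine when the same weights serve millions of calls.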
What stacks with what
The cleanest mental model: the four components compress orthogonal axes, and most pairs combine cleanly.
PolarQuant compresses precision per coordinate. TriAttention compresses tokens per cache. LRKV compresses heads per layer. Adaptive bit-width compresses bits per token. Multiplying them is not literal because correlations exist, but the directional reduction is real. PolarQuant 4-bit (4x) plus LRKV (2x) plus TriAttention selection at the 10.7x level lands closer to 80x than to 6x on the workloads where all three apply, which means long-reasoning generation on architectures designed for it.
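As a back-of-envelope, treating the axes as fully independent (which they are not, quite):

```python
polar_quant  = 4.0    # FP16 -> 4-bit per coordinate
lrkv         = 2.0    # ~45-53% of MHA's cache, per the paper
triattention = 10.7   # token selection at the AIME25 operating point
print(polar_quant * lrkv * triattention)   # 85.6x, hence "closer to 80x than to 6x"
```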
The narrower deployment story is that the surviving piece of TurboQuant, PolarQuant rotation, is now a building block rather than a complete answer. Anyone deploying long-context inference today has a much richer toolkit than the 6x compression headline suggested in March. The QJL detour cost the community three weeks of confusion. The methods that replaced it are stronger.
Limitations and what to actually deploy
Three limits to name plainly.
TriAttention is selection-based, which means it drops tokens. For reasoning workloads where most tokens are intermediate scratch and a few carry the load, the tradeoff is excellent. For tasks where every token matters (verbatim recall, long-document summarization with specific quote requirements, legal text retrieval), aggressive selection still costs accuracy that the published benchmarks do not measure.
LRKV is an architectural change applied at training time. The papers that show 45 to 53 percent reduction with lower test loss are pretraining results. Retrofitting LRKV to an existing model trained with MHA via midtraining is plausible but the published evidence is thin. Production deployments that want LRKV’s gains today will need to wait for model releases that ship with the architecture.
Adaptive bit-width controllers are model-specific in their current form. Cross-architecture generalization is an open question. For deployment teams running a single model family at scale, this is fine. For platforms serving heterogeneous models, the operational overhead of training and shipping per-model controllers is not yet justified by the marginal compression gain over a strong fixed-precision baseline.
The pragmatic deployment recipe for May 2026 is unchanged from the conclusion of the QJL post-mortem: PolarQuant rotation at 4 bits per coordinate is the table-stakes baseline. TriAttention sits on top of that for long-reasoning workloads where token selection is acceptable. LRKV is the bet to make for the next model you train, not the model you are running. Adaptive bit-width remains experimental until cross-model generalization improves.
What to watch through the rest of Q2
Three signals to track. First, vLLM and SGLang merging the post-QJL methods. The pull request volume on TurboQuant integrations stalled when the QJL findings landed. The new wave of integrations targets PolarQuant-only paths, with TriAttention and LRKV-aware kernels arriving as separate efforts. Watch the SGLang strategy abstraction for which combinations it canonicalizes.
Second, the ICLR 2026 presentation. The TurboQuant paper is still scheduled despite the community findings, and the authors are likely to address the implementation gap in the talk. Whether Google ships an official reference implementation that matches the community results, or whether the conference version of the paper acknowledges the per-stage analysis, will determine how much of the original framing survives.
Third, the Blackwell rollout. Native FP4 KV cache support changes the calculus for everything above it. If the hardware-level path lands cleanly with KVTC integration, the open question becomes whether software methods like TriAttention and LRKV continue to deliver complementary gains on top of native FP4, or whether they get absorbed into NVIDIA’s Dynamo-resident compression layer.
The KV cache compression frontier is wider, more honest, and more useful than it looked thirty days ago. None of the methods that survived require pretending their key innovation works.
Papers: TriAttention (Mao et al., MIT/NVIDIA/Zhejiang University, arXiv:2604.04921, April 2026), Low-Rank Key-Value Attention (O’Neill et al., fin.ai, arXiv:2601.11471, January 2026). Implementations: WeianMao/triattention. Prior coverage: QJL findings post-mortem, original TurboQuant explainer.