DeepSeek V4’s Hybrid Attention Cuts KV Cache by 10x. Here’s the Architecture.

When DeepSeek dropped a preview of V4-Pro and V4-Flash on April 24, 2026, Bloomberg framed the story in geopolitical terms: a Chinese lab challenging OpenAI and Anthropic, working with Huawei Ascend silicon, raising at a $20 billion valuation. The more interesting story, and the one DeepSeek itself singled out under the name Hybrid Attention Architecture, is mechanical. According to DeepSeek’s own technical report, V4-Pro processes a one-million-token context using just 27% of the per-token inference FLOPs and 10% of the KV cache that DeepSeek-V3.2 required at the same length. V4-Flash pushes those numbers further, to roughly 10% of FLOPs and 7% of the KV cache. These are vendor self-reported figures from the model card and technical report; independent lab verification was not available at the time of writing. The numbers have not been meaningfully contested by the community, but treat them as DeepSeek’s own claims until replication arrives.

The release carries a “Preview” label that is not marketing hedging. DeepSeek has not given a finalization timeline, and the preview designation matters for production decisions: behavior may change, and the company explicitly recommends running workload-specific evaluation before committing. With that framing established, the architectural story is the part worth understanding in depth.

The core decision is to stop treating attention as a single uniform mechanism applied to every layer of the network and instead interleave two complementary attention variants, Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), alongside a small sliding-window branch. Hybrid Attention is a recognition that different layers in a deep transformer want different things from the past, and that paying full attention costs at every layer is wasteful when most of those layers can do their work over a heavily summarized view of the prefix.

The mechanism: two compressors with opposite tradeoffs

Both CSA and HCA begin from the same primitive: a learned token-level compressor that takes m consecutive tokens of the KV cache and replaces them with a single compressed entry. The two attention variants then make opposite tradeoffs from there.
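
As a rough mental model, the compressor behaves like a learned pooling over fixed-size groups of cached keys and values. Below is a minimal PyTorch-style sketch; the module and projection names are invented here for illustration, not taken from DeepSeek's code:

```python
import torch
import torch.nn as nn

class KVCompressor(nn.Module):
    """Sketch of a learned token-level KV compressor: each group of m
    consecutive key/value entries is collapsed into one entry.
    Names and internals are illustrative, not DeepSeek's."""

    def __init__(self, d_head: int, m: int):
        super().__init__()
        self.m = m
        # Concatenate the m entries in a group, then mix them down to one.
        self.proj_k = nn.Linear(m * d_head, d_head, bias=False)
        self.proj_v = nn.Linear(m * d_head, d_head, bias=False)

    def forward(self, k: torch.Tensor, v: torch.Tensor):
        # k, v: [batch, seq_len, d_head]; seq_len assumed divisible by m here.
        b, t, d = k.shape
        k_groups = k.view(b, t // self.m, self.m * d)
        v_groups = v.view(b, t // self.m, self.m * d)
        # Output: [batch, seq_len // m, d_head], one entry per m raw tokens.
        return self.proj_k(k_groups), self.proj_v(v_groups)
```

With m = 4 this plays the role of the CSA-side compressor; with m = 128 it plays the HCA-side one described below.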

Compressed Sparse Attention (CSA) uses a small group size (m = 4 in the released models, giving a 4x compression along the sequence dimension) and then applies DeepSeek Sparse Attention over the compressed stream. A Lightning Indexer, running in FP4 precision and using a learned ReLU-of-dot-product scoring function, ranks the compressed blocks for each query, and the model attends only to the top-k. In V4-Pro that top-k is on the order of 512 compressed entries, equivalent to roughly 2,048 raw tokens. CSA is the precise side of the hybrid: lightly compressed, query-dependent, designed to retrieve specific facts from a wide history.
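
A sketch of that selection step follows, keeping everything in one precision for readability and using invented names; the real Lightning Indexer runs its query-key path in FP4 with its own learned projections:

```python
import torch
import torch.nn.functional as F

def csa_select(q_idx, k_idx, k_comp, v_comp, top_k=512):
    """Sketch of CSA's selection step: a lightweight indexer scores each
    compressed block with a ReLU-of-dot-product, keeps the top-k, and the
    real attention then runs only over the survivors. Illustrative only.

    q_idx:  [n_queries, d_idx]   indexer-side query projections
    k_idx:  [n_blocks,  d_idx]   indexer-side key per compressed block
    k_comp: [n_blocks,  d_head]  compressed keys used by the attention proper
    v_comp: [n_blocks,  d_head]  compressed values used by the attention proper
    """
    # ReLU-of-dot-product scoring, as described for the Lightning Indexer.
    scores = F.relu(q_idx @ k_idx.T)                     # [n_queries, n_blocks]
    top = scores.topk(min(top_k, scores.shape[-1]), dim=-1).indices
    # Gather only the selected compressed blocks for each query.
    k_sel = k_comp[top]                                  # [n_queries, top_k, d_head]
    v_sel = v_comp[top]
    return k_sel, v_sel
```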

Heavily Compressed Attention (HCA) runs the same kind of compressor but at a much higher ratio (m’ = 128, so 128x compression). At a million tokens that turns the prefix into roughly 7,800 compressed entries, short enough that the model can run dense attention over all of them. HCA discards sparse selection entirely. The compression itself does the work, and dense attention over the compressed stream becomes cheap. HCA is the broad side of the hybrid: an aggressively summarized global view, applied densely.
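
The HCA side needs no indexer at all; a sketch of the whole mechanism, again with illustrative names, fits in a few lines:

```python
import torch
import torch.nn.functional as F

def hca_attend(q, k_comp, v_comp, scale):
    """Sketch of HCA: no selection, just dense attention over the heavily
    compressed (128:1) stream. At 1M raw tokens that stream is only ~7,800
    entries, so dense attention stays cheap. Illustrative only."""
    # q: [n_queries, d_head]; k_comp, v_comp: [n_entries, d_head]
    scores = (q @ k_comp.T) * scale           # [n_queries, n_entries]
    weights = F.softmax(scores, dim=-1)
    return weights @ v_comp                   # [n_queries, d_head]
```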

DeepSeek’s Hugging Face write-up is explicit about how these are arranged in V4-Pro’s 61-layer stack: layers 0 and 1 are pure HCA, layers 2 through 60 alternate CSA and HCA, and the multi-token-prediction block at the end runs sliding-window only. V4-Flash uses a similar interleaving but begins its first two layers with pure sliding-window attention. Each attention block in either variant carries a small sliding-window branch over the last 128 uncompressed tokens to preserve fine-grained local dependencies, plus a set of learnable attention sink logits added to the softmax denominator so heads can attend to less than unit mass, a useful out when nothing in the compressed history is actually relevant.
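
A pseudocode rendering of that layout and of the sink mechanism follows. The write-up does not say which variant leads the alternation in layers 2 through 60, so the CSA-first ordering below is an assumption, as are all names:

```python
import torch
import torch.nn.functional as F

def v4_pro_attention_layout(n_layers: int = 61):
    """Sketch of V4-Pro's per-layer attention assignment as described in the
    release write-up: layers 0-1 pure HCA, layers 2-60 alternating CSA and
    HCA, every layer carrying a 128-token sliding-window branch, and the MTP
    block (not listed here) running sliding-window only."""
    layout = []
    for i in range(n_layers):
        if i < 2:
            kind = "HCA"                                  # dense over 128:1 KV
        else:
            kind = "CSA" if (i - 2) % 2 == 0 else "HCA"   # assumed ordering
        layout.append({"layer": i, "global_branch": kind, "local_window": 128})
    return layout

def softmax_with_sink(scores: torch.Tensor, sink_logit: torch.Tensor):
    """Attention-sink variant of softmax: a learnable logit joins the
    denominator, so a head can put mass on 'nothing' when no compressed
    entry is relevant. sink_logit is a scalar tensor in this sketch
    (one per head in the real model)."""
    padded = torch.cat([scores, sink_logit.expand(*scores.shape[:-1], 1)], dim=-1)
    weights = F.softmax(padded, dim=-1)
    return weights[..., :-1]   # drop the sink column; rows now sum to <= 1
```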

The V4 technical report documents a deliberate precision schedule that compounds with the structural compression: most KV entries are stored in FP8, the RoPE dimensions are kept in BF16, the Lightning Indexer’s QK path runs in FP4 with quantization-aware training, and MoE expert weights are FP4 throughout. According to the SGLang and LMSYS day-0 deployment write-up, every layer of V4 combines a 128-token sliding window with either C4 (top-512 sparse attention over 4:1 compressed KV) or C128 (dense attention over 128:1 compressed KV). The result is three coexisting KV pools per request (raw, lightly compressed, and heavily compressed), plus a state pool for in-progress compression. SGLang had to invent a new prefix-cache mechanism it calls ShadowRadix to keep them coherent across prefill, decode, and speculative decoding.
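
Counting entries per pool makes the layout concrete. The sketch below sticks to entry counts, since DeepSeek does not break out per-entry byte widths for each pool; the numbers in the comments assume a 1M-token prefix:

```python
def kv_pool_entries(context_tokens: int = 1_000_000,
                    window: int = 128, m_csa: int = 4, m_hca: int = 128):
    """Entry counts for the three KV pools described in the SGLang/LMSYS
    write-up, per attention layer, at a given prefix length. A CSA (C4)
    layer reads the raw window plus the 4:1 pool; an HCA (C128) layer reads
    the raw window plus the 128:1 pool. Illustrative arithmetic only."""
    return {
        "raw_window": window,                        # last 128 tokens, uncompressed
        "c4_compressed": context_tokens // m_csa,    # 250,000 entries at 1M tokens
        "c128_compressed": context_tokens // m_hca,  # 7,812 entries at 1M tokens
    }

# At a 1M-token prefix:
# {'raw_window': 128, 'c4_compressed': 250000, 'c128_compressed': 7812}
```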

Why this is different from V3’s MLA

The natural comparison is to DeepSeek’s own previous attention story. V2 introduced and V3 inherited Multi-Head Latent Attention (MLA), which compresses keys and values into a low-rank latent vector before they hit the cache and projects them back up at use time. MLA gave DeepSeek a KV cache roughly 7x smaller than a vanilla MHA baseline at comparable quality, and the V2 ablations showed it outperforming both MHA and GQA. V3.2-Exp then layered DeepSeek Sparse Attention on top, using a Lightning Indexer to pick a top-k of about 2,048 historical tokens per query and reducing attention complexity from O(L²) to O(L·k).

V4’s Hybrid Attention is a different category of move. MLA compresses each token’s K and V along the hidden dimension. DSA selects which tokens to attend to along the sequence dimension. CSA and HCA compress along the sequence dimension itself, collapsing m or m’ tokens into one entry, then layer either DSA-style sparse selection (CSA) or dense compressed attention (HCA) on top. The mental model the technical report encourages is a coarse-to-fine memory: HCA gives a dense, blurry summary of the whole prefix; CSA gives a sharp lookup over a top-k of moderately compressed blocks; the sliding window keeps the last 128 tokens at full resolution. Putting all three on every layer would be wasteful, so the layers specialize and interleave. The win against MLA is multiplicative: MLA at FP8 plus 4x to 128x sequence compression plus FP4 indexers compounds into the 10x KV-cache reduction claimed against V3.2.

Architectural compression vs. selection-side compression

Hybrid Attention is an architectural compression technique, baked into the model and trained from scratch. Most of the recent work at the top of the literature attacks the problem from the selection side instead, post-hoc, on a model that was already trained with full attention. The full landscape of selection-side methods as of April 2026 covers TriAttention, LRKV, adaptive bit-width, and more.

TriAttention (arXiv 2604.04921, MIT/NVIDIA/Zhejiang, April 6) moves scoring to the pre-RoPE space, where Q and K vectors concentrate around fixed centers, and uses a trigonometric-series scoring function to retain only top-scoring keys. Its published numbers: 2.5x higher throughput on AIME25 and 10.7x KV-cache reduction, both at matched accuracy and without retraining the underlying model.

LoRC (NeurIPS 2024) approximates K and V weight matrices via low-rank decomposition, plug-in style, no retraining required. GQA and MQA share KV heads across queries. Llama 3 uses GQA with 8 KV heads for 32 query heads. All of these are valid attacks on the same memory wall, and they stack: a model could use GQA, FP8 KV quantization, and TriAttention selection simultaneously.

What V4 does that none of the post-hoc methods can do is buy an order of magnitude of headroom before the selection algorithm runs. By compressing the KV cache 4x or 128x along the sequence dimension at training time, V4 turns 1M tokens into either 250K or 7,800 entries before the indexer ever sees them. CSA’s top-k of 512 then operates on a 4x-shorter haystack than DSA in V3.2. The two paradigms are complementary: TriAttention and similar selection methods can be applied to V4’s compressed streams just as easily as to a raw KV cache. V4-Pro running through a TriAttention-augmented vLLM kernel is not a hypothetical but an obvious near-term composition.

Training a 1.6-trillion-parameter MoE with this attention layout

Hybrid Attention does not stand alone in the V4 technical report. Training a 1.6-trillion-parameter MoE backbone with this attention layout required two further innovations.

Manifold-Constrained Hyper-Connections (mHC) replace the residual stream with four parallel streams mixed by a learned matrix at every layer. Plain Hyper-Connections blow up at depth: DeepSeek’s own 27B experiments saw signal amplification exceeding 3,000x before the run diverged. mHC fixes this by constraining the residual mixing matrix to lie on the Birkhoff polytope, the manifold of doubly stochastic matrices where every row and column sums to one and every entry is non-negative. The constraint bounds the spectral norm at 1 and prevents amplification in either the forward or backward pass, enforced via Sinkhorn-Knopp with up to 20 normalization iterations.
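
Sinkhorn-Knopp itself is a short loop: make the matrix positive, then alternately normalize rows and columns until it is approximately doubly stochastic. A sketch, with names of my own choosing, for the 4x4 mixing matrix over the four residual streams:

```python
import torch

def sinkhorn_knopp(logits: torch.Tensor, n_iters: int = 20, eps: float = 1e-8):
    """Sketch of the Sinkhorn-Knopp normalization used to keep mHC's mixing
    matrix on the Birkhoff polytope (non-negative entries, rows and columns
    each summing to 1). Alternating row/column normalization of a positive
    matrix converges toward a doubly stochastic one; the report caps it at
    roughly 20 iterations. Function and argument names are illustrative."""
    m = torch.exp(logits)                               # positivity via exponentiation
    for _ in range(n_iters):
        m = m / (m.sum(dim=-1, keepdim=True) + eps)     # normalize rows
        m = m / (m.sum(dim=-2, keepdim=True) + eps)     # normalize columns
    return m

# For mHC the input is a 4x4 matrix of mixing logits, one per pair of streams.
mix = sinkhorn_knopp(torch.randn(4, 4))
```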

The Muon optimizer replaces AdamW for most parameters, orthogonalizing the gradient update matrix using Newton-Schulz iterations so no single direction dominates. AdamW is retained only for embeddings, prediction heads, RMSNorm weights, static biases, and mHC gating factors. Two further stability tricks kept the loss curve clean: Anticipatory Routing, which computes routing indices at step t using parameters from step t−Δ to break the feedback loop where bad routing reinforces outliers, and SwiGLU Clamping, which caps the linear component of the activation to the range [−10, 10]. Pre-training ran on more than 32T tokens for V4-Flash and 33T for V4-Pro, with sequence length ramped from 4K to 16K to 64K to 1M.
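
Both mechanisms are easy to sketch. The orthogonalization below uses the classic cubic Newton-Schulz iteration; Muon's public reference implementation uses a tuned quintic variant, and DeepSeek has not released its exact recipe, so treat this as a shape-level illustration:

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, n_iters: int = 5):
    """Sketch of the Newton-Schulz style orthogonalization at the heart of
    Muon: iterate the gradient matrix toward its nearest (semi-)orthogonal
    matrix so the update has roughly uniform singular values and no single
    direction dominates. Illustrative, not DeepSeek's code."""
    x = g / (g.norm() + 1e-7)                # Frobenius-normalize so the iteration converges
    for _ in range(n_iters):
        x = 1.5 * x - 0.5 * x @ x.T @ x      # cubic Newton-Schulz step
    return x

def swiglu_clamp(linear_branch: torch.Tensor, bound: float = 10.0):
    """SwiGLU Clamping as described in the report: cap the linear component
    of the SwiGLU activation to [-10, 10] so activations cannot blow up."""
    return linear_branch.clamp(-bound, bound)
```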

What the benchmarks show

All benchmark numbers below are from DeepSeek’s own technical report and model card unless noted otherwise. Independent replication of the full benchmark suite had not been published at the time of writing.

V4-Pro-Max, the maximum reasoning effort mode, posts a Codeforces ELO of 3,206, the highest recorded for any model at release according to DeepSeek, ahead of the 3,168 posted by the nearest GPT-5 series model (attribution of exact GPT version varies across reviewers; treat the gap as meaningful but the specific model label as provisional). On LiveCodeBench it leads at 93.5%. On SWE-bench Verified it scores 80.6%, two-tenths of a point behind Claude Opus 4.6 at 80.8%. On GPQA Diamond it reaches 90.1%, independently confirmed via the public GPQA leaderboard.

These numbers place V4 competitively within the current open-weight frontier and within reach of models one generation back in the closed frontier. Base model gains are notable: V4-Pro-Base posts HumanEval 76.8% versus V3.2-Base’s 62.8%, and SimpleQA-Verified 55.2% versus 28.3%, a 26.9-point jump that DeepSeek attributes to improved training data and the new architecture. GLM-5.1, the 744B MoE released in April, scored 77.8 on SWE-bench Verified, in the same open-weight tier. For context on what the current closed frontier looks like: Claude Opus 4.7 scores 87.6% on SWE-bench Verified, and GPT-5.5 approximately 82.6%, both meaningfully ahead.

The long-context picture is where the architecture’s tradeoffs show clearly. On MRCR 8-needle at 1M tokens, V4-Pro scores 83.5%, trailing Claude Opus 4.6 at 92.9%. On CorpusQA 1M, V4 scores 62.0% to Opus 4.6’s 71.7%. The Hugging Face release write-up is honest: performance on the MRCR retrieval task holds strong through 256K tokens and degrades at 1M. Bloomberg Intelligence’s April 27 segment landed on a similar read: efficient and competitive, but not the lead-narrowing event some had anticipated.

Limitations: what compression costs

The V4 paper itself flags several open issues. The mHC and SwiGLU Clamping stability tricks are reported as empirical, without theoretical grounding, a gap DeepSeek acknowledges. Several evaluations were run on internal harnesses, and some comparison-table cells were left blank because rival APIs failed to respond. The model ships as a preview with no stated finalization timeline.

The deeper limitation is structural. Aggressive KV compression is cheap precisely because most tokens get summarized, and rare-but-critical specific facts can be summarized away. Multiple independent reviewers reproduced this pattern: the headline 1M-context number is usable for many workloads but degrades unpredictably at the high end. BSWEN’s deployment write-up identifies three concrete operational limits: per-token compression overhead (real but small), top-k tuning that must be calibrated per workload (code analysis needs a higher k than summarization tasks), and implementation complexity because most inference frameworks needed substantial rework to support the three-pool KV layout.

This is also why the architectural-vs-selection debate matters for production agent memory architecture. A million-token context powered by Hybrid Attention is genuinely available to agent systems in a way that prior architectures made economically prohibitive. But the 256K reliability cliff means teams building long-running agents need to test their specific retrieval pattern against compressed contexts, not assume a million-token window behaves like a 128K window scaled up.
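
A minimal version of that test is a needle probe swept across context lengths and insertion depths. The sketch below assumes a caller-supplied call_model client (it is not a DeepSeek API), measures length in characters rather than tokens for simplicity, and is a starting point rather than a benchmark harness:

```python
def needle_probe(call_model, filler_doc: str, needle: str, question: str,
                 expected: str,
                 context_lengths=(128_000, 256_000, 512_000, 1_000_000),
                 depths=(0.1, 0.5, 0.9), trials=3):
    """Minimal needle-in-a-haystack probe for checking whether a specific
    retrieval pattern survives compressed long contexts. call_model is a
    placeholder for whatever client wraps the model; lengths are measured
    in characters here. Returns hit rate per (length, depth) cell."""
    results = {}
    for length in context_lengths:
        for depth in depths:
            hits = 0
            for _ in range(trials):
                reps = max(1, length // max(len(filler_doc), 1))
                body = (filler_doc * reps)[:length]
                pos = int(len(body) * depth)
                prompt = (body[:pos] + "\n" + needle + "\n" + body[pos:]
                          + "\n\n" + question)
                answer = call_model(prompt)
                hits += int(expected.lower() in answer.lower())
            results[(length, depth)] = hits / trials
    return results
```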

What happens next

The release carries two facts that are easy to underweight. First, both V4-Pro and V4-Flash ship under the MIT license, meaning commercial use, self-hosting, and fine-tuning without contacting DeepSeek. Second, the API pricing at launch is $1.74 per million input tokens and $3.48 per million output for V4-Pro, versus roughly $25 per million output for Claude Opus 4.7 and $30 for GPT-5.5. The benchmark gap between V4-Pro and the closed frontier is real and documented above. The pricing gap is also real. For cost-sensitive workloads where V4-Pro’s quality is sufficient, these numbers shift the decision materially.

The next generation of open-weight models will not be debating whether to add a selection-side compression on top of vanilla attention. The debate has shifted to which mix of CSA-style sparse compression, HCA-style dense compression, and sliding-window locality to interleave across layers, and how to compose those choices with the post-hoc compression methods that will continue evolving in parallel. V4 is the first open-weight model at frontier scale to report that training-time sequence compression at 128x can coexist with competitive benchmark performance. If that holds under independent long-context evaluation at depth, the KV cache memory wall that has defined long-context pricing and latency for three years starts to look like an engineering problem with a known class of solutions rather than a fundamental limit.

DeepSeek’s technical report is titled “DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence” and published alongside the model weights on Hugging Face at deepseek-ai/DeepSeek-V4-Pro under the MIT license. The LMSYS day-0 deployment write-up documenting ShadowRadix and the three-pool KV layout is at lmsys.org. The official model card for DeepSeek V4 is at deepseek.com.
