Inference Optimization – My Written Word

Speculative Decoding: How LLMs Generate 3x Faster

May 18, 2026

Speculative decoding achieves 3-4x LLM speedup with zero output quality loss. The math proof, EAGLE-2’s 4.26x result, and when it does not help.

DeepSeek V4’s Hybrid Attention Cuts KV Cache by 10x. Here’s the Architecture.

May 2, 2026

DeepSeek V4-Pro processes one million tokens using 10% of the KV cache V3.2 needed. The mechanism is Hybrid Attention: two complementary compressors interleaved across 61 layers. Here’s how…

30 Days After QJL: What’s Actually Compressing the KV Cache

May 2, 2026

After QJL failed, three approaches own the KV cache frontier: TriAttention’s pre-RoPE selection, LRKV architectural compression, and adaptive bit-width.

Darkbloom Has 8 Security Layers, Not 4: What the Press Missed

April 18, 2026

Eigen Labs launched Darkbloom on April 15 as a decentralized inference network routing requests to idle Apple Silicon Macs. Every outlet has covered the four-layer privacy architecture. The…

Every Grok 4.20 Explainer Named the Four Agents. xAI’s Documentation Names Zero of Them.

April 9, 2026

xAI shipped Grok 4.20 multi-agent in February 2026. Every explainer published since then describes four named agents called Grok, Harper, Benjamin, and Lucas debating in parliament. Those names…

ASML Is the Only Company That Can Make AI Chips Possible. Its Next Machine Costs $400 Million.

March 27, 2026

The current generation of ASML’s EUV machines is approaching the physical limit of what it can print. The High-NA EUV successor, at $400 million per unit, is now…

70 Million TB/s: The Three-Lever Mechanism Driving AI’s Memory Bandwidth Growth

March 27, 2026

NVIDIA’s B200 delivers 8 TB/s of HBM3e memory bandwidth per chip. Aggregate AI cluster bandwidth exceeds 70 million TB/s. Memory bandwidth, not compute FLOPS, is the bottleneck for…

How Google TurboQuant Compresses LLM Memory by 6x With Zero Accuracy Loss

March 26, 2026

Google Research published TurboQuant on March 25, 2026: a KV cache compression algorithm that reduces LLM inference memory by 6x at 3-bit precision with zero accuracy loss and…
