Speculative Decoding: How LLMs Generate 3x Faster

Every token a large language model generates requires one full forward pass through all of the model’s parameters, which number in the tens to hundreds of billions. On H100-class hardware, a 70B model runs at roughly 40-60 tokens per second for an interactive user, fast enough to read but far too slow for latency-sensitive applications like real-time code completion or multi-step agents that chain dozens of model calls together.

Speculative decoding solves this without touching model weights, without quantization, and without any sacrifice in output quality. The technique, published independently by Leviathan, Kalman, and Matias at Google Research in 2023 and by Chen, Borgeaud, Irving, Lespiau, Sifre, and Jumper at DeepMind the same year, achieves 2-4x latency reductions that compound with other inference optimizations.

Understanding why it works, including the rejection sampling proof that guarantees identical output distributions, is now foundational knowledge for any engineer building production systems on top of large language models.

The Core Problem: Sequential Generation Is Memory-Bound, Not Compute-Bound

The common explanation for why LLM inference is slow blames model size. The more precise diagnosis is that autoregressive generation is memory-bandwidth-bound, not compute-bound.

Each decoding step transfers the entire model’s parameters from GPU high-bandwidth memory (HBM) to on-chip cache. A 70B model in fp16 occupies roughly 140GB of parameters. Every single token requires loading all of that into cache, a memory transfer that consumes nearly the same time whether the model is completing a trivial phrase or solving a complex reasoning problem. GPU compute units sit mostly idle during this transfer.
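The bandwidth argument can be checked with back-of-the-envelope arithmetic. The sketch below is only a rough lower bound: it assumes a nominal HBM bandwidth of 3.35 TB/s (an illustrative figure for an H100-class device, not taken from this article) and ignores KV cache reads, activations, and kernel overhead. The function name is hypothetical.

```python
def min_seconds_per_token(n_params: float, bytes_per_param: float,
                          hbm_bytes_per_second: float) -> float:
    """Lower bound on decode latency: every weight is read once per token."""
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / hbm_bytes_per_second

# Assumed figures: 70B parameters, fp16 (2 bytes each), 3.35e12 bytes/s of bandwidth.
latency = min_seconds_per_token(70e9, 2, 3.35e12)
print(f"{latency * 1e3:.1f} ms per token, at most {1 / latency:.0f} tokens/s")
```

Under these assumptions the bound is about 42 ms per token for one device’s worth of bandwidth, and it is the same for an easy token as for a hard one. Sharding the weights across several GPUs multiplies the available bandwidth, which is how deployed systems exceed the single-device figure.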

This is the key structural insight: GPU compute is chronically underutilized during autoregressive decoding. Speculative decoding exploits that idle capacity by making multiple tokens available for parallel verification in a single forward pass.

The Draft-Verify Architecture

The speculative decoding loop runs two phases per iteration.

In the draft phase, a small fast model (typically 7B parameters or fewer) generates K speculative tokens autoregressively. The draft model is cheap to run because it is small. Its job is not to be right about every token but to be right often enough to earn significant acceptance rates.

In the verify phase, the target model runs a single forward pass that processes all K draft tokens simultaneously, treating them as a sequence rather than generating them one at a time. This looks architecturally identical to the prefill phase of normal LLM inference, and the GPU processes all K tokens in parallel, exploiting the compute capacity that autoregressive decoding wastes.

If the target model agrees with the draft on the first k tokens (where k is between 0 and K), those k tokens are accepted and appended. When the target model rejects a draft token at position j, it substitutes its own sample for that position and discards the subsequent draft tokens. When all K draft tokens are accepted, the same verification pass also yields the target model’s own token for the next position. Either way, each iteration produces at least one token, and in high-acceptance-rate scenarios it consistently produces four to five.
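The loop is easier to follow in code. The following is a minimal sketch of one draft-verify iteration under greedy decoding, where "agreement" simply means the two models pick the same token. The models are toy stand-in functions, and all names (speculative_step, target_model, draft_model) are hypothetical rather than any framework’s API.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function (toy model)

def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """One draft-verify iteration under greedy decoding; returns the new tokens."""
    # Draft phase: the small model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # Verify phase: a real system scores all k positions in one target forward
    # pass; this toy version calls the target once per position for clarity.
    accepted: List[int] = []
    for token in proposed:
        expected = target(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch, drop the rest
            return accepted
        accepted.append(token)
    # Every draft token matched, so the target also contributes one more token.
    accepted.append(target(prefix + accepted))
    return accepted

# Toy models: the target counts upward mod 10; the draft errs after seeing a 3.
def target_model(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10

def draft_model(seq: List[int]) -> int:
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10

print(speculative_step(target_model, draft_model, [0], k=4))
print(speculative_step(target_model, draft_model, [4], k=2))
```

In the first call the draft is wrong at its fourth token, so three draft tokens are kept and the target supplies the fourth. In the second call both draft tokens match and the target adds a third. In both cases the result is exactly what the target model would have produced on its own.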

The expected number of tokens produced per iteration is called the average acceptance length (tau). If tau equals 3.5, speculative decoding needs roughly 1/3.5 as many full target-model passes as autoregressive decoding. That ratio sets the upper bound on latency reduction; the realized gain is somewhat lower because the draft phase has its own cost.
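A simple model connects tau to the per-token acceptance rate. The sketch below assumes, as a simplification, that each draft token is accepted independently with the same probability alpha. In that case the expected number of tokens per iteration is the geometric sum 1 + alpha + ... + alpha^K, which counts the token the target always contributes. The function and parameter names are illustrative.

```python
def expected_tokens_per_iteration(alpha: float, k: int) -> float:
    """Expected tokens per iteration, assuming i.i.d. acceptance probability alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be a probability")
    if alpha == 1.0:
        return float(k + 1)  # every draft token accepted, plus the target's token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

for alpha in (0.6, 0.8, 0.9):
    print(alpha, round(expected_tokens_per_iteration(alpha, k=5), 2))
```

With alpha = 0.8 and K = 5 this gives about 3.7 tokens per iteration. Real acceptance rates vary with position and context, so measured tau will differ from this idealized figure.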

The Rejection Sampling Math: Why Output Quality Is Preserved

The non-obvious component of speculative decoding is the sampling correction step. Accepting tokens from a draft model’s distribution could bias output toward that smaller model. The rejection sampling algorithm prevents this precisely.

Let p(x) denote the target model’s probability distribution over vocabulary tokens at a given position, and q(x) the draft model’s distribution. For each draft token x drawn from q, the algorithm accepts it with probability:

min(1, p(x) / q(x))

When a token is rejected, the algorithm samples a corrected replacement from:

normalize(max(0, p(x) - q(x)))

Leviathan et al. and Chen et al. both prove that this procedure produces samples distributed exactly as p(x), the target model’s distribution, regardless of what q(x) looks like. Crucially, this is not an approximation. When draft and target agree perfectly, all draft tokens are accepted with probability 1. When they disagree completely, the algorithm falls back to exactly the target model’s output. The proof holds for any draft model and any target model over any shared vocabulary.

This mathematical guarantee is what makes speculative decoding production-safe. It is not model distillation, not a lower-quality approximation, and not model compression. The output distribution is provably identical to running the target model alone.
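The accept-or-resample rule is short enough to test empirically. The following sketch applies it to a single position with two small hand-written distributions (p and q here are toy values, and the function names are hypothetical), then checks that the sampled frequencies track p even though every proposal comes from q.

```python
import random
from typing import List, Sequence

def residual_distribution(p: Sequence[float], q: Sequence[float]) -> List[float]:
    """normalize(max(0, p - q)): mass the target wants beyond what the draft gave."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    # total is zero only when p equals q, and then no token is ever rejected.
    return [d / total for d in diff] if total > 0 else list(p)

def speculative_sample(p: Sequence[float], q: Sequence[float],
                       rng: random.Random) -> int:
    """Return one token index distributed according to p, proposing from q."""
    vocab = range(len(p))
    x = rng.choices(vocab, weights=q)[0]        # draft proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):    # accept with prob min(1, p/q)
        return x
    return rng.choices(vocab, weights=residual_distribution(p, q))[0]

def empirical_distribution(p: Sequence[float], q: Sequence[float],
                           n_samples: int, seed: int = 0) -> List[float]:
    """Sample repeatedly and return the observed frequency of each token."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    for _ in range(n_samples):
        counts[speculative_sample(p, q, rng)] += 1
    return [c / n_samples for c in counts]

p = [0.6, 0.3, 0.1]  # toy target distribution
q = [0.2, 0.5, 0.3]  # toy draft distribution, deliberately mismatched
print(empirical_distribution(p, q, 100_000))  # frequencies should be close to p
```

In this example the draft under-weights token 0, so every rejection resamples token 0, which is exactly the correction needed to restore the target’s probabilities.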

EAGLE: Feature-Level Drafting

The original speculative decoding papers use a separately trained, smaller version of the target model as the draft model. This works but has a fundamental limitation: the draft model predicts each token from scratch, starting from the same raw token inputs as the target model and reusing none of the target’s intermediate computation.

EAGLE, published by Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang (Peking University and Microsoft Research, 2024), rethought the draft model by observing that the target LLM’s second-to-last layer features carry far more predictive information about the next token than raw input embeddings. EAGLE trains a lightweight autoregressive head that predicts future feature vectors directly, reasoning in the target model’s representation space rather than from input tokens.

This change raises acceptance rates substantially. The draft model makes fewer errors because it works from richer, higher-level representations. EAGLE achieved average acceptance lengths of three to four tokens across Vicuna and LLaMA-2 series evaluations, compared to two to three for standard speculative sampling with size-comparable draft models.

EAGLE-2: Context-Dependent Draft Trees

EAGLE used a static draft tree: the same K speculative tokens were allocated in the same structure regardless of generation context. This assumption has a subtle flaw that the follow-up work addressed directly.

Acceptance rates for draft tokens are not only position-dependent but strongly context-dependent, as the EAGLE-2 paper (Li, Wei, Zhang, Zhang, EMNLP 2024) demonstrates experimentally. When the target model completes a deterministic phrase (a common code pattern, a well-known mathematical identity, a numerical sequence), acceptance rates push toward 100%. When the model generates genuinely uncertain content, acceptance rates fall below 40%.

EAGLE-2 exploits this by dynamically sizing the draft tree based on the draft model’s own confidence scores. The paper shows that EAGLE’s draft model is well-calibrated: its confidence approximates actual acceptance rates with small errors. EAGLE-2 uses this calibration to expand the draft tree during high-confidence contexts and shrink it when uncertainty is high, avoiding wasted compute on branches unlikely to be accepted.
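One way to picture confidence-driven tree shaping is to score each candidate node by the product of draft confidences along its path and spend a fixed verification budget on the highest-scoring nodes. The sketch below illustrates that idea only; it is not the EAGLE-2 implementation, and the DraftNode type, function names, and example confidences are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DraftNode:
    token: str
    confidence: float       # draft model's probability for this token
    parent: Optional[int]   # index of the parent node, None directly under the root

def path_scores(nodes: List[DraftNode]) -> List[float]:
    """Score each node by the product of confidences from the root down to it."""
    scores: List[float] = []
    for node in nodes:  # assumes parents are listed before their children
        parent_score = 1.0 if node.parent is None else scores[node.parent]
        scores.append(parent_score * node.confidence)
    return scores

def select_nodes(nodes: List[DraftNode], budget: int) -> List[int]:
    """Keep the `budget` nodes most likely to be accepted."""
    scores = path_scores(nodes)
    # Ties go to the lower index, so a parent is never dropped before its child.
    ranked = sorted(range(len(nodes)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:budget])

tree = [
    DraftNode("the", 0.90, None),
    DraftNode("a",   0.10, None),
    DraftNode("cat", 0.80, 0),
    DraftNode("dog", 0.15, 0),
    DraftNode("cat", 0.50, 1),
]
print(select_nodes(tree, budget=3))
```

Because a child’s score can never exceed its parent’s, the selected nodes always form a connected subtree. Here the budget goes to the confident "the" branch, and the low-confidence "a" branch is left unverified.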

Across evaluations on three LLM series (Vicuna, LLaMA-2, LLaMA-3) and six generation tasks, EAGLE-2 achieved speedup ratios of 3.05x to 4.26x. That is 20 to 40% faster than EAGLE-1, which was itself faster than standard speculative sampling.

Self-Speculative Decoding: No Separate Draft Model Required

A significant practical barrier to deploying speculative decoding is the draft model itself. You need a compatible smaller model, which requires additional storage, memory management, and operational complexity. Self-speculative decoding removes this requirement.

The insight is that many LLMs produce reasonable predictions long before the final transformer layer. Self-speculative decoding uses the target model’s own early layers as the draft model, generating speculative tokens via early exit from the forward pass, then verifying with the full model in the standard way.

Meta AI’s LayerSkip (2024) formalized this approach by training models with early-exit regularization, enabling 1.3-2.5x speedups without a separate draft model. The tradeoff is that self-speculative methods achieve lower acceptance rates than EAGLE-class approaches, because early-exit distributions differ more from final-layer distributions than a purpose-trained draft model would. For teams unable to deploy an additional draft model, self-speculative decoding is the practical path to inference acceleration with a single model binary.

Verification as Prefill: The GPU Utilization Insight

The mechanism that makes speculative decoding computationally efficient is worth understanding precisely. Standard prefill (processing the input prompt before generation begins) is compute-bound on modern hardware, because the batch of prompt tokens can saturate GPU compute units. Autoregressive decoding is memory-bandwidth-bound: it processes one token at a time and underutilizes compute. Speculative decoding converts the verification step into a prefill-like operation where all K draft tokens are processed in a single parallel forward pass.

The verification step therefore does far more arithmetic per byte of weights loaded than a standard decoding step: the parameters are read from memory once and reused across all K positions. The GPU runs closer to peak utilization during verification than during standard decoding. This is why the throughput gains are real rather than illusory: the hardware is doing more useful compute per unit of wall time.

Xiaoxuan Liu’s 2025 UC Berkeley PhD thesis on efficient LLM systems with speculative decoding formalizes this framing, recharacterizing speculative decoding as a verification efficiency problem rather than a drafting problem, and identifying adaptive and selective verification as the primary surface for future improvement.

Apple’s Mirror-SD: Breaking the Latency-Acceptance Tradeoff

Apple published Mirror Speculative Decoding (Bhendawade, Nishu, Kundu, Bartels, Cho, Belousova, December 2025) to address a constraint that applies to all existing approaches: increasing draft size K raises acceptance rates but also adds latency overhead in the autoregressive draft phase. EAGLE-2 partially addresses this through dynamic tree sizing, but the tradeoff remains.

Mirror-SD introduces Principled Coarse-Graining (PCG), which verifies proposals at a coarser granularity. Rather than verifying tokens individually, PCG groups draft tokens and uses the target model’s distribution over groups to accept or reject cohesively. This enables larger effective draft sizes without proportional overhead in the draft phase, breaking the latency-acceptance frontier that earlier methods traced.

The research was validated on Apple Silicon, reflecting practical interest in speculative decoding for on-device inference, a setting where memory bandwidth constraints are more severe than on data center GPUs, making the architectural benefit of speculative decoding correspondingly larger.

Choosing the Right Draft Model

The compatibility requirement is the binding constraint in draft model selection. Draft and target models must share the same tokenizer vocabulary, so that token indices correspond one-to-one. In practice, this confines speculative decoding to within-family pairings: a LLaMA-3-8B model drafting for a LLaMA-3-70B target, or a Vicuna-7B drafting for a Vicuna-33B target.

The parameter ratio matters too. A draft model that is too large adds latency overhead in the draft phase that consumes the gains from fewer target model calls. A draft model that is too small produces low acceptance rates. Empirically, a 10:1 to 15:1 parameter ratio works well across most workloads. A 7B draft for a 70B target is the canonical configuration.
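The tension between draft size and acceptance rate can be put into a rough cost model. The sketch below assumes that one target forward pass costs one unit, that each draft step costs a fixed fraction of that (cost_ratio, a hypothetical parameter that only loosely follows the parameter ratio), and that verification costs the same as a single decoding step. It estimates speedup as tokens produced per iteration divided by the cost of that iteration.

```python
def estimated_speedup(tau: float, k: int, cost_ratio: float) -> float:
    """Approximate speedup over plain autoregressive decoding.

    tau:        average tokens produced per draft-verify iteration
    k:          draft tokens proposed per iteration
    cost_ratio: cost of one draft step relative to one target pass (assumed)
    """
    iteration_cost = k * cost_ratio + 1.0  # k draft steps plus one target pass
    return tau / iteration_cost

# Same acceptance length, two different draft sizes (illustrative numbers).
print(round(estimated_speedup(tau=3.5, k=5, cost_ratio=0.1), 2))  # small draft
print(round(estimated_speedup(tau=3.5, k=5, cost_ratio=0.5), 2))  # oversized draft
```

With a draft step costing a tenth of a target pass, a tau of 3.5 translates to roughly 2.3x. If the draft step costs half a target pass, the same tau yields no speedup at all, which is the overhead the paragraph above warns about.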

For teams using EAGLE-style feature-level drafting, the draft head is far smaller than a standard draft model, often just one or two transformer layers trained on top of the target model’s existing representations. This is why EAGLE-class methods consistently outperform standard speculative sampling in published benchmarks: the draft head benefits from the same representations that the target model uses, rather than learning independently from scratch.

Evaluating Your Workload Before Deployment

The most common mistake when adopting speculative decoding is using published benchmark speedup numbers without measuring acceptance rates on a representative production sample. Average acceptance length tau is strongly task-dependent. The same model pair might achieve tau = 4.8 on code completion tasks and tau = 1.9 on open-ended creative generation. Published 3-4x speedup figures typically come from mixed benchmarks that average these cases, and may substantially overstate gains for workloads weighted toward high-uncertainty generation.

The correct evaluation sequence: collect a representative sample of production inputs, run speculative decoding with the candidate draft model, measure average acceptance length, compare wall-clock latency at your target batch size, and benchmark against alternative optimizations including quantization and batching changes.

vLLM logs draft acceptance rates per request, which makes this analysis straightforward. A practical threshold: if tau consistently falls below 2.0 on your production distribution, the overhead of managing the draft model likely outweighs the latency savings. At tau above 3.0, speculative decoding outperforms most alternative single-optimization approaches for interactive latency reduction.
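Whatever the source of the measurements, the analysis reduces to averaging the number of tokens produced per draft-verify iteration and comparing the result with the thresholds above. The sketch below assumes those per-iteration counts have already been extracted into a list. The input format, sample values, and function names are hypothetical and do not reflect any framework’s log format.

```python
from statistics import mean
from typing import Iterable

def average_acceptance_length(tokens_per_iteration: Iterable[int]) -> float:
    """tau: mean number of tokens produced per draft-verify iteration."""
    counts = list(tokens_per_iteration)
    if not counts:
        raise ValueError("need at least one draft-verify iteration")
    return float(mean(counts))

def assess(tau: float) -> str:
    """Apply the rule-of-thumb thresholds of 2.0 and 3.0."""
    if tau < 2.0:
        return "likely not worth the draft-model overhead"
    if tau > 3.0:
        return "strong candidate"
    return "borderline: compare wall-clock latency directly"

sample = [1, 4, 5, 2, 3, 5, 1, 4]  # hypothetical counts from production traffic
tau = average_acceptance_length(sample)
print(f"tau = {tau:.2f}: {assess(tau)}")
```

Computing tau separately per task type (code, chat, summarization) rather than over the whole sample shows whether a healthy average is hiding a low-acceptance workload.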

Batch Inference: Where Speculative Decoding Does Not Help

Batch inference is where speculative decoding consistently underperforms. In batched serving, multiple requests are processed simultaneously, and the batch can saturate GPU compute units, making the operation compute-bound. Speculative decoding addresses memory-bandwidth bottlenecks, which are not the binding constraint in batched settings.
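A roofline-style estimate shows where the crossover sits. The sketch below assumes roughly 2 FLOPs per parameter per token, weights read once per forward pass, and nominal peak figures of 989 TFLOP/s and 3.35 TB/s. Those hardware numbers are illustrative assumptions for an H100-class device, and the model ignores attention and KV cache traffic.

```python
def arithmetic_intensity(tokens_in_flight: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weights read: about 2 FLOPs per parameter per token."""
    return 2.0 * tokens_in_flight / bytes_per_param

def is_compute_bound(tokens_in_flight: int, peak_flops: float,
                     hbm_bytes_per_second: float,
                     bytes_per_param: float = 2.0) -> bool:
    """Compare intensity with the ridge point where compute becomes the limit."""
    ridge_point = peak_flops / hbm_bytes_per_second
    return arithmetic_intensity(tokens_in_flight, bytes_per_param) >= ridge_point

PEAK_FLOPS = 989e12     # assumed dense fp16 peak, FLOP/s
HBM_BANDWIDTH = 3.35e12 # assumed memory bandwidth, bytes/s

for batch in (1, 8, 64, 512):
    bound = "compute" if is_compute_bound(batch, PEAK_FLOPS, HBM_BANDWIDTH) else "memory"
    print(f"batch {batch:>3}: {bound}-bound")
```

Under these assumptions a single request is far below the ridge point, which is the idle capacity speculative decoding uses. A batch of several hundred sequences is already past it, so there is no spare compute left for verifying speculative tokens.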

NVIDIA’s reported 3.6x throughput improvement on H200 GPUs applies specifically to single-request low-latency settings. For high-throughput batch serving, continuous batching and PagedAttention, vLLM’s core techniques, remain more impactful than speculative decoding. The two optimizations address fundamentally different bottlenecks: speculative decoding targets per-request latency, while PagedAttention targets KV cache memory efficiency at scale.

Foundation models used for batch inference over large datasets, including the class of genomic foundation models like Arc Institute’s Evo 2, which process millions of sequences, are better served by batching optimizations than by speculative decoding. The latency reduction that speculative decoding provides is most valuable in interactive deployments where per-request response time drives user experience.

Production Implementations in 2025-2026

All three major open-source serving frameworks now include native speculative decoding support. vLLM supports configurable draft model selection and dynamic speculation length, with separate KV cache management for draft and target models. TensorRT-LLM from NVIDIA includes CUDA-optimized draft-verify kernels with reported 3.6x throughput improvement on H200 hardware for interactive use cases. SGLang supports EAGLE-style feature-level drafting, which achieves higher acceptance rates than size-matched conventional draft models in most benchmarks.

Combining 4-bit weight quantization (for example the NF4 format popularized by QLoRA) with EAGLE-2 speculative decoding can achieve 5-7x latency reduction over a full-precision autoregressive baseline, at some cost to output quality from quantization. A quantized draft model paired with a quantized target model still benefits from the draft-verify architecture, and the acceptance rate analysis applies without modification: the output matches the quantized target’s distribution exactly, so any quality loss comes from quantization rather than from speculation.

Limitations and Open Questions

The acceptance rate ceiling is the technique’s fundamental limitation. Even EAGLE-2’s best results show tau well below K. You cannot accept all draft tokens in the general case, because the draft model is not the target model. Tasks with high per-token uncertainty will always see low acceptance rates, and no amount of draft model improvement can eliminate this without making the draft model equivalent to the target model.

The draft model maintenance burden is real in production. When the target model is updated (new fine-tune, new instruction tuning, new version), the draft model may need retraining to maintain compatibility. Teams that deploy speculative decoding inherit a two-model deployment and update pipeline rather than a one-model pipeline.

Self-speculative decoding solves the maintenance problem but achieves lower speedups. EAGLE solves the speedup problem but requires a separately managed draft artifact. No current approach eliminates both constraints simultaneously, though Mirror-SD’s direction (improving the efficiency-accuracy frontier through principled verification algorithms rather than better drafting) may eventually reduce the dependence on high-quality draft models altogether.

For practitioners today, speculative decoding is the correct first optimization to evaluate for interactive LLM deployments where per-request latency is the primary constraint. The technique is mature, production-ready, framework-supported, and backed by a proof that output quality is preserved exactly. The question is not whether to use it but which variant fits the deployment constraints, and that depends on whether maintaining a draft model is acceptable in your infrastructure. The inference-cost argument driving compute-optimal training choices applies equally here: smaller inference-optimal models are easier draft targets and achieve higher acceptance rates with proportionally smaller draft heads.
