  • LoRA and QLoRA: Fine-Tuning Large Models on One GPU

    Full fine-tuning a 7B language model requires between 100 and 120 gigabytes of GPU memory. That means at minimum two A100 80GB cards and roughly $50,000 in hardware just to run a single training job. For a 70B model, the math gets worse by a factor of ten.
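
    The memory figure above can be reproduced with a back-of-the-envelope count. This is a rough sketch, assuming mixed-precision training with Adam; the per-parameter byte counts are common rules of thumb rather than measurements, and activation memory is ignored.

```python
# Rough per-parameter memory for full fine-tuning with Adam in mixed precision.
# The byte counts are assumptions (common rules of thumb), not measurements.
BYTES_PER_PARAM = {
    "bf16 weights": 2,
    "bf16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}


def full_finetune_gb(n_params: float) -> float:
    """Memory for weights, gradients, and optimizer state, in GB (1e9 bytes)."""
    return n_params * sum(BYTES_PER_PARAM.values()) / 1e9


print(f"7B model: ~{full_finetune_gb(7e9):.0f} GB before activations")
```

    Under these assumptions a 7B model lands inside the 100 to 120 GB range quoted above, and most of that memory is optimizer state rather than the weights themselves.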

    LoRA, released in 2021 and presented at ICLR 2022, changed this calculation. QLoRA changed it again in 2023. Together, they made serious fine-tuning of large language models possible on a single consumer GPU. A 7B model now fine-tunes on a $1,500 RTX 4090. A 70B model on a single A100 80GB. The technique is not approximate or low-quality. On most tasks, LoRA with well-chosen rank recovers 90-95% of full fine-tuning performance while training less than 0.5% of the model’s parameters.

    Most explanations of LoRA describe what it does without explaining why it works. The answer comes from a 2021 paper on intrinsic dimensionality that is rarely cited in practitioner guides, even though it is the empirical foundation the LoRA authors explicitly built on.

    The Intrinsic Dimension Hypothesis: Why Weight Updates Are Low-Rank

    In 2021, Armen Aghajanyan, Luke Zettlemoyer, and Sonal Gupta at Facebook AI (now Meta AI) published a paper that should be required reading for anyone working with fine-tuning: “Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning” (ACL 2021).

    The paper measured how many trainable parameters a language model actually needs during fine-tuning by projecting all gradients into a random low-dimensional subspace. The question was: how small can that subspace be while still reaching 90% of the full fine-tuning performance on downstream tasks?

    The answer was striking. A RoBERTa model with 125 million parameters reached 90% of full performance on MRPC (a paraphrase detection benchmark) by optimizing just 200 parameters randomly projected back into the full space. The intrinsic dimension of fine-tuning, the minimum number of free parameters required to solve the task adequately, was orders of magnitude smaller than the model’s parameter count.
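
    The measurement procedure is easier to see in code. The sketch below is a toy illustration, not the paper's setup: it uses a dense Gaussian projection and a quadratic loss whose optimum is assumed to lie inside the random subspace, and the names `theta0`, `P`, and `z` are illustrative. Only the `d` values in `z` are trained; the full parameter vector is always reconstructed as `theta0 + P z`.

```python
import random

random.seed(0)
D, d = 200, 5  # full parameter count vs. size of the trainable subspace

theta0 = [random.gauss(0, 1) for _ in range(D)]  # frozen "pre-trained" weights
# Fixed random projection of shape (D, d); it is never trained.
P = [[random.gauss(0, 1) / D ** 0.5 for _ in range(d)] for _ in range(D)]


def expand(z):
    """Map the d trainable values back into the full D-dimensional space."""
    return [theta0[i] + sum(P[i][j] * z[j] for j in range(d)) for i in range(D)]


# Toy task: the optimum is reachable inside the subspace. This is an assumption
# made so the example converges; real tasks satisfy it only approximately.
z_star = [1.0, -2.0, 0.5, 3.0, -1.0]
target = expand(z_star)


def loss(z):
    return sum((t - y) ** 2 for t, y in zip(expand(z), target))


z = [0.0] * d
initial = loss(z)
for _ in range(200):
    theta = expand(z)
    grad_theta = [2 * (t - y) for t, y in zip(theta, target)]
    # Chain rule: the gradient w.r.t. z is P^T times the full-space gradient.
    grad_z = [sum(P[i][j] * grad_theta[i] for i in range(D)) for j in range(d)]
    z = [zj - 0.1 * gj for zj, gj in zip(z, grad_z)]

print(f"trained {d} of {D} values, loss {initial:.3f} -> {loss(z):.2e}")
```

    Measuring intrinsic dimension then amounts to repeating this with a real model and task while shrinking `d` until performance drops below the chosen threshold.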

    The paper also found that pre-training implicitly minimizes intrinsic dimension, and that larger pre-trained models tend to have lower intrinsic dimension after fine-tuning. This connects to compute-optimal scaling: more training tokens on a well-pre-trained model reduce the size of the fine-tuning adjustment required for downstream tasks. The more parameters a model has, and the better its pre-training, the lower the rank of the weight change required to adapt it to a new task.

    This is the foundation LoRA stands on. The weight change during fine-tuning is low-rank in practice because the optimization problem of adapting a well-pre-trained model to a downstream task has low intrinsic dimension.

    How LoRA Works: The Mathematics

    Edward Hu, Yelong Shen, Phillip Wallis, and colleagues at Microsoft released LoRA in 2021 and published it at ICLR 2022. The core idea is to represent the weight update for each linear layer as a product of two low-rank matrices rather than training the full weight matrix directly.

    For a weight matrix W of shape (d_out, d_in), standard fine-tuning trains the full update delta_W, which has d_out times d_in parameters. LoRA instead trains two matrices: A of shape (r, d_in) and B of shape (d_out, r), where r is the rank hyperparameter. The effective weight update is B times A, which has the same shape as delta_W but is parameterized by only r times (d_in plus d_out) values.

    At rank r = 8, for a typical attention weight matrix with d = 4096, this reduces trainable parameters from roughly 16 million per matrix to about 65,000: a 99.6% reduction.
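
    The arithmetic behind that reduction is short enough to check directly. A minimal sketch; the function name is illustrative:

```python
def lora_param_counts(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    """Return (full-update parameters, LoRA parameters) for one weight matrix."""
    full = d_out * d_in
    lora = r * (d_in + d_out)  # A is (r, d_in), B is (d_out, r)
    return full, lora


full, lora = lora_param_counts(d_in=4096, d_out=4096, r=8)
print(f"full: {full:,}  lora: {lora:,}  reduction: {1 - lora / full:.1%}")
```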

    During training, the pre-trained weights W remain frozen. Only A and B are updated. During inference, A and B are multiplied together and added to W, producing a single merged weight matrix that requires no extra compute. The inference overhead of LoRA is exactly zero once the adapter is merged.

    The initialization matters. Matrix A is initialized with Gaussian random values, and B is initialized to zero. This ensures that the product B times A equals zero at training start, meaning the model begins fine-tuning from the pre-trained behavior rather than a random perturbation.
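
    A minimal pure-Python sketch of one LoRA layer shows how the frozen weight, the two adapter matrices, and the initialization fit together. The class name, the Gaussian standard deviation of 0.01, and the tiny matrix sizes are assumptions chosen for readability; a real implementation would use a tensor library.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


class LoRALinear:
    """A frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, W, r, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen, shape (d_out, d_in)
        # A: Gaussian init, shape (r, d_in). The 0.01 std is an assumed value.
        self.A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
        # B: zero init, shape (d_out, r), so B @ A == 0 before training.
        self.B = [[0.0] * r for _ in range(d_out)]
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.W, x)
        # B @ (A @ x): the d_out x d_in product is never materialized.
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]


W = [[1.0, 2.0, 3.0], [0.0, -1.0, 0.5]]
layer = LoRALinear(W, r=2, alpha=4)
x = [1.0, 1.0, 2.0]
print(layer.forward(x), matvec(W, x))  # the two match before any training step
```

    Because `B` starts at zero, the first forward pass reproduces the pre-trained layer exactly, and gradients flow into `B` first, which then lets `A` start to matter.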

    The Scaling Factor Problem: LoRA vs rsLoRA

    The original LoRA implementation applies a scaling factor of alpha divided by r to the adapter output, where alpha is a hyperparameter typically set to twice the rank value. This scaling was introduced to make the adapter magnitude consistent across rank choices, with higher rank adapters producing smaller per-parameter updates.

    In 2023, Damjan Kalajdzievski published rsLoRA (A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA), identifying a problem with this convention. Standard LoRA’s scaling by alpha/r causes adapter signal strength to decrease as rank increases. This means that increasing rank to improve expressiveness simultaneously decreases the learning signal per parameter, creating a coupling between rank and effective learning rate that forces practitioners to re-tune alpha for each rank choice.

    rsLoRA corrects this by scaling with alpha divided by the square root of r instead of alpha divided by r. This stabilizes the adapter gradient magnitude across rank values, allowing higher ranks to produce meaningfully better results without manual re-tuning. The rsLoRA paper demonstrated that models trained with the corrected scaling factor consistently outperformed standard LoRA at the same rank, and that the benefit grew with rank, making rsLoRA especially valuable when rank needs to be high for complex tasks.
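
    The difference between the two conventions is easiest to see by holding alpha fixed and sweeping the rank. A small sketch; the value of alpha is illustrative:

```python
import math


def lora_scale(alpha: float, r: int) -> float:
    """Original LoRA convention."""
    return alpha / r


def rslora_scale(alpha: float, r: int) -> float:
    """Rank-stabilized convention."""
    return alpha / math.sqrt(r)


alpha = 16  # held fixed while the rank changes
for r in (8, 16, 64, 256):
    print(f"r={r:>3}  alpha/r={lora_scale(alpha, r):.4f}  "
          f"alpha/sqrt(r)={rslora_scale(alpha, r):.4f}")
```

    Going from r = 8 to r = 256 shrinks the original multiplier by a factor of 32 but the rank-stabilized one by only about 5.7, which is why high-rank adapters keep a usable learning signal under rsLoRA.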

    rsLoRA is now available in the Hugging Face PEFT library. The practical implication: if you are using LoRA at ranks above 16, rsLoRA is worth enabling. The original convention works well at low ranks but underperforms at high ranks precisely when you need the most expressiveness.

    DoRA: Decomposing Magnitude and Direction

    Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen published DoRA (Weight-Decomposed Low-Rank Adaptation) in 2024, identifying a structural difference between how LoRA and full fine-tuning modify weight matrices.

    Full fine-tuning can change a weight matrix’s magnitude (how strongly it responds to inputs) and direction (which inputs it responds to) independently. LoRA, by adding a low-rank update to the full weight matrix, couples these two changes. A 2024 analysis (LoRA vs Full Fine-tuning: An Illusion of Equivalence) confirmed that LoRA and full fine-tuning learn qualitatively different weight structures even when their downstream task performance is similar, with LoRA producing weight matrices whose singular value decompositions have markedly different structure.

    DoRA decomposes the weight matrix into magnitude and direction components, then applies LoRA only to the direction component while allowing the magnitude to change freely. This gives the adapter more expressive power to replicate the learning dynamics of full fine-tuning, while keeping the trainable parameter count similar to standard LoRA.
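
    The decomposition can be written as W' = m * (W0 + B A) / ||W0 + B A||, with the norm taken per column and m a trainable magnitude vector. The sketch below follows that column-wise description on a tiny matrix; it is an illustration under that assumption, the helper names are made up for the example, and library implementations may normalize along a different axis depending on how they store weights.

```python
import math


def column_norms(M):
    """Euclidean norm of each column of a matrix given as a list of rows."""
    return [math.sqrt(sum(row[j] ** 2 for row in M)) for j in range(len(M[0]))]


def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def dora_weight(W0, B, A, m):
    """Recompose W' = m * (W0 + B @ A) / ||W0 + B @ A||, column by column."""
    BA = matmul(B, A)
    V = [[w + u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W0, BA)]
    norms = column_norms(V)
    return [[m[j] * v / norms[j] for j, v in enumerate(row)] for row in V]


W0 = [[3.0, 0.0], [4.0, 2.0]]
m = column_norms(W0)   # magnitude starts at the pre-trained column norms
A = [[0.1, -0.2]]      # rank 1, shape (r, d_in)
B = [[0.0], [0.0]]     # zero init, shape (d_out, r)

print(dora_weight(W0, B, A, m))            # equals W0 until B or m is trained
print(dora_weight(W0, B, A, [10.0, 2.0]))  # first column doubles in magnitude only
```

    Changing `m` rescales a column without rotating it, while training `B` and `A` rotates the column without directly setting its length; that separation is the extra degree of freedom DoRA adds over plain LoRA.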

    In the original DoRA experiments, the method consistently outperformed standard LoRA on commonsense reasoning, visual instruction tuning (LLaVA-style VLMs), and text-to-image generation benchmarks, with the gains most pronounced on complex tasks requiring structural changes to model representations. DoRA is available in the Hugging Face PEFT library as an alternative to standard LoRA and is worth evaluating when standard LoRA fails to close the gap with full fine-tuning on a specific task.

    QLoRA: Training 70B Models on a Single A100

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer at the University of Washington published QLoRA in 2023, solving a different problem than LoRA: even when you only train 0.5% of parameters, the frozen base model still occupies GPU memory.

    A LLaMA-7B model in fp16 requires 14GB for its frozen weights alone. LLaMA-13B requires 26GB. LLaMA-65B requires 130GB, which is impossible on a single consumer GPU. QLoRA addresses this by quantizing the frozen base model to 4 bits, then applying standard LoRA adapters at bf16 precision on top of the quantized base.

    The key innovation in QLoRA is not 4-bit quantization itself, but a new 4-bit data type called NF4 (NormalFloat4) designed specifically for the distribution of neural network weights. Standard int4 quantization assumes a uniform distribution across the quantization range. Neural network weights follow an approximately normal (Gaussian) distribution. NF4 spaces its 16 quantization levels at the quantiles of a standard normal distribution, minimizing expected quantization error for normally distributed weights compared to uniform int4.
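
    The idea of placing levels at normal quantiles can be sketched with the standard library alone. This is a simplified illustration, not the exact NF4 table: the real data type is constructed asymmetrically so that zero is represented exactly, which this version does not do. Function names and the sample block are made up for the example.

```python
from statistics import NormalDist


def quantile_levels(bits: int = 4) -> list[float]:
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    n = 2 ** bits
    normal = NormalDist()
    # Bin midpoints keep every level away from the infinite 0% / 100% quantiles.
    raw = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    top = max(abs(v) for v in raw)
    return [v / top for v in raw]


def quantize_block(weights, levels):
    """Absmax-scale one block, then snap each weight to the nearest level."""
    scale = max(abs(w) for w in weights)  # the per-block quantization constant
    codes = [min(range(len(levels)), key=lambda k: abs(w / scale - levels[k]))
             for w in weights]
    return codes, scale


def dequantize_block(codes, scale, levels):
    return [levels[c] * scale for c in codes]


levels = quantile_levels()
block = [0.02, -0.11, 0.05, 0.31, -0.27, 0.0, 0.08, -0.04]
codes, scale = quantize_block(block, levels)
restored = dequantize_block(codes, scale, levels)
print(codes)
print([round(w, 3) for w in restored])
```

    The levels are dense near zero, where most weights sit, and sparse in the tails, which is the property that lowers expected error relative to evenly spaced int4 levels.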

    QLoRA also introduces Double Quantization (DQ), which quantizes the quantization constants themselves. Each small block of weights has its own scaling factor (a quantization constant). Standard 4-bit quantization stores these constants in fp32, which adds roughly 0.5 bits per parameter. Double Quantization applies 8-bit quantization to the quantization constants, recovering most of this overhead and reducing the average storage per parameter to approximately 4.127 bits.
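
    The overhead figures follow from the block sizes. The sketch below assumes the settings described in the QLoRA paper: 64 weights per first-level constant and 256 first-level constants per fp32 second-level constant.

```python
BLOCK = 64        # weights per quantization constant
SUPERBLOCK = 256  # first-level constants per second-level constant

without_dq = 32 / BLOCK                          # one fp32 constant per block
with_dq = 8 / BLOCK + 32 / (BLOCK * SUPERBLOCK)  # 8-bit constants + fp32 second level

print(f"overhead without DQ: {without_dq:.3f} bits/param")
print(f"overhead with DQ:    {with_dq:.3f} bits/param")
print(f"total with DQ:       {4 + with_dq:.3f} bits/param")
```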

    The result: LLaMA-65B, which requires approximately 130GB at fp16, fine-tunes in QLoRA at approximately 41GB, within the memory budget of a single A100 80GB GPU. The Guanaco models that the QLoRA paper used to validate the technique were competitive with ChatGPT on human evaluation benchmarks despite being trained on a single GPU over a single day.

    What Rank r to Choose

    Rank selection is the primary hyperparameter in LoRA and has no universal answer. The right value depends on the complexity of the task, the size of the base model, and the amount of training data available.

    Low ranks (r = 4 or r = 8): Appropriate for simple instruction following, style transfer, or domain adaptation tasks where the target behavior is a small perturbation of the base model’s existing capabilities. These are fast, memory-efficient, and often sufficient for single-domain specialization.

    Medium ranks (r = 16 to r = 32): The practical default for most fine-tuning tasks, including instruction tuning, chat format adaptation, and moderately complex capability extension. The original LoRA paper uses r = 4 and r = 8 for its reported results, but the broader practitioner community has found that r = 16 to r = 32 provides a better starting point for general-purpose fine-tuning.

    High ranks (r = 64 or above): Required for complex structural changes, code generation fine-tuning, or domain transfers that require substantial departure from base model behavior. At high ranks, rsLoRA’s scaling correction becomes important. Biderman et al. (2024) found that LoRA has persistent difficulty matching full fine-tuning on code generation even at r = 256, the kind of gap with full fine-tuning that DoRA aims to narrow.

    A practical heuristic from the PEFT literature: start at r = 16 with rsLoRA enabled, evaluate on a held-out validation set, and increase rank only if performance plateaus below the full fine-tuning baseline. Doubling rank from r = 16 to r = 32 roughly doubles trainable parameter count and memory for optimizer states, but often yields diminishing returns beyond r = 64 for most NLP tasks.

    Which Layers to Adapt

    LoRA can be applied to any linear layer, but the original paper and most subsequent work apply it to the attention projection matrices (query, key, value, and output projections) and, optionally, the feed-forward layer weights. Empirically, adapting the query and value matrices alone captures most of the benefit on NLP tasks. The output projection and feed-forward layers add measurable improvement on some tasks but are optional.

    For instruction tuning and conversational fine-tuning, adapting all attention projection matrices at r = 16 is the standard starting configuration. For domain-specific knowledge injection, including the feed-forward layers is often worth the additional parameter cost because factual knowledge in transformers is disproportionately stored in the MLP layers rather than attention.

    Merge at Inference: Zero Overhead

    The practical advantage most practitioners underuse is LoRA merging. After training, the adapter matrices A and B can be multiplied together, scaled, and added directly to the corresponding frozen weight matrix W. The merged matrix, W plus the scaled product B times A, produces exactly the same outputs as running the base layer and the adapter side by side, and it has the same shape and memory footprint as the original W, with no additional weights.

    After merging, inference is identical in compute and memory to running the base model. There is no adapter overhead, no conditional logic, and no additional memory allocation. This makes LoRA-fine-tuned models production-trivial to deploy: ship the merged model exactly as you would ship the base model. A merged LoRA model also pairs cleanly with speculative decoding, since the draft model needs a static target weight matrix to verify against.
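
    The equivalence between the adapter path and the merged path can be checked on a toy example. A minimal sketch with hand-picked values; `scale` stands for alpha divided by r.

```python
def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def merge(W, B, A, scale):
    """Return W + scale * (B @ A) as a new matrix with the same shape as W."""
    r = len(A)
    return [[W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(r))
             for j in range(len(W[0]))] for i in range(len(W))]


W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -1.0]]   # (r, d_in) with r = 1
B = [[2.0], [1.0]]  # (d_out, r)
scale = 2.0         # alpha / r
x = [1.0, 3.0]

# Training-time path: base output plus the scaled adapter output.
adapter_path = [w + scale * u
                for w, u in zip(matvec(W, x), matvec(B, matvec(A, x)))]
# Deployment path: a single matrix, no adapter in the forward pass.
merged_path = matvec(merge(W, B, A, scale), x)
print(adapter_path, merged_path)
```

    In floating point on real models the two paths agree up to rounding rather than bit for bit; the values here were chosen so the match is exact.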

    Merging also enables efficient multi-task deployment through task arithmetic. Multiple LoRA adapters trained on different tasks can be merged into a single base model simultaneously, with each adapter’s contribution scaled by a mixing coefficient. This is not perfect (tasks can interfere), but it allows a single model checkpoint to approximate the behavior of several separately fine-tuned models, which has significant implications for serving infrastructure cost.

    Where LoRA Falls Short

    The 2024 paper “LoRA vs Full Fine-tuning: An Illusion of Equivalence” (Shuttleworth et al., 2024, arXiv:2410.21228) is the most important critical analysis of LoRA published to date. The paper found that even when LoRA and full fine-tuning achieve similar downstream benchmark scores, the weight matrices they produce have structurally different singular value decompositions. Full fine-tuning tends to produce weight changes distributed across many singular directions. LoRA, by construction, concentrates weight change in at most r singular directions regardless of what the task requires.

    The practical implication: LoRA may solve the benchmark problem while solving it in a structurally different way than full fine-tuning, which can produce brittle behavior outside the specific distribution the fine-tuning data covered. For production deployments requiring broad generalization, this is worth knowing before committing to LoRA as the sole fine-tuning method.

    Biderman et al. (2024) found that LoRA consistently underperforms full fine-tuning on code generation even at high ranks, a finding that has been replicated. DoRA partially addresses this but has not fully closed the gap. If code generation capability is the primary objective, full fine-tuning on a model small enough to fit the compute budget is often worth pursuing over LoRA on a larger model.

    QLoRA introduces an additional source of degradation: the 4-bit quantization of the base model. NF4 is carefully designed to minimize quantization error for normal distributions, but it is still lossy compression. The Guanaco results in the original QLoRA paper show competitive performance with full-precision fine-tuning, but more recent work has found that QLoRA typically loses 1-3 perplexity points relative to the same fine-tune at bf16 precision. For tasks where perplexity differences of this magnitude are acceptable, QLoRA is the right choice. For tasks requiring maximum capability extraction from a given model, bf16 LoRA or full fine-tuning at bf16 on a smaller model may perform better.

    The Current Fine-Tuning Ecosystem

    The Hugging Face PEFT library is the standard implementation for LoRA, QLoRA, rsLoRA, and DoRA. It supports all major model families and integrates directly with the Transformers trainer, making the practical barrier to entry for parameter-efficient fine-tuning extremely low. Axolotl provides a higher-level wrapper around PEFT and Transformers that has become the dominant open-source tool for community fine-tuning of models like LLaMA-3, Mistral, and Qwen.

    The Unsloth library implements hand-written Triton kernels for LoRA and QLoRA that reduce memory usage and training time relative to the standard PEFT implementation by roughly 30-50% through kernel fusion. For practitioners pushing the memory limits of a single GPU, Unsloth is worth evaluating before upgrading hardware.

    As base models improve in general capability, the intrinsic dimension of fine-tuning for standard tasks continues to decrease. This means that lower ranks and smaller LoRA adapters become progressively sufficient as the base model gets better at pre-training. The trend suggests that for routine instruction following and style adaptation, LoRA ranks that would have seemed too small two years ago now suffice, a practical benefit of the broader improvements in base model quality driven by frontier-scale training regimes. The same scaling pressure that drives genomic foundation models like Evo 2 to 9.3 trillion training tokens is simultaneously reducing the fine-tuning cost of smaller language models.

    For the practitioner making a concrete decision: LoRA at r = 16 with rsLoRA enabled and bf16 precision on a model that fits in memory is the right default for most instruction tuning and domain adaptation tasks. Drop to QLoRA only when the target model exceeds available memory, and accept the 1-3 point perplexity cost that quantization introduces as the price of running at that scale on constrained hardware. Evaluate DoRA when standard LoRA consistently falls short of full fine-tuning performance on your specific task.
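
    As a starting point, that default can be written down as keyword arguments. This is a sketch rather than a tested recipe: the keys follow the parameter names of Hugging Face PEFT's `LoraConfig`, the target module names assume a LLaMA-style architecture, and the dropout value is an illustrative choice that does not come from the text above.

```python
# Keyword arguments mirroring the default recommended above. Values marked as
# assumptions should be tuned for the model and task at hand.
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,        # twice the rank; assumption, re-tune when changing scaling
    "use_rslora": True,      # scale by alpha / sqrt(r) instead of alpha / r
    "use_dora": False,       # flip to True when evaluating DoRA
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],  # LLaMA-style names
    "lora_dropout": 0.05,    # illustrative value
    "task_type": "CAUSAL_LM",
}

# With peft and transformers installed, usage would look roughly like:
#   from peft import LoraConfig, get_peft_model
#   model = get_peft_model(base_model, LoraConfig(**lora_kwargs))
print(lora_kwargs)
```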

  • Speculative Decoding: How LLMs Generate 3x Faster

    Every token a large language model generates requires one full forward pass through tens to hundreds of billions of parameters. On a single H100 GPU, a 70B model runs at roughly 40-60 tokens per second for an interactive user, fast enough to read but far too slow for latency-sensitive applications like real-time code completion or multi-step agents that chain dozens of model calls together.

    Speculative decoding solves this without touching model weights, without quantization, and without any sacrifice in output quality. The technique, published independently by Leviathan, Kalman, and Matias at Google Research in 2023 and by Chen, Borgeaud, Irving, Lespiau, Sifre, and Jumper at DeepMind the same year, achieves 2-4x latency reductions that compound with other inference optimizations.

    Understanding why it works, including the rejection sampling proof that guarantees identical output distributions, is now foundational knowledge for any engineer building production systems on top of large language models.

    The Core Problem: Sequential Generation Is Memory-Bound, Not Compute-Bound

    The common explanation for why LLM inference is slow blames model size. The more precise diagnosis is that autoregressive generation is memory-bandwidth-bound, not compute-bound.

    Each decoding step transfers the entire model’s parameters from GPU high-bandwidth memory (HBM) to on-chip cache. A 70B model in fp16 occupies roughly 140GB of parameters. Every single token requires loading all of that into cache, a memory transfer that consumes nearly the same time whether the model is completing a trivial phrase or solving a complex reasoning problem. GPU compute units sit mostly idle during this transfer.

    This is the key structural insight: GPU compute is chronically underutilized during autoregressive decoding. Speculative decoding exploits that idle capacity by making multiple tokens available for parallel verification in a single forward pass.
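
    A bandwidth-only estimate makes the bottleneck concrete. The sketch assumes every weight is read once per generated token, ignores KV cache traffic and compute time, and uses an assumed memory bandwidth of 3.35 TB/s per device (roughly an H100 SXM); the `n_devices` parameter models ideal sharding across GPUs, which real systems only approximate.

```python
def min_ms_per_token(n_params, bytes_per_param, bandwidth_bytes_per_s, n_devices=1):
    """Lower bound on decode latency if every weight is read once per token."""
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / (bandwidth_bytes_per_s * n_devices) * 1e3


HBM_BANDWIDTH = 3.35e12  # bytes/s per device; assumed figure

one = min_ms_per_token(70e9, 2, HBM_BANDWIDTH)
two = min_ms_per_token(70e9, 2, HBM_BANDWIDTH, n_devices=2)
print(f"70B fp16, 1 device:  >= {one:.1f} ms/token (<= {1e3 / one:.0f} tokens/s)")
print(f"70B fp16, 2 devices: >= {two:.1f} ms/token (<= {1e3 / two:.0f} tokens/s)")
```

    The bound does not depend on how hard the token is to predict, which is the point: the time goes to moving weights, not to arithmetic.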

    The Draft-Verify Architecture

    The speculative decoding loop runs two phases per iteration.

    In the draft phase, a small fast model (typically 7B parameters or fewer) generates K speculative tokens autoregressively. The draft model is cheap to run because it is small. Its job is not to be right about every token but to be right often enough to earn significant acceptance rates.

    In the verify phase, the target model runs a single forward pass that processes all K draft tokens simultaneously, treating them as a sequence rather than generating them one at a time. This looks architecturally identical to the prefill phase of normal LLM inference, and the GPU processes all K tokens in parallel, exploiting the compute capacity that autoregressive decoding wastes.

    If the target model agrees with the draft on the first k tokens (where k is between 0 and K), those k tokens are accepted and appended. When the target model rejects a draft token at position j, it substitutes its own sample for that position and discards subsequent draft tokens. When it accepts all K, the same forward pass supplies one further token from the target. Either way, each iteration produces at least one token, and in high-acceptance-rate scenarios, consistently produces four to five.
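
    The loop can be sketched with two toy deterministic "models". This is a greedy-decoding illustration only: both models are made-up functions of the last token, acceptance is an exact match against the target's greedy choice, and the verify phase is simulated with a Python loop where a real system would issue one batched forward pass.

```python
def target_next(seq):
    """Toy 'large' model: greedy next token as a function of the last token."""
    return (seq[-1] * 3 + 1) % 10


def draft_next(seq):
    """Toy 'small' model: agrees with the target except after tokens 4 and 7."""
    return 0 if seq[-1] in (4, 7) else target_next(seq)


def speculative_greedy(prompt, n_tokens, K=4):
    seq, target_passes = list(prompt), 0
    while len(seq) - len(prompt) < n_tokens:
        # Draft phase: K cheap autoregressive steps.
        draft = []
        for _ in range(K):
            draft.append(draft_next(seq + draft))
        # Verify phase: counted as one target pass over all draft positions.
        target_passes += 1
        for i in range(K + 1):
            expected = target_next(seq + draft[:i])
            if i == K or draft[i] != expected:
                # Accepted prefix plus one token chosen by the target itself.
                seq += draft[:i] + [expected]
                break
    return seq[len(prompt):len(prompt) + n_tokens], target_passes


def autoregressive_greedy(prompt, n_tokens):
    seq = list(prompt)
    for _ in range(n_tokens):
        seq.append(target_next(seq))
    return seq[len(prompt):]


spec, passes = speculative_greedy([1], n_tokens=12)
print(spec == autoregressive_greedy([1], 12), "target passes:", passes)
```

    The output matches plain greedy decoding token for token while using far fewer target passes than tokens generated; every iteration appends at least the one token the target chose.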

    The expected number of tokens accepted per iteration is called the average acceptance length (tau). If tau equals 3.5, speculative decoding needs roughly 1/3.5 as many full target-model passes as autoregressive decoding. That ratio sets the ceiling on latency reduction; the time spent running the draft model eats into it.
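
    Under a simplifying assumption used in the Leviathan et al. analysis, that each draft token is accepted independently with the same probability alpha, both the expected tokens per iteration and the wall-clock speedup have closed forms. In the sketch below, `cost_ratio` is the cost of one draft step relative to one target pass, and the count includes the one token the target itself contributes each iteration.

```python
def expected_tokens(alpha: float, K: int) -> float:
    """Expected tokens produced per iteration with i.i.d. acceptance rate alpha."""
    if alpha == 1.0:
        return K + 1
    return (1 - alpha ** (K + 1)) / (1 - alpha)


def expected_speedup(alpha: float, K: int, cost_ratio: float) -> float:
    """Speedup over autoregressive decoding when a draft step costs cost_ratio passes."""
    return expected_tokens(alpha, K) / (K * cost_ratio + 1)


for alpha in (0.6, 0.8, 0.9):
    print(f"alpha={alpha}  tokens/iter={expected_tokens(alpha, 4):.2f}  "
          f"speedup={expected_speedup(alpha, 4, 0.1):.2f}x")
```

    The denominator is why a larger K or a heavier draft model does not automatically help: each extra draft step is paid for whether or not its token is accepted.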

    The Rejection Sampling Math: Why Output Quality Is Preserved

    The non-obvious component of speculative decoding is the sampling correction step. Accepting tokens from a draft model’s distribution could bias output toward that smaller model. The rejection sampling algorithm prevents this precisely.

    Let p(x) denote the target model’s probability distribution over vocabulary tokens at a given position, and q(x) the draft model’s distribution. For each draft token x drawn from q, the algorithm accepts it with probability:

    min(1, p(x) / q(x))

    When a token is rejected, the algorithm samples a corrected replacement from:

    normalize(max(0, p(x) - q(x)))

    Leviathan et al. and Chen et al. both prove that this procedure produces samples distributed exactly as p(x), the target model’s distribution, regardless of what q(x) looks like. Crucially, this is not an approximation. When draft and target agree perfectly, all draft tokens are accepted with probability 1. When they disagree completely, the algorithm falls back to exactly the target model’s output. The proof holds for any draft model and any target model over any shared vocabulary.
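
    The exactness claim can be checked numerically for a single position. The sketch below implements the accept/resample rule as described and also computes the resulting output distribution analytically; the two example distributions are arbitrary, and the function names are illustrative.

```python
import random


def residual(p, q):
    """normalize(max(0, p - q)): the distribution used after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0:  # p == q: a rejection never actually happens
        return list(p)
    return [d / total for d in diff]


def speculative_sample(p, q, rng):
    """Draw one token distributed as p, using a proposal drawn from q."""
    x = rng.choices(range(len(q)), weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    return rng.choices(range(len(p)), weights=residual(p, q))[0]


def output_distribution(p, q):
    """Exact distribution of speculative_sample, computed analytically."""
    accept = [qi * min(1.0, pi / qi) for pi, qi in zip(p, q)]
    p_reject = 1.0 - sum(accept)
    return [a + p_reject * r for a, r in zip(accept, residual(p, q))]


p = [0.5, 0.3, 0.2]  # target distribution
q = [0.2, 0.2, 0.6]  # draft distribution
print([round(v, 6) for v in output_distribution(p, q)])

rng = random.Random(0)
n = 20000
counts = [0, 0, 0]
for _ in range(n):
    counts[speculative_sample(p, q, rng)] += 1
print([round(c / n, 3) for c in counts])
```

    Tokens the draft over-proposes (q above p) are accepted only part of the time, and the residual distribution tops up exactly the tokens the draft under-proposes, so the two effects cancel and the result is p.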

    This mathematical guarantee is what makes speculative decoding production-safe. It is not model distillation, not a lower-quality approximation, and not model compression. The output distribution is provably identical to running the target model alone.

    EAGLE: Feature-Level Drafting

    The original speculative decoding papers use a separately trained, smaller version of the target model as the draft model. This works but has a fundamental limitation: the draft model predicts tokens from scratch, using the same input representations as the target model.

    EAGLE, published by Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang (Peking University and Microsoft Research, 2024), rethought the draft model by observing that the target LLM’s second-to-last layer features carry far more predictive information about the next token than raw input embeddings. EAGLE trains a lightweight autoregressive head that predicts future feature vectors directly, reasoning in the target model’s representation space rather than from input tokens.

    This change raises acceptance rates substantially. The draft model makes fewer errors because it works from richer, higher-level representations. EAGLE achieved average acceptance lengths of three to four tokens across Vicuna and LLaMA-2 series evaluations, compared to two to three for standard speculative sampling with size-comparable draft models.

    EAGLE-2: Context-Dependent Draft Trees

    EAGLE used a static draft tree, the same K speculative tokens allocated in the same structure regardless of generation context. This assumption has a subtle flaw that the follow-up work addressed directly.

    Acceptance rates for draft tokens are not only position-dependent but strongly context-dependent, as the EAGLE-2 paper (Li, Wei, Zhang, Zhang, EMNLP 2024) demonstrates experimentally. When the target model completes a deterministic phrase (a common code pattern, a well-known mathematical identity, a numerical sequence), acceptance rates push toward 100%. When the model generates genuinely uncertain content, acceptance rates fall below 40%.

    EAGLE-2 exploits this by dynamically sizing the draft tree based on the draft model’s own confidence scores. The paper shows that EAGLE’s draft model is well-calibrated: its confidence approximates actual acceptance rates with small errors. EAGLE-2 uses this calibration to expand the draft tree during high-confidence contexts and shrink it when uncertainty is high, avoiding wasted compute on branches unlikely to be accepted.
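
    One way to picture the dynamic tree is as a selection problem: score each candidate by the product of draft confidences along its path, then keep the highest-scoring nodes within a fixed budget. The sketch below is a simplified illustration of that idea, not the paper's implementation; the tree, its confidence values, and the use of unique token strings as node keys are all hypothetical.

```python
def path_values(tree, root="root"):
    """Value of each node = product of draft confidences along its path from the root."""
    values, stack = {}, [(root, 1.0)]
    while stack:
        node, value = stack.pop()
        for child, confidence in tree.get(node, []):
            values[child] = value * confidence
            stack.append((child, values[child]))
    return values


def select_draft_tokens(tree, budget):
    """Keep the `budget` candidates estimated most likely to be accepted."""
    values = path_values(tree)
    return sorted(values, key=values.get, reverse=True)[:budget]


# Hypothetical candidate tree: node -> [(child, draft confidence), ...]
tree = {
    "root": [("for", 0.9), ("while", 0.1)],
    "for": [("i", 0.8), ("x", 0.2)],
    "while": [("True", 0.7)],
    "i": [("in", 0.95)],
}
print(select_draft_tokens(tree, budget=4))
```

    Because a child's value can never exceed its parent's, the selected nodes always form a connected tree: a confident context yields one deep branch, an uncertain one yields a shallow, wide tree.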

    Across evaluations on three LLM series (Vicuna, LLaMA-2, LLaMA-3) and six generation tasks, EAGLE-2 achieved speedup ratios of 3.05x to 4.26x, which is 20 to 40% faster than EAGLE-1, which was already faster than standard speculative sampling.

    Self-Speculative Decoding: No Separate Draft Model Required

    A significant practical barrier to deploying speculative decoding is the draft model itself. You need a compatible smaller model, which requires additional storage, memory management, and operational complexity. Self-speculative decoding removes this requirement.

    The insight is that many LLMs produce reasonable predictions long before the final transformer layer. Self-speculative decoding uses the target model’s own early layers as the draft model, generating speculative tokens via early exit from the forward pass, then verifying with the full model in the standard way.

    Meta AI’s LayerSkip (2024) formalized this approach by training models with early-exit regularization, enabling 1.3-2.5x speedups without a separate draft model. The tradeoff is that self-speculative methods achieve lower acceptance rates than EAGLE-class approaches, because early-exit distributions differ more from final-layer distributions than a purpose-trained draft model would. For teams unable to deploy an additional draft model, self-speculative decoding is the practical path to inference acceleration with a single model binary.

    Verification as Prefill: The GPU Utilization Insight

    The mechanism that makes speculative decoding computationally efficient is worth understanding precisely. Standard prefill (processing the input prompt before generation begins) is compute-bound on modern hardware, because the batch of prompt tokens can saturate GPU compute units. Autoregressive decoding is memory-bandwidth-bound: it processes one token at a time and leaves compute underutilized. Speculative decoding converts the verification step into a prefill-like operation where all K draft tokens are processed in a single parallel forward pass.

    The verification step therefore does roughly K times more arithmetic per byte of weights read than a single decoding step, which moves it away from the memory-bandwidth limit and toward the compute-bound regime. The GPU runs at closer to peak utilization during verification than during standard decoding. This is why the throughput gains are real rather than illusory: the hardware is doing more useful compute per unit wall time.

    Xiaoxuan Liu’s 2025 UC Berkeley PhD thesis on efficient LLM systems with speculative decoding formalizes this framing, recharacterizing speculative decoding as a verification efficiency problem rather than a drafting problem, and identifying adaptive and selective verification as the primary surface for future improvement.

    Apple’s Mirror-SD: Breaking the Latency-Acceptance Tradeoff

    Apple published Mirror Speculative Decoding (Bhendawade, Nishu, Kundu, Bartels, Cho, Belousova, December 2025) to address a constraint that applies to all existing approaches: increasing draft size K raises acceptance rates but also adds latency overhead in the autoregressive draft phase. EAGLE-2 partially addresses this through dynamic tree sizing, but the tradeoff remains.

    Mirror-SD introduces Principled Coarse-Graining (PCG), which verifies proposals at a coarser granularity. Rather than verifying tokens individually, PCG groups draft tokens and uses the target model’s distribution over groups to accept or reject cohesively. This enables larger effective draft sizes without proportional overhead in the draft phase, breaking the latency-acceptance frontier that earlier methods traced.

    The research was validated on Apple Silicon, reflecting practical interest in speculative decoding for on-device inference, a setting where memory bandwidth constraints are more severe than on data center GPUs, making the architectural benefit of speculative decoding correspondingly larger.

    Choosing the Right Draft Model

    The compatibility requirement is the binding constraint in draft model selection. Draft and target models must share the same tokenizer vocabulary, so that token indices correspond one-to-one. In practice, this confines speculative decoding to within-family pairings: a LLaMA-3-8B model drafting for a LLaMA-3-70B target, or a Vicuna-7B drafting for a Vicuna-33B target.

    The parameter ratio matters too. A draft model that is too large adds latency overhead in the draft phase that consumes the gains from fewer target model calls. A draft model that is too small produces low acceptance rates. Empirically, a 10:1 to 15:1 parameter ratio works well across most workloads. A 7B draft for a 70B target is the canonical configuration.

    For teams using EAGLE-style feature-level drafting, the draft head is far smaller than a standard draft model, often just one or two transformer layers trained on top of the target model’s existing representations. This is why EAGLE-class methods consistently outperform standard speculative sampling in published benchmarks: the draft head benefits from the same representations that the target model uses, rather than learning independently from scratch.

    Evaluating Your Workload Before Deployment

    The most common mistake when adopting speculative decoding is using published benchmark speedup numbers without measuring acceptance rates on a representative production sample. Average acceptance length tau is strongly task-dependent. The same model pair might achieve tau = 4.8 on code completion tasks and tau = 1.9 on open-ended creative generation. Published 3-4x speedup figures typically come from mixed benchmarks that average these cases, and may substantially overstate gains for workloads weighted toward high-uncertainty generation.

    The correct evaluation sequence: collect a representative sample of production inputs, run speculative decoding with the candidate draft model, measure average acceptance length, compare wall-clock latency at your target batch size, and benchmark against alternative optimizations including quantization and batching changes.

    vLLM reports draft acceptance metrics in its serving logs, which makes this analysis straightforward. A practical threshold: if tau consistently falls below 2.0 on your production distribution, the overhead of managing the draft model likely outweighs the latency savings. At tau above 3.0, speculative decoding outperforms most alternative single-optimization approaches for interactive latency reduction.
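
    The threshold check is simple to automate once acceptance counts are available. The sketch below assumes a hypothetical data layout, a list of accepted-token counts per draft-verify iteration grouped by workload, and applies the 2.0 and 3.0 rules of thumb from the paragraph above; the sample numbers are invented for illustration.

```python
from statistics import mean


def average_acceptance_length(accepted_per_iteration):
    """tau: mean number of tokens accepted per draft-verify iteration."""
    return mean(accepted_per_iteration)


def recommendation(tau: float) -> str:
    """Apply the rule-of-thumb thresholds to a measured tau."""
    if tau < 2.0:
        return "draft-model overhead likely outweighs the savings"
    if tau > 3.0:
        return "strong candidate for speculative decoding"
    return "borderline: compare wall-clock latency against alternatives"


# Hypothetical per-iteration acceptance counts, grouped by workload.
samples = {
    "code completion": [5, 4, 5, 5, 4, 5],
    "creative writing": [2, 1, 3, 2, 1, 2],
}
for workload, counts in samples.items():
    tau = average_acceptance_length(counts)
    print(f"{workload}: tau = {tau:.2f} -> {recommendation(tau)}")
```

    Splitting the measurement by workload matters more than the exact thresholds: a blended average can hide a traffic segment where speculation costs more than it saves.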

    Batch Inference: Where Speculative Decoding Does Not Help

    Batch inference is where speculative decoding consistently underperforms. In batched serving, multiple requests are processed simultaneously, and the batch can saturate GPU compute units, making the operation compute-bound. Speculative decoding addresses memory-bandwidth bottlenecks, which are not the binding constraint in batched settings.

    NVIDIA’s reported 3.6x throughput improvement on H200 GPUs applies specifically to single-request low-latency settings. For high-throughput batch serving, continuous batching and PagedAttention, vLLM’s core techniques, remain more impactful than speculative decoding. The two optimizations address fundamentally different bottlenecks: speculative decoding targets per-request latency, PagedAttention targets KV cache memory efficiency at scale.

    Foundation models used for batch inference over large datasets, including the class of genomic foundation models like Arc Institute’s Evo 2, which process millions of sequences, are better served by batching optimizations than by speculative decoding. The latency reduction that speculative decoding provides is most valuable in interactive deployments where per-request response time drives user experience.

    Production Implementations in 2025-2026

    All three major open-source serving frameworks now include native speculative decoding support. vLLM supports configurable draft model selection and dynamic speculation length, with separate KV cache management for draft and target models. TensorRT-LLM from NVIDIA includes CUDA-optimized draft-verify kernels with reported 3.6x throughput improvement on H200 hardware for interactive use cases. SGLang supports EAGLE-style feature-level drafting, which achieves higher acceptance rates than size-matched conventional draft models in most benchmarks.

    Combining 4-bit weight quantization (for example, the NF4 format that QLoRA uses) with EAGLE-2 speculative decoding can achieve 5-7x latency reduction over a full-precision autoregressive baseline, at some cost to output quality from quantization. A quantized draft model paired with a quantized target model still benefits from the draft-verify architecture, and the acceptance rate analysis applies to quantized distributions without modification.

    Limitations and Open Questions

    The acceptance rate ceiling is the technique’s fundamental limitation. Even EAGLE-2’s best results show tau well below K. You cannot accept all draft tokens in the general case, because the draft model is not the target model. Tasks with high per-token uncertainty will always see low acceptance rates, and no amount of draft model improvement can eliminate this without making the draft model equivalent to the target model.
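
    The ceiling can be made concrete with a simple model. The sketch below assumes each draft token is accepted independently with the same probability alpha and that verification stops at the first rejection; real acceptance probabilities vary by position and context, so treat the numbers as illustrative. The function name and the speculation length are hypothetical.

```python
def expected_accepted_drafts(alpha, k):
    """Expected accepted draft tokens per step, out of k proposed.

    Assumes each draft token is accepted independently with probability
    alpha and that verification stops at the first rejection.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    # P(at least j tokens accepted) = alpha**j, summed over j = 1..k.
    return sum(alpha ** j for j in range(1, k + 1))


K = 5  # hypothetical speculation length
for alpha in (0.5, 0.7, 0.9, 1.0):
    print(f"alpha={alpha:.1f}: {expected_accepted_drafts(alpha, K):.2f} of {K}")
```

    Under this model the expectation reaches K only when alpha equals 1, which is the case where the draft model always agrees with the target model.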

    The draft model maintenance burden is real in production. When the target model is updated (new fine-tune, new instruction tuning, new version), the draft model may need retraining to maintain compatibility. Teams that deploy speculative decoding inherit a two-model deployment and update pipeline rather than a one-model pipeline.

    Self-speculative decoding solves the maintenance problem but achieves lower speedups. EAGLE solves the speedup problem but requires a separately managed draft artifact. No current approach eliminates both constraints simultaneously, though Mirror-SD’s direction (improving the efficiency-accuracy frontier through principled verification algorithms rather than better drafting) may eventually reduce the dependence on high-quality draft models altogether.

    For practitioners today, speculative decoding is the correct first optimization to evaluate for interactive LLM deployments where per-request latency is the primary constraint. The technique is mature, production-ready, framework-supported, and backed by a proof that the target model’s output distribution is preserved exactly. The question is not whether to use it but which variant fits the deployment constraints, and that depends on whether maintaining a draft model is acceptable in your infrastructure. The inference-cost argument driving compute-optimal training choices applies equally here: smaller inference-optimal models are easier draft targets and achieve higher acceptance rates with proportionally smaller draft heads.

  • LLMs in Veterinary Clinical Practice: What the Evidence Actually Shows

    LLMs in Veterinary Clinical Practice: What the Evidence Actually Shows

    ChatGPT-4.5 scored 90% on feline eye disease cases, versus 96.7% for experienced veterinary ophthalmologists, and significantly outperformed novices, who scored 56-67%. That single data point from a 2025 Veterinary Ophthalmology study contains most of what is actually known about LLM clinical utility in veterinary practice: competitive with novice clinicians on well-defined diagnostic tasks, below expert performance, and not tested on drug dosing decisions where the failure modes are more dangerous.

    Where LLMs Add Value in Veterinary Practice

    Published studies from 2024-2026 document LLM performance on veterinary clinical vignettes for companion animals at levels competitive with veterinary students and recent graduates. The tasks where LLMs perform best are differential diagnosis generation, where broad knowledge coverage matters, and client communication drafting, where fluency matters more than precision. In clinical decision support roles where a veterinarian reviews AI suggestions before acting, the performance gap between LLMs and experienced clinicians is less consequential.

    The Drug Dosing Risk

    Veterinary pharmacology is complicated by the diversity of species in practice. Drug dosing for dogs and cats is moderately well represented in LLM training data. Drug dosing for exotic species (reptiles, birds, small mammals) is sparsely represented, and the consequence of errors is significant: therapeutic windows are narrow and interspecies variation is extreme. A 2025 Ghent University study found that LLM drug dosing recommendations for avian and reptile patients were frequently outside safe therapeutic ranges. The error rate on exotic species dosing is a serious limitation for any veterinary LLM deployment.

    Limitations

    No veterinary LLM study has measured patient outcomes. All evidence is on diagnostic accuracy against held-out cases or clinical vignettes, not on whether LLM-assisted care produces better outcomes than standard care. The studies that exist are predominantly single-center, single-species, and single-task.

    Related coverage: AI in Veterinary Medicine: What the Clinical Evidence Actually Shows | AI-Assisted Zoonotic Disease Detection: From SARS to H5N1 | One Health and Machine Learning: How AI Bridges Human and Animal Disease Surveillance

    Primary sources: 2025 Veterinary Ophthalmology LLM study; Ghent University exotic species dosing study 2025.

  • AI-Assisted Zoonotic Disease Detection: From SARS to H5N1

    AI-Assisted Zoonotic Disease Detection: From SARS to H5N1

    Zoonotic diseases, pathogens that jump from animals to humans, account for approximately 60% of all known infectious diseases and 75% of emerging infectious diseases. AI-assisted surveillance changes the speed and sensitivity of detection by integrating data sources that traditional epidemiological systems treat separately.

    How AI Zoonotic Surveillance Works

    Modern AI zoonotic surveillance systems ingest electronic health records from both human and animal health systems, news and social media for syndromic signals, genomic sequence data for pathogen characterization, satellite imagery for land use change detection, and environmental sensor data for vector presence. Machine learning models trained on historical outbreak data identify spatial-temporal patterns that precede confirmed human cases by days to weeks. During the H5N1 emergence in US dairy cattle in 2024, AI genomic surveillance flagged unusual clade 2.3.4.4b patterns before traditional sequence-based alerts fired.

    The SARS Retrospective

    Retrospective analysis of SARS-CoV-1 data using modern AI surveillance methods shows that algorithm-based systems analyzing Hong Kong hospital admission patterns would have triggered alerts 8 to 12 days before the WHO notification in February 2003. The value of that lead time, in terms of border measures and healthcare surge preparation, is estimated at billions of dollars and thousands of preventable cases in the first wave.

    What AI Cannot Do

    AI surveillance flags anomalies. It cannot determine whether an anomaly is a genuine emerging zoonosis, a local hospital capacity issue, a drought-related enteric disease cluster, or a data entry error. Every AI alert requires epidemiological investigation. False positive burden matters: if AI systems generate too many false alerts, public health systems stop responding to them. The H5N1 dairy cattle situation demonstrated that AI surveillance and human epidemiology still operate in largely separate institutional silos.
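
    The false positive burden follows from base rates, and a short Bayes calculation makes the scale visible. The sensitivity, specificity, and prior below are hypothetical values chosen for illustration; they are not measurements from any surveillance system.

```python
def positive_predictive_value(sensitivity, specificity, prior):
    """Probability that an alert reflects a real event (Bayes' rule)."""
    true_alerts = sensitivity * prior
    false_alerts = (1.0 - specificity) * (1.0 - prior)
    return true_alerts / (true_alerts + false_alerts)


# Hypothetical numbers: 1 in 1,000 monitored region-weeks contains a real
# emerging event, and the detector is 90% sensitive and 95% specific.
ppv = positive_predictive_value(sensitivity=0.90, specificity=0.95, prior=0.001)
print(f"Share of alerts that are real: {ppv:.1%}")
print(f"False alerts per real alert: {(1 - ppv) / ppv:.0f}")
```

    Because genuine emergence events are rare, even a detector that looks accurate in isolation produces mostly false alerts, which is why every alert still needs epidemiological follow-up.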

    Related coverage: AI in Veterinary Medicine: What the Clinical Evidence Actually Shows | One Health and Machine Learning: How AI Bridges Human and Animal Disease Surveillance | LLMs in Veterinary Clinical Practice: What the Evidence Actually Shows

    Primary sources: Li et al. 2025 Biomedical Journal review; GenBank clade 2.3.4.4b surveillance data; retrospective SARS analysis, PubMed indexed.

  • One Health and Machine Learning: How AI Bridges Human and Animal Disease Surveillance

    One Health and Machine Learning: How AI Bridges Human and Animal Disease Surveillance

    3 domains: One Health integrates human, animal, and environmental disease surveillance
    60%: share of emerging infectious diseases in humans that originate in animals
    EHR + social media: multi-source data integration improves influenza forecasting accuracy
    AI4MPOX-SN: Senegal’s One Health AI initiative integrating human-animal-environment data for mpox prediction

    Approximately 60 percent of emerging infectious diseases in humans originate in animals. HIV, Ebola, SARS, MERS, and H5N1 influenza all crossed from animal reservoirs to human populations. The One Health framework, endorsed jointly by the FAO, UNEP, WHO, and WOAH, recognizes that human health, animal health, and environmental health are interdependent and must be monitored together. Machine learning is now being applied to the data integration problem that One Health surveillance has always faced: combining heterogeneous datasets from human clinical systems, veterinary surveillance networks, wildlife monitoring programs, and environmental sensors into a coherent early warning system.

    The November 2025 review in Biomedical Journal by Li et al. from Chang Gung Memorial Hospital and Boston Children’s Hospital documented the current state of AI integration in infection surveillance. The key finding is that integrating social media data improves influenza forecasting accuracy, while wearable technologies enable real-time monitoring of infection dynamics that traditional sentinel surveillance systems cannot capture.

    What One Health Surveillance Actually Collects

    Traditional infection surveillance collects data from one domain at a time. Human surveillance systems collect case reports, laboratory-confirmed diagnoses, and sentinel physician networks. Veterinary surveillance collects farm case reports, wildlife sampling data, and abattoir inspection results. Environmental surveillance collects water quality monitoring, air sampling, and climate data that affects vector ranges. These systems operate in separate institutional frameworks with different data standards, different reporting timelines, and different organizational authorities. Machine learning integration builds models that process all three data streams simultaneously, identifying correlations across the human-animal-environment interface that single-domain surveillance would miss.

    The Senegal AI4MPOX-SN Initiative

    The February 2026 Frontiers in Public Health paper by Faye et al. from Cheikh Anta Diop University in Dakar documented One Health AI applied to mpox surveillance in Senegal. By late October 2025, Senegal had reported seven mpox cases, all in Dakar, following population movement along the Dakar-Thiès-Diourbel corridor. The AI4MPOX-SN initiative proposes to integrate human-animal-environment data using AI for anomaly detection and predictive modeling. The paper documents that the existing system improved reporting and geolocation, but faces challenges including underreporting in rural areas and gaps in data interoperability.

    AI for Mpox in Africa

    The September 2025 Journal of Virological Methods review by Olawade et al. from the University of East London documented AI applications for mpox control across Africa: machine learning for early detection, automated contact tracing through mobile data, and optimization of public health messages for specific communities. The challenges identified apply broadly: limited digital infrastructure, data quality issues in fragmented surveillance systems, and ethical concerns about privacy.

    Wearables and Real-Time Infection Dynamics

    Consumer wearables continuously measure resting heart rate, heart rate variability, skin temperature, and respiratory rate. These parameters change measurably during the early phase of respiratory infection, before the infected person develops symptoms or seeks clinical care. Studies using Fitbit and Apple Watch data demonstrated that elevated resting heart rate in the days before symptom onset is a statistically significant predictor of influenza-like illness. At population scale, aggregate wearable signals can detect rising infection rates faster than clinical case reporting systems.
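
    One simple way to turn a resting heart rate stream into a signal is to compare each day with the same person's recent baseline. The sketch below is an illustration only, not the method of any cited study: the 14-day window, the z-score threshold of 2.0, and the sample series are all hypothetical.

```python
from statistics import mean, stdev


def elevated_rhr_days(resting_hr, baseline_days=14, z_threshold=2.0):
    """Return indices of days whose resting heart rate is unusually high.

    Each day is compared with the mean and standard deviation of the
    preceding `baseline_days` days for the same person.
    """
    flagged = []
    for day in range(baseline_days, len(resting_hr)):
        window = resting_hr[day - baseline_days:day]
        spread = stdev(window)
        if spread == 0:
            continue  # flat baseline: z-score undefined, skip the day
        z = (resting_hr[day] - mean(window)) / spread
        if z > z_threshold:
            flagged.append(day)
    return flagged


# Hypothetical series: 14 ordinary days, then a rise before symptom onset.
series = [58, 59, 57, 58, 60, 59, 58, 57, 59, 58, 60, 58, 59, 58, 64, 66]
print(elevated_rhr_days(series))
```

    A per-person baseline matters because resting heart rate differs widely between individuals. A population-level system would aggregate flags like these across many users rather than act on any single one.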

    Climate Change as a One Health Driver

    Rising temperatures expand the geographic range of vector species including Aedes mosquitoes (dengue, Zika, chikungunya) and Ixodes ticks (Lyme disease, tick-borne encephalitis). Machine learning models incorporating climate variables alongside epidemiological data have shown improved predictive accuracy for vector-borne disease risk, enabling health systems to prepare mosquito control campaigns before outbreaks begin.

    What Happens Next

    The key performance metric is time-to-detection: how many days earlier does integrated AI surveillance detect an emerging outbreak compared to traditional single-domain systems. Whether the AI4MPOX-SN initiative and similar programs produce measurable improvements in outbreak detection speed will become clear in the next three to five years as the systems accumulate operational data.

    Primary sources retrieved from PubMed: Li JH et al., “Artificial intelligence in infection surveillance,” Biomedical Journal 2025;49(2):100929 (PMID: 41205676); Faye SLB et al., Front Public Health 2026;14:1742888 (PMID: 41710307); Olawade DB et al., J Virol Methods 2025;339:115270 (PMID: 41005719).

    Related reading: AI-Assisted Zoonotic Disease Detection: From SARS to H5N1 | AI in Veterinary Medicine | LLMs in Veterinary Clinical Practice | What ASL-3 Actually Means

  • Generative AI for Small Molecule Drug Discovery: How It Works and What the Evidence Shows

    Generative AI for Small Molecule Drug Discovery: How It Works and What the Evidence Shows

    Generative AI for small molecule drug discovery covers a specific set of architectures applied to a specific problem: generating novel molecular structures likely to have desired properties against a biological target. The approaches include variational autoencoders, generative adversarial networks, and diffusion models applied to molecular graphs, SMILES strings, or 3D atomic coordinates. Published results show genuine capability improvements over traditional virtual screening, but no generative AI-designed molecule has yet completed Phase III clinical trials.

    How Generative Molecular AI Works

    VAE-based molecular generation encodes molecules into a continuous latent space, then decodes sampled points back to molecular structures. The latent space can be navigated toward regions with desired predicted properties using gradient-based optimization or Bayesian optimization. GAN-based generation trains a generator to produce molecules that fool a discriminator trained to distinguish real from generated molecules, optimizing simultaneously toward chemical validity and desired property predictions. Diffusion models for molecules, including DiffSBDD and TargetDiff, generate atomic coordinates conditioned on protein binding site geometry, placing atoms iteratively in 3D space by reversing a diffusion process.

    The Validation Challenge

    Virtual screening hit rates using generative AI are substantially higher than traditional docking-based screening in several published benchmarks. Wang et al. 2024 reported a 75% hit rate for AI-designed compounds against a target using ML-based virtual screening across a 106-million-compound library. That 75% figure applies to in vitro binding confirmation, not in vivo efficacy. The gap between in vitro binding and clinical efficacy is where most drug discovery projects fail, and generative AI has not yet demonstrated a systematic advantage in predicting the ADMET properties that determine whether a promising binder becomes a viable drug candidate.

    The Distribution Shift Problem

    Generative AI models are trained on known bioactive molecules, which means they are optimized to produce molecules that resemble the known chemical space of drugs. Novel targets with no known binders require generation outside the training distribution, where model reliability degrades. The targets easiest to hit with generative AI are the ones with the most existing data, which are often the most explored targets with the most existing drugs.

    Limitations

    As of early 2026, no AI-designed drug has completed Phase III. Insilico Medicine’s INS018_055 for IPF reached Phase II with positive interim results. Exscientia and Recursion have compounds in Phase I/II. The clinical evidence that generative AI improves drug discovery outcomes over existing rational design approaches does not yet exist at the Phase III level.

    Related coverage: AI-Driven ADMET Prediction: What the Blind Challenge Results Actually Show | AlphaFold 3 in Drug Discovery: Where It Works and Where It Fails | ESM3: The Protein Language Model That Unifies Sequence, Structure and Function

    Primary sources: Wang et al. 2024 virtual screening benchmark; Insilico Medicine Phase II results; Schneider P et al., Nature Reviews Drug Discovery 2020.

  • AI in Digital Pathology: What Computational Pathology Can and Cannot See

    AI in Digital Pathology: What Computational Pathology Can and Cannot See

    An NIH multi-institution study in Lancet Oncology classified 52 CNS tumor types from tissue images at 80% accuracy across 5,516 test samples. A simultaneous Cancer Science paper documented that label noise in pathologist annotations causes AI pathology accuracy to be systematically overstated. These two findings define the current state of computational pathology.

    The NIH CNS Tumor Study

    Tio et al. 2024 trained a deep learning classifier on digitized H&E stained slides from 19 institutions to distinguish 52 CNS tumor subtypes. The 80% accuracy figure represents performance on held-out cases across three continents. For 12 of the 52 subtypes, accuracy exceeded 90%. For rare subtypes with fewer than 50 training cases, accuracy fell significantly, reflecting the training data bottleneck that affects all rare pathology classification tasks.

    The Label Noise Problem

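    Computational pathology models learn from pathologist annotations. When those annotations contain errors, the models inherit them. The Cancer Science paper by Komura et al. 2024 documented systematic overstatement of AI pathology accuracy. Model performance is typically measured against annotations from the same source used for training, so a model that reproduces an annotator’s systematic errors is scored as correct on exactly the cases where both are wrong. The same noise cuts the other way on individual cases: disagreements between the model and ground truth are counted as model errors even when the model is correct and the annotation is wrong. In both situations the reported figure measures agreement with the annotator rather than with the true diagnosis. The true accuracy ceiling for any pathology AI system is bounded by inter-pathologist agreement on the training labels.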

    Limitations

    Staining variability across institutions, scanner hardware differences, and tissue processing protocols all create distribution shift. Computational pathology models trained at academic centers with standardized protocols may underperform at community hospitals.

    Related coverage: FDA Clearance for AI Medical Devices: What 510(k), De Novo, and PMA Mean | AI in Radiology: Three Phases and What the Clinical Evidence Shows | AI-Driven ADMET Prediction: What the Blind Challenge Results Show

    Primary sources: Tio et al., Lancet Oncology 2024; Komura et al., Cancer Science 2024.

  • FDA Clearance for AI Medical Devices: What 510(k), De Novo, and PMA Actually Mean

    FDA Clearance for AI Medical Devices: What 510(k), De Novo, and PMA Actually Mean

    The FDA had cleared more than 700 AI-enabled medical devices through three distinct pathways as of 2025. The regulatory mechanism determines not only whether a device can be sold but what clinical claims its manufacturer can make and how much post-market surveillance is required. Understanding the difference between 510(k), De Novo, and PMA matters for anyone evaluating AI medical devices for clinical deployment.

    The 510(k) Pathway

    The 510(k) pathway requires a manufacturer to demonstrate substantial equivalence to a legally marketed predicate device. Most AI medical device clearances in the United States come through 510(k). The pathway does not require clinical trials. It requires a comparison to an existing device and evidence that the new device performs at least as well as the predicate on specified performance metrics. For AI radiology tools, the predicate is often an earlier version of the same tool or a non-AI decision-support system.

    The De Novo Pathway

    When no suitable predicate exists, manufacturers can request De Novo classification. The FDA evaluates the novel device against a risk-based standard. If the request is granted, the device can itself serve as a predicate for future 510(k) submissions, so the first AI device authorized through De Novo for a given indication creates the template for subsequent clearances in that category.

    The PMA Pathway

    Premarket Approval applies to the highest-risk devices and requires valid scientific evidence, typically clinical trials, demonstrating reasonable assurance of safety and effectiveness. PMA is rare for AI medical devices. The clinical trial requirement is expensive and slow relative to 510(k). Most AI device manufacturers structure their submissions to qualify for 510(k) rather than PMA.

    The EU AI Act Comparison

    The EU AI Act classifies most medical AI as high-risk and requires conformity assessment, registration in an EU database, and post-market monitoring plans. The March 2026 European Radiology review documented convergence between FDA, EU, and China NMPA frameworks on core requirements including performance testing, bias evaluation, and transparency, with meaningful differences in stringency and enforcement.

    Limitations

    FDA clearance establishes safety and effectiveness relative to a predicate, not superiority over current standard of care. A cleared AI device may be no better than existing tools; clearance does not mean clinically beneficial. Post-market surveillance requirements for cleared AI devices are less rigorous than those applied to high-risk drugs.

    Related coverage: AI in Radiology: Three Phases and What the Clinical Evidence Shows | AI in Digital Pathology: What Computational Pathology Can and Cannot See | Poisoning the Medical Brain: RAG Attacks and Security in Clinical AI

    Primary sources: FDA AI/ML medical device database; March 2026 European Radiology review of international AI medical device frameworks.

  • AI-Driven ADMET Prediction: What the Blind Challenge Results Actually Show

    AI-Driven ADMET Prediction: What the Blind Challenge Results Actually Show

    Approximately 90% of drug candidates entering clinical trials fail, primarily due to inadequate pharmacokinetics and unacceptable toxicity. The 2025 OpenADMET blind challenge had 65+ teams submit predictions before experimental results were revealed. Deep learning beat classical methods for ADME prediction; classical methods remained competitive for potency.

    The AI-PBPK Platform

    Wang et al. at Macau University published an AI-PBPK platform that predicts eight molecular properties from structure alone and feeds them into a physiologically based pharmacokinetic (PBPK) model. Validated against 677 human PK datasets, most AUC predictions fell within a 2- to 3-fold error of experimental data, which is acceptable for early-stage decision-making.
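
    Fold error is the usual way to score pharmacokinetic predictions, and it is simple to compute. The sketch below uses the common symmetric definition (the larger of predicted/observed and observed/predicted); the helper names and the AUC pairs are hypothetical and are not data from the paper.

```python
def fold_error(predicted, observed):
    """Symmetric fold error: 1.0 is a perfect prediction, 2.0 is off by 2x."""
    if predicted <= 0 or observed <= 0:
        raise ValueError("AUC values must be positive")
    return max(predicted / observed, observed / predicted)


def fraction_within(pairs, limit):
    """Share of (predicted, observed) pairs whose fold error is <= limit."""
    errors = [fold_error(p, o) for p, o in pairs]
    return sum(e <= limit for e in errors) / len(errors)


# Hypothetical (predicted AUC, observed AUC) pairs in arbitrary units.
auc_pairs = [(120.0, 100.0), (40.0, 100.0), (250.0, 100.0), (100.0, 350.0)]
print(f"within 2-fold: {fraction_within(auc_pairs, 2.0):.0%}")
print(f"within 3-fold: {fraction_within(auc_pairs, 3.0):.0%}")
```

    The symmetric form treats a 2x overprediction and a 2x underprediction as equally wrong, which is why it is preferred over a plain percentage error for exposure metrics that span orders of magnitude.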

    Where AI ADMET Fails

    Idiosyncratic toxicity reactions depend on individual immune variability that cannot be predicted from molecular structure. Potency prediction still favors classical methods for many targets.

    Related coverage: Generative AI for Small Molecule Drug Discovery | AlphaFold 3 in Drug Discovery: Where It Works and Where It Fails

    Primary sources: Fischer Y et al., J Chem Inf Model 2025;65(24); Wang W et al., Clin Pharmacol Ther 2025;118(4).

  • Poisoning the Medical Brain: RAG Attacks and Security in Clinical AI Systems

    Poisoning the Medical Brain: RAG Attacks and Security in Clinical AI Systems

    Clinical AI systems built on retrieval-augmented generation face a security threat that does not require compromising model weights. Poisoning the knowledge base redirects outputs at inference time without touching the model itself.

    How RAG Poisoning Works

    A RAG system retrieves documents from a knowledge base, injects them into the model’s context, and generates a response grounded in the retrieved content. The attack vector is the knowledge base itself. An adversary who can insert documents into the knowledge base controls what the model retrieves for targeted queries. A poisoned document that appears in the top-k retrieval results will be treated as authoritative source material, because the model has no mechanism to distinguish a poisoned document from a legitimate one. It processes both as retrieved context.

    In clinical EHR systems, the attack surface includes any pathway that allows external data to enter the knowledge base: imported referral letters, patient-submitted documents, third-party lab results, and scanned records. Each is a potential injection point. A document containing instructions that look like clinical content but direct the model to alter its recommendations will be retrieved, processed, and acted upon without any indication to the clinical user that the output was adversarially modified.

    The Security-Utility Problem in Clinical RAG

    Filtering retrieved content for adversarial instructions before it reaches the model faces the same fundamental problem as direct prompt injection filtering: detecting that a piece of text is an instruction requires the same semantic understanding that makes the model vulnerable. Aggressive filtering produces false positives that block legitimate clinical content. Permissive filtering leaves the injection surface open.

    Access control on the knowledge base is the most effective mitigation at the retrieval layer: ensuring that only authorized documents can enter the knowledge base, and that retrieval is scoped to documents the querying user is authorized to access. This prevents cross-patient poisoning attacks and limits the blast radius of any individual poisoned document to users authorized to retrieve it.
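
    The scoping idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production design: the `Document` fields, the `TRUSTED_SOURCES` policy, and the precomputed similarity scores are all hypothetical, and a real system would enforce the trusted-source rule at ingestion as well as at retrieval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    patient_id: str
    source: str      # e.g. "internal_note", "patient_upload"
    score: float     # similarity to the query, assumed precomputed


# Hypothetical policy: only these ingestion paths are trusted for retrieval.
TRUSTED_SOURCES = {"internal_note", "verified_lab"}


def scoped_top_k(candidates, allowed_patients, k=3):
    """Apply access scoping before ranking, then return the top-k documents."""
    permitted = [
        d for d in candidates
        if d.patient_id in allowed_patients and d.source in TRUSTED_SOURCES
    ]
    return sorted(permitted, key=lambda d: d.score, reverse=True)[:k]


candidates = [
    Document("a", "p1", "internal_note", 0.71),
    Document("b", "p2", "internal_note", 0.93),   # other patient
    Document("c", "p1", "patient_upload", 0.97),  # untrusted ingestion path
    Document("d", "p1", "verified_lab", 0.64),
]
print([d.doc_id for d in scoped_top_k(candidates, allowed_patients={"p1"})])
```

    Filtering happens before ranking on purpose: the two highest-scoring candidates here are exactly the ones that should never reach the model, so ranking first and filtering afterwards would let a poisoned document crowd legitimate ones out of the top-k.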

    The RAG poisoning attack surface maps directly to two entries in the OWASP LLM Top 10 for 2025: LLM08 (Vector and Embedding Weaknesses) and LLM01 (Prompt Injection). OWASP’s 2025 update added LLM08 specifically because RAG architectures became the dominant deployment pattern and the associated attack surfaces had not been formally catalogued before. The poisoned document in a RAG knowledge base is the most common vector for indirect prompt injection in production systems: the attacker does not interact with the LLM directly, but places instructions in content the system will retrieve and process. For the empirical evidence on which defenses reduce injection success rates at scale, including session-level detection and output-level auditing, see the Gandalf the Red analysis.

    Related coverage: Prompt Injection Succeeds 94% of the Time Against Clinical LLMs | FDA Clearance for AI Medical Devices

    Primary sources: Patel SB and Lam K, JAMA Network Open 2024. Zou et al., arXiv 2402.07927.

  • RFdiffusion and ProteinMPNN: How AI Now Designs Proteins From Scratch

    RFdiffusion and ProteinMPNN: How AI Now Designs Proteins From Scratch

    RFdiffusion generates protein backbones. ProteinMPNN designs the amino acid sequences that fold into those backbones. Together, the two tools constitute the first genuinely useful pipeline for de novo protein design at scale. ProteinMPNN was published in Science in 2022 and RFdiffusion in Nature in 2023, both by teams at the Baker Lab at the University of Washington. Both tools are open-source and have been applied to drug discovery, enzyme engineering, and vaccine antigen design.

    How RFdiffusion Works

    RFdiffusion adapts the diffusion process used in image generation to protein backbone geometry. Starting from random atomic noise, the model iteratively denoises a cloud of Calpha coordinates toward a physically reasonable protein backbone, conditioned on any structural or functional constraints specified by the user. Conditioning can specify binding partners, active site geometries, symmetry requirements for multimers, or target binding interfaces. The model was trained on protein structures from the PDB using a denoising score matching objective.

    How ProteinMPNN Works

    Given a backbone geometry from RFdiffusion or any other source, ProteinMPNN performs inverse folding: it predicts the amino acid sequence most likely to adopt that backbone conformation when folded. The model was trained to predict the sequence of a protein given its backbone coordinates and a masked or alternative sequence context, using a graph neural network architecture that encodes backbone geometry as a set of distance and angle features.

    The Influenza Binding Interface Result

    A 2023 Science paper from the Baker Lab designed binders to the influenza hemagglutinin stem region using RFdiffusion plus ProteinMPNN. When crystal structures were solved, the designed binders matched the computational models to sub-Angstrom backbone RMSD. The binding affinity was in the nanomolar range. This was the first demonstration of fully computational protein design achieving functional binders to a validated drug target without any experimental optimization cycles after the initial computational design.
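
    Backbone RMSD is the root-mean-square distance between matching atoms in the designed and experimentally solved structures. The sketch below assumes the two structures are already superimposed, which real pipelines do first with an alignment step such as the Kabsch algorithm; the coordinates are hypothetical and exist only to show the arithmetic.

```python
import math


def backbone_rmsd(design, experiment):
    """RMSD in Angstroms between two equal-length lists of (x, y, z) atoms.

    Assumes the structures are already superimposed; real pipelines align
    them first (for example with the Kabsch algorithm).
    """
    if len(design) != len(experiment) or not design:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(design, experiment))
    return math.sqrt(squared / len(design))


# Hypothetical C-alpha coordinates (Angstroms) for a three-residue fragment.
designed = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
solved = [(0.3, 0.0, 0.0), (3.8, 0.4, 0.0), (7.6, 0.0, 0.5)]
rmsd = backbone_rmsd(designed, solved)
print(f"RMSD = {rmsd:.2f} A, sub-Angstrom: {rmsd < 1.0}")
```

    Sub-Angstrom agreement means the average backbone atom sits less than 1 Angstrom from its designed position, which is smaller than the length of a typical covalent bond.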

    Limitations and Dual-Use Concerns

    RFdiffusion and ProteinMPNN together achieve roughly 1.5% success rates on novel target binder design from scratch, meaning 98.5% of designed sequences fail experimental characterization. They are not a replacement for experimental protein engineering. The dual-use concern is real: the same pipeline that designs therapeutic proteins can design novel proteins with other functions, including potential toxins with novel sequences that evade biosecurity screening.
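
    A success rate near 1.5% translates directly into how many designs have to be ordered and tested. The sketch below assumes each design succeeds independently with the same probability, which is a simplification: real campaigns filter candidates computationally first and success rates vary by target. The function name is hypothetical.

```python
import math


def designs_needed(success_rate, target_probability):
    """Smallest n with P(at least one success in n designs) >= target.

    Assumes each design succeeds independently with the same probability.
    """
    if not (0.0 < success_rate < 1.0 and 0.0 < target_probability < 1.0):
        raise ValueError("both arguments must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target_probability) / math.log(1.0 - success_rate))


SUCCESS_RATE = 0.015  # the roughly 1.5% figure quoted above
for target in (0.50, 0.90, 0.99):
    print(f"{target:.0%} chance of a hit: test {designs_needed(SUCCESS_RATE, target)} designs")
```

    The count grows with the logarithm of the failure probability you are willing to tolerate, so moving from a coin-flip chance of one hit to near-certainty costs several times as many experimental characterizations.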

    Related coverage: ESM3: The Protein Language Model That Unifies Sequence, Structure and Function | AlphaFold 3 in Drug Discovery: Where It Works and Where It Fails | How Protein Language Models Learned to Design Dangerous Proteins

    Primary sources: Watson JL et al., Nature 2023 (RFdiffusion); Dauparas J et al., Science 2022 (ProteinMPNN).

  • What ASL-3 Actually Means: Anthropic’s Biorisk Threshold Explained

    ASL-3: level at which Anthropic says models could provide serious uplift on bioweapons
    4 labs: Anthropic, OpenAI, Google DeepMind, xAI with frontier safety eval commitments
    VCT: Virology Capabilities Test, Anthropic’s red team benchmark for bioweapon uplift
    2x: ASL-3 trigger, model doubles the number of people who could create mass-casualty bio threat

    Anthropic’s Responsible Scaling Policy defines four AI Safety Levels, with ASL-3 being the threshold at which the company says its models could provide serious uplift to someone attempting to create biological, chemical, nuclear, or radiological weapons with the potential for mass casualties. The company’s stated commitment is to pause deployment and restrict access to models that reach ASL-3 until specified safety measures are in place. Anthropic assessed its Claude 3 generation models as ASL-2: more capable than a Google search for harmful information but not yet providing the kind of end-to-end synthesis-level uplift that would constitute ASL-3. In May 2025 the company activated ASL-3 protections for Claude Opus 4 as a precautionary measure, stating that it had not determined the model had definitively crossed the threshold.

    The Virology Capabilities Test

    Anthropic uses the Virology Capabilities Test as part of its ASL evaluation process for biological risks. The VCT is a red-team benchmark assessing how much uplift an LLM provides to someone attempting tasks in the bioweapon creation pathway: pathogen identification, acquisition, enhancement of transmissibility or lethality, and weaponization. The specific questions, scoring methodology, and threshold score that would trigger ASL-3 designation are internal to Anthropic. External researchers cannot independently verify whether a model passes or fails the VCT.

    What the Scale AI Study Found

    The February 2025 study from Scale AI and SecureBio (arXiv 2602.23329) provided the most detailed public empirical data on LLM bioweapon uplift to date. The study recruited biology experts and novices, had them attempt tasks relevant to bioweapon creation with and without LLM assistance, and measured the gap. The headline finding: LLM assistance gave novices approximately 4x uplift on the biological tasks tested. Whether that 4x figure constitutes ASL-3-level uplift depends on interpretation of the threshold that Anthropic has not made fully public.

    The Limitations of Self-Assessment

    The RSP framework is self-regulatory. Anthropic evaluates its own models against its own thresholds using its own methodology and makes its own determination about whether to deploy. There is no independent third-party verification of the VCT results, no government audit of the threshold-setting methodology, and no legal consequence for deploying a model that fails internal safety evaluations. All major frontier lab safety frameworks are currently self-regulatory. The question is whether voluntary frameworks with self-assessment are adequate given the stakes, or whether bioweapon uplift risk requires the kind of independent verification that the Nuclear Regulatory Commission applies to nuclear facilities.

    Related reading: LLMs Give Novice Biologists 4x Uplift on Dangerous Tasks | Protein Language Models and Biosecurity Dual-Use Risk | DNA Synthesis Screening Cannot Keep Up With AI-Designed Sequences

    Primary sources: Anthropic Responsible Scaling Policy (September 2023, updated 2024); Mouton CA et al. (Scale AI/SecureBio), arXiv:2602.23329 (February 2025).

  • DNA Synthesis Screening Cannot Keep Up With AI-Designed Sequences

    Screen: DNA synthesis companies now screen orders for sequences matching select agents
    AI bypass: AI protein design can produce functional analogs that evade sequence-similarity screening
    IBBIS: International Biosecurity and Biosafety Initiative for Science developing updated screening
    Structure: structure-based screening proposed but not yet deployed at scale

    DNA synthesis companies serve as a critical chokepoint in the biosecurity ecosystem. The argument is straightforward: to create a biological threat agent, an actor needs to obtain its genetic sequence in physical DNA form. DNA synthesis companies, which produce custom DNA sequences on demand for legitimate research, represent the last physical control point before that sequence enters the world. Screening orders against databases of dangerous sequences before synthesizing them should prevent acquisition of threat agents through commercial channels.

    How Current Screening Works

    The International Gene Synthesis Consortium, representing the major commercial DNA synthesis providers, has maintained a voluntary screening commitment since 2009. The screening approach uses sequence alignment algorithms to compare ordered sequences against databases of select agents and toxins listed under biosafety regulations. Orders matching dangerous sequences above a threshold similarity are flagged for manual review and potential rejection.
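
    The flag-above-threshold logic can be sketched in a few lines. This toy version uses k-mer overlap as a stand-in for the alignment algorithms real providers use, and the sequences, threshold, and function names are all placeholder assumptions rather than any provider's actual protocol.

```python
def kmers(seq, k):
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_similarity(order, reference, k=4):
    """Fraction of the order's k-mers that also occur in the reference."""
    order_kmers = kmers(order, k)
    if not order_kmers:
        return 0.0
    return len(order_kmers & kmers(reference, k)) / len(order_kmers)

def screen_order(order, database, threshold=0.8, k=4):
    """Names of database entries whose similarity meets the review threshold."""
    return [name for name, ref in database.items()
            if kmer_similarity(order, ref, k) >= threshold]

# Placeholder strings, not real sequences of concern.
toy_database = {"entry_A": "ACGTACGTTAGCCGATAGGC", "entry_B": "TTTTGGGGCCCCAAAATTTT"}
flagged = screen_order("ACGTACGTTAGCCGATAGGC", toy_database)
cleared = screen_order("CATCATCATCAT", toy_database)
```

    Anything in `flagged` would go to manual review. The design is only as good as the similarity measure: an order that shares little sequence with every database entry returns an empty list regardless of what it encodes.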

    The AI Bypass Problem

    The 2024 arXiv preprint from the Johns Hopkins Center for Health Security (arXiv 2406.08027) documented a specific vulnerability. AI protein design tools including ESM3 and RFdiffusion can generate novel sequences with similar three-dimensional structure and function to dangerous proteins but with low sequence similarity to any known protein in screening databases. A viral toxin redesigned by AI to be functionally equivalent but sequentially dissimilar could pass sequence-based screening while retaining biological activity. The IBBIS proposal for functional screening uses AI models that predict protein function from sequence to flag sequences likely to produce dangerous functional outputs, regardless of their similarity to known threat agents.

    What Policy Has Done

    The October 2023 Biden Executive Order on AI specifically addressed AI-enabled biosecurity risks and required NIST, NIAID, and other agencies to develop screening requirements for AI-designed genomic sequences. The IBBIS consortium published a technical framework for next-generation screening in 2024. As of early 2026, large commercial synthesis providers have begun piloting AI-augmented functional screening, but the transition from sequence-similarity to function-based screening is not yet complete across the industry.

    Related coverage: How Protein Language Models Learned to Design Dangerous Proteins | LLMs Give Novice Biologists 4x Uplift on Dangerous Tasks | What ASL-3 Actually Means: Anthropic’s Biorisk Threshold Explained

    Primary sources: Tucker JB et al., “Governing Dangerous Pathogens: A Framework Based on Risk,” available at IBBIS; arXiv preprint on AI bypass of DNA synthesis screening, arXiv 2406.08027.

  • Evo 2: The Genomic Foundation Model Trained on 9.3 Trillion DNA Bases

    In March 2025, Arc Institute published Evo 2 in Science. The genomic foundation model was released at 7B and 40B parameters, with the 40B version trained on 9.3 trillion base pairs from approximately 128,000 species. Evo 2 processes up to 1 million base pairs in a single context window.

    The Architecture

    Evo 2 is built on StripedHyena, a hybrid architecture alternating attention layers with Hyena long-range convolution operators. Standard transformer attention has quadratic computational cost with sequence length. Hyena scales sub-quadratically, making million-base-pair contexts tractable.
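
    A back-of-the-envelope comparison shows why the scaling difference matters at this context length. The sketch assumes an N log N operation count as a stand-in for "sub-quadratic" (the cost of an FFT-based long convolution) and ignores constant factors, hidden dimensions, and the attention layers StripedHyena still contains.

```python
import math

def attention_pair_count(seq_len):
    """Pairwise interactions in full self-attention: grows as N^2."""
    return seq_len ** 2

def fft_conv_op_count(seq_len):
    """Rough N * log2(N) operation count for an FFT-based long convolution."""
    return seq_len * math.log2(seq_len)

context = 1_000_000  # the 1 million base pair window quoted above
ratio = attention_pair_count(context) / fft_conv_op_count(context)
```

    At a million positions the quadratic term is tens of thousands of times larger than the N log N term, which is the gap that makes full attention over whole genomic regions impractical.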

    At 9.3 trillion training tokens on a 40B parameter architecture, Evo 2 represents an extreme case of inference-optimal training: roughly 230 tokens per parameter, an order of magnitude beyond the Chinchilla-optimal 20:1 ratio. The choice reflects the same logic that drives LLaMA and Qwen to train smaller models on more data: inference will run millions of times and training runs once.
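
    The ratio is simple to check from the figures in the paragraph above. The function name is hypothetical; the inputs are the token count, parameter count, and 20:1 rule of thumb quoted in the text.

```python
def tokens_per_parameter(training_tokens, parameters):
    """Training tokens seen per model parameter."""
    return training_tokens / parameters

evo2_ratio = tokens_per_parameter(9.3e12, 40e9)  # 9.3 trillion tokens, 40B parameters
chinchilla_ratio = 20.0                          # rule-of-thumb ratio quoted in the text
multiple_of_chinchilla = evo2_ratio / chinchilla_ratio
```

    This gives about 232 tokens per parameter, or a little under 12 times the Chinchilla ratio.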

    Mutation Effect Prediction

    Evo 2 achieves state-of-the-art variant effect prediction across coding and non-coding genomic regions. Approximately 90% of disease-associated variants identified in genome-wide association studies fall in non-coding regions, where protein-only models cannot assess functional impact.

    CRISPR System Generation

    Evo 2 can generate complete CRISPR-Cas system designs that are functional in experimental characterization and whose sequences diverge from known natural systems.

    Biosecurity Considerations

    The same capability enabling generation of novel functional genetic elements for therapeutic research applies to potential pathogen enhancement. Arc Institute released Evo 2 open-source, which became a focal point in the debate about whether genomic foundation models at this scale should be openly released.

    Related coverage: ESM3: The Protein Language Model That Unifies Sequence, Structure and Function | How Protein Language Models Learned to Design Dangerous Proteins | DNA Synthesis Screening Cannot Keep Up With AI-Designed Sequences

    Primary source: Merchant A et al., Science 2025;388:eads9889.

  • ESM3: The Protein Language Model That Unifies Sequence, Structure and Function

    ESM3 from EvolutionaryScale is a generative protein language model that reasons simultaneously across sequence, structure, and function. Published in Science in 2024, the 98-billion-parameter model accepts any combination of partial sequence, partial structure, and functional annotations as conditioning input and generates completions across all three modalities. The model treats protein sequence, structure, and function as three channels of the same underlying biological information, trainable jointly through masked prediction objectives applied across all three.

    The VQ-VAE Structural Tokenization

    To make protein structure tractable as a language model input, ESM3 encodes 3D backbone coordinates through a Vector Quantized Variational Autoencoder that converts continuous coordinate representations into discrete structural tokens. This allows the transformer architecture to treat backbone geometry the same way it treats amino acid sequence tokens: as discrete elements in a vocabulary over which attention operates. The VQ-VAE approach introduced a quantization loss that required careful training to prevent codebook collapse, where most structural tokens cluster around a small number of centroids.
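
    The quantization step at the heart of a VQ-VAE is a nearest-neighbor lookup into a codebook. The sketch below shows only that lookup plus a usage count for spotting collapse; the codebook, feature vectors, and function names are toy assumptions, and ESM3's actual encoder, decoder, and training losses are not represented.

```python
import math

def quantize(vector, codebook):
    """Index of the codebook entry closest to `vector` in Euclidean distance."""
    return min(range(len(codebook)), key=lambda i: math.dist(vector, codebook[i]))

def codebook_usage(vectors, codebook):
    """How often each code is selected; a few dominant codes suggest collapse."""
    counts = [0] * len(codebook)
    for v in vectors:
        counts[quantize(v, codebook)] += 1
    return counts

# Toy 2D codebook and encoder outputs standing in for learned structural features.
toy_codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
toy_features = [(0.1, -0.2), (0.9, 1.2), (1.1, 0.8), (4.7, 5.3)]
tokens = [quantize(v, toy_codebook) for v in toy_features]
usage = codebook_usage(toy_features, toy_codebook)
```

    Here `tokens` is the discrete sequence a transformer would consume in place of continuous coordinates. Monitoring something like `usage` during training is one simple way to notice when most inputs map to a handful of codes.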

    The GFP Design Demonstration

    EvolutionaryScale demonstrated ESM3’s generative capability by generating a sequence for a new green fluorescent protein with only 58% sequence identity to any known natural GFP, then synthesizing and characterizing it experimentally. The protein folded and fluoresced. The evolutionary distance from known sequences is roughly equivalent to the distance from modern humans to Cambrian animal phyla. The demonstration was compelling as a capability proof. It was not a drug discovery result.

    Limitations

    ESM3 was trained predominantly on soluble, single-domain proteins with well-characterized structures in the PDB. Membrane proteins, intrinsically disordered proteins, and large multi-domain complexes are underrepresented in training data. The model’s performance on these classes is substantially worse than on soluble globular proteins.

    This data scarcity problem illustrates a constraint that compute-optimal scaling laws cannot solve: when training data has biological ceiling limits rather than web-crawl limits, adding compute does not close the gap. ESM3’s 98B parameter count is well beyond what the available structural data can optimally train.

    Related coverage: AlphaFold 3 in Drug Discovery: Where It Works and Where It Fails | Evo 2: The Genomic Foundation Model Trained on 9.3 Trillion DNA Bases | RFdiffusion and ProteinMPNN: How AI Now Designs Proteins From Scratch

    Primary source: Hayes T et al., Science 2024 (ESM3).

  • AlphaFold 3 in Drug Discovery: Where It Works and Where It Fails

    AlphaFold 3 expanded the original AlphaFold 2 architecture from protein structure prediction to joint structure prediction of proteins, nucleic acids, small molecules, and ions simultaneously. Published in Nature in May 2024, it achieved state-of-the-art accuracy on protein-protein interaction interfaces and showed significant improvements on protein-nucleic acid complexes. For drug discovery, the question is whether the accuracy improvements matter where they need to matter most: predicting ligand-binding poses for drug targets.

    Where AlphaFold 3 Works

    AlphaFold 3 replaces AlphaFold 2’s Evoformer with a Pairformer trunk that feeds a diffusion module, which generates 3D coordinates for all molecular components simultaneously rather than predicting backbone then sidechain conformations sequentially. On protein backbone prediction, it matches or exceeds AlphaFold 2 performance. On antibody-antigen interfaces, it achieves sub-Angstrom accuracy in benchmark conditions. For targets where the binding site geometry is dominated by the protein backbone rather than flexible loops, AlphaFold 3 predictions are useful for computational docking.

    Where AlphaFold 3 Fails

    A benchmark study from the Shanghai Institute of Materia Medica (2024) tested AlphaFold 3 on GPCRs, the targets of approximately 33% of all FDA-approved drugs. AlphaFold 3 showed significant discrepancies in ligand-binding pose prediction for ions, flexible peptides, and protein ligands at GPCR binding sites. The problem is that GPCRs have highly flexible extracellular loops whose conformations shift dramatically depending on the bound ligand. AlphaFold 3 generates single predicted structures; it does not natively model conformational ensembles. For GPCR drug discovery, experimental structure determination via cryo-EM or X-ray crystallography with the ligand bound remains necessary.

    The Insilico Medicine Workaround

    Insilico Medicine’s AI drug discovery pipeline uses AlphaFold 3 backbone predictions as a starting point, then applies molecular dynamics simulation to generate conformational ensembles around flexible binding sites before docking candidate compounds. This hybrid approach addresses the static structure limitation but requires substantially more compute per target than pure AlphaFold 3 predictions.

    Limitations

    AlphaFold 3’s training data contains solved crystal structures, which are themselves snapshots of single conformations. Models trained on static structures systematically underestimate conformational flexibility. The model cannot predict allosteric conformational changes or cryptic binding sites that open on ligand binding.

    Related coverage: ESM3: The Protein Language Model That Unifies Sequence, Structure and Function | RFdiffusion and ProteinMPNN: How AI Now Designs Proteins From Scratch | AI-Driven ADMET Prediction: What the Blind Challenge Results Show

    Primary sources: Abramson J et al., Nature 2024 (AlphaFold 3); Shanghai Institute of Materia Medica GPCR benchmark 2024, PubMed indexed.

  • Radiology Foundation Models: What Merlin, the 22% Hallucination Rate, and ED Fracture Data Tell Us

    Radiology AI has been dominated by narrow task-specific models trained on single imaging modalities for single findings. Merlin, published in Nature Medicine in 2024 by researchers at Mass General Brigham and Harvard Medical School, is a 3D radiology foundation model trained on 110,000 CT volumes. Its premise is to learn general anatomical representations first, then apply them to any downstream task with far less labeled data than task-specific models require.

    What Makes Merlin Different

    Most radiology AI models operate on 2D slices. Merlin processes full 3D CT volumes at native resolution, learning anatomical relationships across axial, coronal, and sagittal planes simultaneously. The pretraining objective combines reconstruction of masked anatomical regions with contrastive learning between imaging and radiology report text. This image-text contrastive pretraining follows the same architectural logic as CLIP-based vision-language models, applied to a medical domain where image-caption pairs are CT volumes paired with radiologist reports rather than web-scraped photographs paired with alt text.
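
    For readers unfamiliar with CLIP-style training, the sketch below shows a symmetric contrastive (InfoNCE) loss over a batch of paired embeddings. It is a generic illustration, not Merlin's published loss: the embeddings are toy vectors, the temperature of 0.07 is an assumed value, and the reconstruction objective mentioned above is left out.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target_index):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target_index]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: image i should match report i and vice versa."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    sims = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    image_to_text = sum(cross_entropy_row(sims[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy_row([sims[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2

# Toy embeddings: aligned pairs give a low loss, swapped pairs a high one.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

    Minimizing this loss pulls each CT volume's embedding toward its own report and away from the other reports in the batch, which is what lets the image encoder pick up report-level concepts without per-finding labels.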

    On downstream tasks, Merlin matched or exceeded task-specific models while requiring approximately 6x fewer labeled fine-tuning examples.

    The Annotation Bottleneck

    Expert radiology annotations are expensive: annotating a single CT volume for complex segmentation can take 30 to 90 minutes. Merlin-class foundation models directly address this by reducing labeled data requirements for new tasks.

    What Foundation Models Still Cannot Do

    Merlin benchmarks reflect performance on tasks included in the training distribution. The model has not been evaluated on rare pathology types, pediatric populations, or imaging protocols significantly different from Mass General Brigham standards. Distribution shift remains an unresolved problem for all radiology foundation models.

    What Happens Next

    The trajectory is toward multimodal foundation models processing CT, MRI, PET, and radiograph simultaneously. The regulatory pathway for foundation model-derived radiology tools under the FDA PCCP framework is an active area of policy development.

    Related coverage: AI in Radiology: Three Phases and What the Clinical Evidence Shows | FDA Clearance for AI Medical Devices: What 510(k), De Novo, and PMA Mean | Poisoning the Medical Brain: RAG Attacks and Security in Clinical AI

    Primary source: Blankemeier L et al., Nature Medicine 2024.

  • AI in Radiology: Three Phases and What the Clinical Evidence Shows

    556: papers in Li et al. 2025 bibliometric analysis of radiology AI clinical outcomes
    3 phases: rule-based CAD, DL benchmarks, clinical deployment validation, each with distinct failure modes
    700+: AI radiology tools cleared by FDA through 2025, predominantly detection not molecular prediction
    0: large RCTs demonstrating radiology AI reduces mortality at population level

    Radiologists at Mass General Brigham began reading AI-flagged chest CT cases in 2019. By 2023, the institution had accumulated enough outcome data to run a retrospective analysis. The finding, published in Radiology, was not that AI made radiologists faster. It was that the specific radiologists who adopted AI tools showed no significant difference in miss rates compared to those who did not. The AI was being used. It was not changing outcomes.

    This result is not a failure of the technology or the radiologists. It is the expected output of deploying phase-two AI into a phase-three world.

    Phase One: Rule-Based CAD and the False Positive Problem

    Computer-aided detection in radiology has a longer history than the deep learning narrative suggests. The first FDA-cleared CAD system for mammography launched in 1998. R2 Technology’s ImageChecker received clearance based on a dataset of 1,083 mammograms and a claimed 14% improvement in cancer detection. Hospitals bought it. Radiologists used it. And for a decade, CAD was the standard of care for screening mammography at any well-resourced institution.

    The clinical evidence caught up in 2011. Lehman et al. published a retrospective analysis in JAMA covering 684,956 mammography examinations across 90 radiology facilities. Facilities that used CAD had a higher recall rate (meaning more women called back for additional imaging) but no statistically significant improvement in invasive cancer detection at 1 year. CAD was generating alerts. The alerts were not finding cancer. They were finding things that required follow-up, generating cost and patient anxiety, without producing a compensating benefit in outcomes.

    The phase-one lesson: a system that is sensitive enough to flag suspicious regions is not necessarily specific enough to flag the right ones. Rule-based CAD was designed to maximize sensitivity on its training distribution. It was tuned to minimize missed cancers, not to minimize unnecessary recalls. These are different optimization targets, and the clinical system pays the cost of ignoring the latter.

    Phase Two: The Deep Learning Benchmark Era

    ImageNet changed the architecture of radiology AI research more than any clinical finding. Between 2016 and 2021, convolutional neural networks trained on large annotated medical imaging datasets produced a series of benchmark results that circulated as evidence that AI would replace radiologists within a decade.

    The landmark papers are genuinely impressive. Rajpurkar et al. at Stanford published CheXNet in 2017, a 121-layer DenseNet trained on 112,120 chest X-rays that reported F1 scores matching or exceeding the average of four radiologists on pneumonia detection. McKinney et al. at Google reported in Nature in 2019 that a DL model for breast cancer screening outperformed an average of six radiologists, with a 5.7% reduction in false positives and a 9.4% reduction in false negatives on US data. Ardila et al. reported in Nature Medicine that a DL model for lung cancer detection outperformed six radiologists on low-dose CT.

    Every result was real on the test set it was measured against. Every comparison used radiologists reading images in isolation, without patient history, prior studies, or the ability to request additional imaging. Working radiologists do not interpret images this way. The benchmark measured a constrained version of radiologist performance against a model trained specifically on the task. The comparison was never between AI and clinical radiology. It was between AI and the subset of radiologist cognition that fits inside a forced-choice classification task.

    Phase two also created a benchmark replication problem. External validation of published DL radiology models on independent test sets routinely showed performance drops of 10 to 20 percentage points. A 2021 Nature Machine Intelligence paper by Oakden-Rayner et al. found that multiple high-profile chest X-ray classification models had learned to predict from spurious features, including the radiographic markers attached to pacemakers, which are absent in healthy patients, rather than from pathology signals. The models were right for the wrong reasons.

    Phase Three: Clinical Deployment Validation

    The 556-paper bibliometric analysis by Li et al., published in European Radiology in 2025, is the most comprehensive systematic review of radiology AI clinical evidence available. The researchers categorized studies by whether they measured imaging performance metrics (benchmark accuracy) or patient-level clinical outcomes (what happened to the patient). Of studies reaching the clinical outcomes tier, the majority found no statistically significant improvement over standard of care.

    Three mechanisms account for most of the gap. First, distribution shift: models trained on imaging data from academic medical centers fail on community hospital scanners using different reconstruction kernels, contrast protocols, and patient demographics. A model that achieves 94% AUC on Mass General Brigham data may achieve 78% on a rural Oklahoma hospital’s equipment. Second, alert fatigue: AI triage tools that flag 15% of cases as requiring urgent review generate radiologist fatigue when the positive predictive value is low. Radiologists learn to treat the AI output as noise. Third, workflow integration failure: AI outputs that arrive in the PACS two hours after the radiologist has already read the case add zero clinical value regardless of accuracy.
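
    The alert-fatigue mechanism follows directly from Bayes' rule. The sketch below uses hypothetical operating characteristics (95% sensitivity, 87% specificity, 3% prevalence) chosen so that about 15% of cases are flagged; none of these numbers come from a specific product or study.

```python
def flag_rate(sensitivity, specificity, prevalence):
    """Fraction of all cases the tool flags (true plus false positives)."""
    return sensitivity * prevalence + (1.0 - specificity) * (1.0 - prevalence)

def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a flagged case is a true positive."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical triage tool: illustrative numbers only.
rate = flag_rate(0.95, 0.87, 0.03)
ppv = positive_predictive_value(0.95, 0.87, 0.03)
```

    With these inputs the tool flags roughly 15% of studies, but fewer than one in five flags is a true finding. A radiologist who sees four false alarms for every real one has a rational reason to start discounting the output.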

    The multicenter thymus CT validation study published in European Radiology in 2025 represents what phase-three success looks like. The task (segmenting thymic tissue from chest CT) has high inter-reader variability among human radiologists because the thymus involutes with age and its boundaries are genuinely ambiguous in adults. AI segmentation achieved lower inter-reader variability than the human baseline, on a clinically relevant task, across multiple institutions. The AI was solving a real problem where human performance was legitimately inconsistent. That is why the validation held.

    Where AI Radiology Actually Works in 2026

    Validated clinical gains cluster around applications with three shared properties: the task is well-defined, the consequence of delay is measurable, and the training distribution matches the deployment environment. Intracranial hemorrhage triage on non-contrast CT meets all three criteria. The finding is binary, the clinical consequence of a missed or delayed read is stroke progression, and the imaging protocol is standardized enough that distribution shift is manageable. Multiple prospective studies have confirmed that AI triage reduces door-to-treatment time in stroke workflows.

    Lung nodule tracking software meets the first and third criteria. The Lung-RADS classification system provides a standardized framework, training data is abundant, and the benefit (reduced missed follow-up on incidental findings) is directly measurable in health system audit data. Automated bone age estimation from pediatric hand X-rays has shown reliable performance across demographics because the Greulich-Pyle atlas provides a standardized reference and the imaging protocol does not vary significantly between institutions.

    Where AI radiology fails to demonstrate clinical benefit is in applications that require contextual clinical reasoning, integration of longitudinal patient data, or adaptation to imaging protocols not represented in the training distribution. These are not tractable with current architectures. Foundation models such as Merlin, covered separately in our analysis of the Nature Medicine paper, address the data efficiency problem by reducing the labeled examples required for fine-tuning. They do not resolve distribution shift on their own.

    Limitations of the Current Evidence Base

    No large randomized controlled trial has demonstrated that radiology AI integration reduces mortality, morbidity, or total cost of care at the population level. The clinical evidence base consists predominantly of retrospective single-center studies with known selection bias toward successful implementations. FDA clearance under the 510(k) pathway establishes substantial equivalence to a predicate device, not superiority over standard of care. Cleared does not mean clinically beneficial.

    Alert fatigue has not been rigorously measured across institutions. Most implementations do not publish their override rates, which are the most direct signal of whether radiologists trust the AI output. The models generating alerts are typically black boxes even under FDA clearance, and the interpretability tools available do not explain individual predictions in terms radiologists can validate against clinical reasoning.

    The phase-three evidence gap is not a temporary problem that more deployment will resolve. It reflects a structural mismatch between how models are evaluated before deployment (controlled test sets, isolated image reading, single-institution data) and how they are used after deployment (real workflow integration, variable imaging protocols, longitudinal patient context). Closing that gap requires prospective study designs and outcome metrics that most deployments do not currently collect.

    Primary sources: Li et al. 2025 bibliometric analysis, European Radiology. Lehman et al. 2011, JAMA. Rajpurkar et al. 2017 CheXNet, arXiv. McKinney et al. 2019, Nature. Ardila et al. 2019, Nature Medicine. Oakden-Rayner et al. 2021, Nature Machine Intelligence. FDA AI/ML medical device database.