Benchmarks – My Written Word

Red-Teaming LLM Applications: A Practitioner’s Framework

May 24, 2026

LLM red-teaming spans three distinct surfaces: model layer (jailbreaking), application layer (injection), and supply chain. Different attacks, different defenses, different responsible parties. Here is the methodology that covers…

OWASP LLM Top 10 for 2025: The Mechanism Behind Each Vulnerability

May 18, 2026

The OWASP LLM Top 10 for 2025 added System Prompt Leakage and Vector and Embedding Weaknesses, reworked Excessive Agency, and moved Sensitive Information Disclosure to second place. Here is the architectural…

Indirect Prompt Injection: The Attack That Hides in Your Data

May 18, 2026

Indirect prompt injection lets attackers hijack LLMs by hiding instructions in documents, web pages, and tool results the model processes. Here is why the architecture makes this unavoidable…

Julia Bazinska and the Science of Measurable AI Security

May 18, 2026

Julia Bazinska built the empirical tools that make LLM security measurable. From DeepMind RL to first-authoring b3, here is what her research at Lakera actually produced.

Gandalf the Red: What 279K Real Attacks Reveal About LLM Defense

May 18, 2026

Lakera’s ICML 2025 paper ran 279K crowdsourced attacks to show what synthetic red-teaming misses. The D-SEC finding: system prompts degrade user experience without blocking attackers. Here is the…

Vision-Language Models: Architecture and the Benchmark Gap

May 18, 2026

How CLIP, SigLIP, Q-Former, and MLP adapters work in vision-language models. Why Qwen2.5-VL compresses visual tokens 4x, and what current VLMs still cannot do.

AI in Radiology: Three Phases and What the Clinical Evidence Shows

May 10, 2026

Radiology AI has moved through three phases: rule-based CAD, the deep learning benchmark era, and clinical deployment validation. A 556-paper bibliometric analysis and a multicenter thymus CT validation…

LLMs Give Novice Biologists 4x Uplift on Dangerous Tasks

May 10, 2026

A 2026 study measured LLM access giving novice biologists a 4.16x accuracy boost on biosecurity-relevant tasks, including beating expert baselines. Here is the mechanism and what it means…

MiniMax M2.7 Optimized Its Own Training Harness 100 Times. Here Is the Loop.

May 5, 2026

MiniMax M2.7 ran an internal agent that modified its own training scaffold 100 times in a row without human input and gained 30% on internal evaluations. Here is…

KellyBench: 8 AI Models Bet the Premier League. All Lost Money.

May 5, 2026

General Reasoning put 8 frontier AI models through a full Premier League season with a 100k bankroll each. Every model lost money. The benchmark reveals three distinct failure…

DeepSeek V4’s Hybrid Attention Cuts KV Cache by 10x. Here’s the Architecture.

May 2, 2026

DeepSeek V4-Pro processes one million tokens using 10% of the KV cache V3.2 needed. The mechanism is Hybrid Attention: two complementary compressors interleaved across 61 layers. Here’s how…

Open-Weight LLM Rankings, April 2026: MMLU Is Saturated, Here’s What to Use Instead

April 26, 2026

MMLU is saturated. In April 2026, the metrics that matter are SWE-bench Verified, GPQA Diamond, and RULER’s effective context window. Chinese labs hold 4 of the top 5…

ARC-AGI-3 Is Live. Here’s Why Current Models Score in the Low Double Digits.

April 26, 2026

ARC-AGI-3 launched on Kaggle with a $1M prize and current leaders in low double digits. The benchmark adds Exploration, Modeling, and Planning that test-time compute scaling cannot solve.…

ICLR 2026 Outstanding Papers: What They Actually Found, and the Review Crisis Around Them

April 26, 2026

ICLR 2026 named two outstanding papers: LLMs Get Lost In Multi-Turn Conversation and Transformers are Inherently Succinct. The conference also documented a 45% identity leak and 21% AI-generated…

Why 86% of Enterprise AI Agent Pilots Never Reach Production

April 26, 2026

Multiple independent studies in 2026 put the enterprise AI agent pilot failure rate at 86-89%. Six failure modes account for the losses. Here’s what they are, what causes…

AI Coding Tools Quadrupled Critical Vulnerability Density. 216 Million Findings Prove It.

April 24, 2026

OX Security analyzed 216 million findings across 250 organizations. Critical vulnerability density grew 400% while alert volume grew 52%. The difference is directly correlated with AI coding tool…

MCP-SafetyBench at ICLR 2026: No LLM Agent Can Be Both Useful and Secure

April 24, 2026

MCP-SafetyBench at ICLR 2026 finds a negative correlation between defense success and task success across all 20 MCP attack types. No model achieves both. Here’s what the tradeoff…

GLM-5.1 Ran Autonomously for 8 Hours Across 6,000 Tool Calls. How It Beat Claude Opus 4.6 on SWE-Bench Pro and Lost on Verified.

April 13, 2026

Z.ai released GLM-5.1 open-source under MIT on April 7, 2026. The 744B-parameter MoE scored 58.4 on SWE-Bench Pro, beating Claude Opus 4.6 and GPT-5.4. It also ran 655…

One Developer Improved 15 LLMs at Coding by Changing the Edit Tool. Grok Went From 6.7% to 68.3%.

April 12, 2026

Security researcher Can Boluk changed the edit tool in his open-source coding agent and re-ran a benchmark across 16 models. Grok Code Fast 1 jumped from 6.7% to…

Gemini 3.1 Pro Cut Hallucinations 38 Points Without Learning Anything New. Its Accuracy Actually Went Down.

April 9, 2026

Google’s Gemini 3.1 Pro cut its hallucination rate on Artificial Analysis’s AA-Omniscience benchmark from 88 percent to 50 percent in three months, the largest single improvement ever measured…
