Santiago Maniches

Open-Weight LLM Rankings, April 2026: MMLU Is Saturated, Here’s What to Use Instead

April 26, 2026

MMLU is saturated. In April 2026, the metrics that matter are SWE-bench Verified, GPQA Diamond, and RULER’s effective context window. Chinese labs hold 4 of the top 5…

ARC-AGI-3 Is Live. Here’s Why Current Models Score in the Low Double Digits.

April 26, 2026

ARC-AGI-3 launched on Kaggle with a $1M prize, and current leaders score in the low double digits. The benchmark adds Exploration, Modeling, and Planning that test-time compute scaling cannot solve.…

ICLR 2026 Outstanding Papers: What They Actually Found, and the Review Crisis Around Them

April 26, 2026

ICLR 2026 named two outstanding papers: LLMs Get Lost In Multi-Turn Conversation and Transformers are Inherently Succinct. The conference also documented a 45% identity leak and 21% AI-generated…

Agent Memory Architecture: Four Patterns, Four Tradeoffs

April 26, 2026

Agent memory is not one thing. It is four distinct patterns: full context window, hierarchical summarization, external vector store, and episodic log. Each has different performance, cost, failure…

OpenAI Codex at 3 Million Users: How It Differs from Claude Code

April 26, 2026

Codex has 3M weekly users. Claude Code runs in your terminal. The architectural difference between cloud loop and local execution determines which tasks each tool handles well —…

Why 86% of Enterprise AI Agent Pilots Never Reach Production

April 26, 2026

Multiple independent studies in 2026 put the enterprise AI agent pilot failure rate at 86-89%. Six failure modes account for the losses. Here’s what they are, what causes…

Amazon Bedrock AgentCore: What Each Layer Does and Why It Matters

April 26, 2026

Amazon Bedrock AgentCore is six infrastructure services in one name. Here’s what each layer does: Runtime for serverless execution, Memory’s four tiers, Tool Execution’s sandboxing, Action Gateway’s enterprise…

Google Cloud Next 2026: The Agent Infrastructure Stack Explained

April 26, 2026

Google Cloud Next 2026 announced N4A Axion CPU instances for agent orchestration, GKE Agent Sandbox with gVisor isolation, and native A2A support in ADK. Here’s what each layer…

Know Your Agent: The First Regulated AI Agent Governance Standard

April 26, 2026

MetaComp’s StableX KYA Framework, published April 21, 2026, is the first governance standard for AI agents from a licensed financial institution. Here’s what its four pillars cover, how…

Half of Organizations Have No Visibility Into AI Agent Traffic

April 26, 2026

Salt Security’s H1 2026 report: 48.9% of organizations have zero visibility into AI agent traffic. WAFs were built for humans. Here’s why that gap exists structurally, what the…

Why OpenAI’s Agent Runtime Lives on AWS, Not Azure

April 26, 2026

OpenAI’s stateful runtime runs on AWS, not Azure. That’s not a partnership detail: it’s a contract clause. Here’s the stateless-vs-stateful architectural split, why production agents break on stateless…

A2A Protocol v1.0: The Agent Communication Layer MCP Doesn’t Cover

April 26, 2026

A2A Protocol v1.0 introduced Signed Agent Cards and gRPC support. Here’s how agent-to-agent communication differs from MCP tool calls, why IBM merged ACP into A2A, and what the…

SmolVM: Firecracker-Backed MicroVM Sandbox for AI Agent Code Execution

April 26, 2026

SmolVM gives AI agents a hardware-isolated disposable VM using Firecracker. Here’s why Docker containers are the wrong sandbox for LLM-generated code, how the snapshot-fork pattern works, and how…

AI Coding Tools Quadrupled Critical Vulnerability Density. 216 Million Findings Prove It.

April 24, 2026

OX Security analyzed 216 million findings across 250 organizations. Critical vulnerability density grew 400% while alert volume grew 52%. The difference is directly correlated with AI coding tool…

5 of 7 Major MCP Clients Don’t Validate Tool Metadata. Here’s the Gap.

April 24, 2026

5 of 7 major MCP clients tested skip static validation of tool metadata entirely. A March 2026 arXiv paper is the first systematic evaluation of MCP client-side security,…

MCP-SafetyBench at ICLR 2026: No LLM Agent Can Be Both Useful and Secure

April 24, 2026

MCP-SafetyBench at ICLR 2026 finds a negative correlation between defense success and task success across all 20 MCP attack types. No model achieves both. Here’s what the tradeoff…

Bitwarden CLI Was a Supply Chain Bomb. Checkmarx Lit the Fuse.

April 24, 2026

The Checkmarx supply chain breach reached Bitwarden’s CLI in 93 minutes on April 22. Here’s how bw1.js stole CI/CD secrets and why security-tool supply chains fail in the…

LMDeploy CVE-2026-33626: SSRF Weaponized in 13 Hours

April 24, 2026

LMDeploy SSRF bug CVE-2026-33626 was exploited 13 hours post-disclosure. Full attack chain, AWS credential blast radius, and why AI inference servers are unusually dangerous SSRF targets.

Full Context Sets the Accuracy Ceiling for AI Agent Memory. It Costs 26,000 Tokens Per Query. Here Is the Tradeoff Map.

April 21, 2026

Full context memory sets the accuracy ceiling at a cost of 26,000 tokens per query. Vector-only memory scores 66.9% at 1.44s p95 latency. Graph memory reaches 68.4% at…

98.4% of Claude Code Is Operational Infrastructure. A New arXiv Paper Maps All of It.

April 21, 2026

A source-code analysis of Claude Code’s 512,000-line TypeScript codebase finds 98.4% is operational infrastructure, not AI. Here is the five-layer compaction pipeline, the 17% comprehension decline finding, the…
