ARC-AGI-3 Drops Frontier AI Models Below 1%: The First Benchmark That Tests Whether AI Can Actually Learn

AI Benchmarks — March 25, 2026

ARC-AGI-3 Drops Frontier Models Below 1%.
Humans Score 100%.

ARC-AGI-3 launched March 25 as the first interactive reasoning benchmark for AI agents. The best frontier LLMs scored under 1%. The best purpose-built agent scored 12.58%. Humans scored 100%. Here is how it works and what the gap means.

<1%

Frontier LLM Score

Best frontier models score below 1% on ARC-AGI-3. Interactive tasks expose the capability gap.

12.58%

Best Agent Score

Purpose-built agent architecture. Still about 87 percentage points behind the human baseline of 100%.

100%

Human Baseline

Human test-takers score 100%. The gap is not closing. Interactive learning is the key variable.

Live

Interactive Format

First benchmark requiring turn-by-turn interaction with an environment. Static grid puzzles no longer expose the gap.

Sources: ARC Prize Foundation; ARC-AGI-3 benchmark paper; leaderboard results; Chollet interview; March 2026.

The ARC Prize Foundation launched ARC-AGI-3 on March 25, 2026, the first interactive AI benchmark that tests whether systems can explore unfamiliar environments, infer goals, and solve problems without any instructions. Every frontier model tested scored below 1%: Gemini 3.1 Pro hit 0.37%, GPT-5.4 reached 0.26%, Claude Opus 4.6 managed 0.25%, and Grok-4.20 scored 0.00%. Humans solved 100% of environments with no prior training. The competition offers $2 million in prizes, with a $700,000 grand prize for the first agent to achieve human-level performance. All winning solutions must be open-sourced.

Two days before the launch, NVIDIA CEO Jensen Huang told Lex Fridman “I think we’ve achieved AGI.” ARC-AGI-3’s results arrived as a 99.63-percentage-point counterargument. The benchmark does not test knowledge, coding ability, or language comprehension. It tests whether AI systems can adapt to completely novel situations the way humans naturally do. On that metric, the gap is not closing. It is enormous.

What Changed From ARC-AGI-1 and ARC-AGI-2

ARC-AGI-1 (2019) and ARC-AGI-2 (2025) presented static grid puzzles: show a model input-output pairs, ask it to infer the transformation rule and produce the correct output for a new instance. Frontier models reached 90%+ on version 1 by 2025, largely through scaffolding techniques (wrapping models in test-time compute loops with verification). ARC-AGI-2 raised difficulty with compositional puzzles, but the format remained the same: observe patterns, produce outputs.

ARC-AGI-3 abandons static puzzles entirely. Each of the 135 environments is a turn-based interactive game built by an in-house game studio. The agent sees a visual state, takes an action, observes the result, and must figure out both what it is trying to do and how to do it. There are no instructions. No stated goals. No hints. No descriptions. The agent must explore, form hypotheses about the game’s rules, and execute a plan. This is the first major format change since Chollet introduced the original benchmark in 2019.
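The observe, act, observe cycle described above can be sketched in a few lines. This is only an illustration under stated assumptions: `ToyCorridor`, its three numbered actions, and the `explore` helper are hypothetical names invented here, not part of the ARC-AGI-3 environments or any official API. The point is that the agent starts with no knowledge of what the actions do or what counts as success, and has to learn both from what it observes.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyCorridor:
    """Hypothetical stand-in for an environment: hidden rules, hidden goal."""
    length: int = 6
    position: int = 0

    def observe(self) -> int:
        return self.position

    def step(self, action: int) -> tuple[int, bool]:
        # Hidden rules (the agent is never told these):
        # action 0 moves left, 1 moves right, 2 does nothing.
        delta = {0: -1, 1: 1, 2: 0}[action]
        self.position = max(0, min(self.length - 1, self.position + delta))
        solved = self.position == self.length - 1  # hidden goal: reach the far end
        return self.position, solved


def explore(env: ToyCorridor, actions: list[int], max_steps: int, seed: int = 0):
    """Act, observe, and record what each action did.

    Returns (transitions, steps_taken, solved).
    """
    rng = random.Random(seed)
    transitions: dict[tuple[int, int], int] = {}  # (state, action) -> next state
    state = env.observe()
    for step in range(1, max_steps + 1):
        # Prefer actions not yet tried in this state; otherwise pick at random.
        untried = [a for a in actions if (state, a) not in transitions]
        action = untried[0] if untried else rng.choice(actions)
        next_state, solved = env.step(action)
        transitions[(state, action)] = next_state
        state = next_state
        if solved:
            return transitions, step, True
    return transitions, max_steps, False


if __name__ == "__main__":
    env = ToyCorridor()
    transitions, steps, solved = explore(env, actions=[0, 1, 2], max_steps=200)
    print(f"solved={solved} after {steps} actions; learned {len(transitions)} transitions")
```

An explorer this naive wastes many actions on moves that turn out to do nothing, which is exactly the behavior an efficiency-based score penalizes.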

How the Scoring Works

ARC-AGI-3 uses Relative Human Action Efficiency (RHAE). The baseline is the second-best first-run human performance on each environment. If a human completes a level in 10 actions and an AI takes 100 actions, the AI does not score 10%. The formula squares the ratio: (human actions / AI actions) squared. So 10x more actions produces a 1% score, not 10%. The penalty for inefficiency is deliberately harsh. Wandering, backtracking, and guessing are punished quadratically.

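A hard cutoff stops AI agents at 5x the human action count. If a human takes 10 actions to complete a level, the AI is terminated after 50 actions. Combined with the squared ratio, that cutoff means the 100-action run in the example above would be stopped before finishing, and the lowest score a completed level can earn is (10 / 50) squared, or 4%. The cutoff prevents models from brute-forcing solutions through exhaustive exploration. The scoring system measures learning efficiency, not just task completion: can the agent figure out the rules and act on them with human-like economy of action?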
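The two rules above (squared ratio, 5x cutoff) are easy to check numerically. The sketch below follows the article's description only; it is not the official scoring code. Three details are assumptions made here and not stated in the text: an unfinished or cut-off level scores zero, the ratio is capped at 1 when the agent uses fewer actions than the human baseline, and the function names are invented for illustration. How per-level scores are aggregated into a leaderboard number is not described, so that step is left out.

```python
from typing import Optional

CUTOFF_MULTIPLIER = 5  # from the article: agents are stopped at 5x the human action count


def squared_ratio(human_actions: int, ai_actions: int) -> float:
    """The bare formula from the article: (human actions / AI actions) squared."""
    if human_actions <= 0 or ai_actions <= 0:
        raise ValueError("action counts must be positive")
    return (human_actions / ai_actions) ** 2


def action_budget(human_actions: int) -> int:
    """Maximum actions an agent may take before it is terminated."""
    if human_actions <= 0:
        raise ValueError("human_actions must be positive")
    return CUTOFF_MULTIPLIER * human_actions


def level_score(human_actions: int, ai_actions: Optional[int]) -> float:
    """Per-level score in [0, 1] with the cutoff applied.

    Pass ai_actions=None when the agent never completed the level.
    """
    budget = action_budget(human_actions)
    if ai_actions is None or ai_actions > budget:
        return 0.0  # assumption: an unfinished or cut-off level earns nothing
    # Assumption: no bonus for beating the human baseline, so cap the ratio at 1.
    return min(1.0, squared_ratio(human_actions, ai_actions))


if __name__ == "__main__":
    print(squared_ratio(10, 100))   # the article's 10-vs-100 example, formula only
    print(level_score(10, 20))      # twice the human action count
    print(level_score(10, 100))     # over the 50-action budget
```

Keeping the bare formula separate from the cutoff makes it clear which part of a low score comes from inefficiency and which comes from being stopped early.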

Why Frontier LLMs Failed This Badly

The sub-1% scores are not a function of perception. A Duke University team built a custom harness for Claude Opus 4.6 that scored 97.1% on a single known environment variant (TR87). The same model scored 0% on unfamiliar environments. This demonstrates that the bottleneck is not visual processing or API format comprehension. Claude can see the game state clearly. It cannot generalize strategies to environments it has not been specifically engineered to handle.

The interactive format exposes a limitation that static benchmarks never tested: sustained sequential reasoning across hundreds of steps, state tracking over long horizons, and learning from environmental feedback in real time. Language models are trained to produce the most likely next token given a context. ARC-AGI-3 requires forming a model of an unknown dynamic system, testing hypotheses through action, and revising understanding based on results. That capability does not emerge from scale alone.

The 12.58% That Matters More Than 0.37%

During the 30-day developer preview, the best-performing system scored 12.58%. It was not a frontier LLM. It was a simpler RL and graph-search approach built by Tufa Labs. That score outperforms every frontier model by more than 30x. The implication is direct: the path to solving ARC-AGI-3 runs through algorithmic innovation in sequential decision-making under uncertainty, not through scaling language models. Classical AI techniques (reinforcement learning, search, planning) outperform the most expensive models in the world on tasks that require genuine adaptation.
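To make "graph search" concrete, here is a minimal breadth-first search over environment states. It is a generic textbook sketch, not a reconstruction of the Tufa Labs system, whose details are not given here. The 4x4 grid, the wall positions, the action numbering, and the names `toy_step` and `bfs_plan` are all hypothetical. The sketch also assumes the environment is deterministic and that the planner can query transitions freely, which a real agent would first have to learn through interaction.

```python
from collections import deque
from typing import Callable, Hashable, Optional, Sequence

State = Hashable

# Hypothetical toy environment: a 4x4 grid with a wall under most of the top row.
SIZE = 4
WALLS = {(1, 0), (1, 1), (1, 2)}
GOAL = (3, 3)
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right


def toy_step(state: tuple[int, int], action: int) -> tuple[int, int]:
    """Deterministic transition: blocked moves leave the state unchanged."""
    row, col = state
    d_row, d_col = MOVES[action]
    nxt = (row + d_row, col + d_col)
    in_bounds = 0 <= nxt[0] < SIZE and 0 <= nxt[1] < SIZE
    return nxt if in_bounds and nxt not in WALLS else state


def bfs_plan(
    start: State,
    actions: Sequence[int],
    step: Callable[[State, int], State],
    is_solved: Callable[[State], bool],
    max_expansions: int = 10_000,
) -> Optional[list[int]]:
    """Return a shortest action sequence from start to a solved state, or None."""
    frontier = deque([start])
    parent: dict = {start: None}  # state -> (previous state, action taken)
    expansions = 0
    while frontier and expansions < max_expansions:
        state = frontier.popleft()
        expansions += 1
        if is_solved(state):
            plan = []
            while parent[state] is not None:  # walk back to the start
                state, action = parent[state]
                plan.append(action)
            return plan[::-1]
        for action in actions:
            nxt = step(state, action)
            if nxt not in parent:  # first visit is the shortest route in BFS
                parent[nxt] = (state, action)
                frontier.append(nxt)
    return None


plan = bfs_plan((0, 0), [0, 1, 2, 3], toy_step, lambda s: s == GOAL)
print(plan)  # → [3, 3, 3, 1, 1, 1]
```

Because breadth-first search returns a shortest plan, executing it costs the minimum number of actions once the state graph is known. The hard part on a benchmark like this is building that graph cheaply, since every real action spent exploring counts against the score.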

This finding aligns with what researchers have observed in agentic AI more broadly: the best results often come from hybrid approaches that combine LLM reasoning with structured search and planning, rather than from end-to-end LLM generation. ARC-AGI-3 provides the first quantitative benchmark for measuring this gap at scale.

What ARC-AGI-3 Does and Does Not Measure

Honest Benchmark Assessment

What it measures well: Fluid intelligence, adaptive reasoning, goal inference, hypothesis formation, and learning efficiency in novel environments. These are genuine components of general intelligence that static benchmarks cannot test.

What it does not measure: Language understanding, world knowledge, coding ability, mathematical reasoning, social intelligence, or any capability that relies on training data. ARC-AGI-3 is deliberately narrow. Scoring 0.25% on ARC-AGI-3 does not mean Claude Opus 4.6 is only 0.25% intelligent.

The moving goalpost critique: ARC-AGI-1 got saturated, so they built ARC-AGI-2. ARC-AGI-2 is getting solved, so they built ARC-AGI-3. If the bar moves every time AI approaches it, the benchmark never declares AGI achieved. That is either rigorous methodology (the previous version stopped measuring anything useful) or a self-perpetuating irrelevance machine, depending on your view.

The harness gap: The official leaderboard bans custom-built harnesses. The community leaderboard allows them. Symbolica AI’s multi-agent harness solved all three public preview environments. Whether “general intelligence” should exclude human-engineered scaffolding is a philosophical question the benchmark embeds as an assumption.

What This Means for the AGI Timeline

OpenAI, Google DeepMind, Anthropic, and xAI all report ARC scores on their model cards. None of them is close on ARC-AGI-3. The benchmark’s competition runs through December 2026 with milestone checkpoints in June and September. Whether any team reaches 50% by year-end is genuinely uncertain. The competition requires open-source solutions with no external API calls during evaluation, meaning entrants cannot call hosted frontier models at evaluation time.

Huang’s “AGI is here” and ARC-AGI-3’s 0.37% coexist because they measure fundamentally different things. Huang means AI can perform most economically valuable tasks better than most humans most of the time, which is defensible. ARC-AGI-3 measures adaptive reasoning in environments where training data provides zero advantage, where models must learn from scratch through interaction. On that metric, the gap is 99.63 percentage points wide. The question of whether AGI has arrived depends entirely on which definition you use. ARC-AGI-3 makes the definitional choice explicit and measurable.

Sources: ARC-AGI-3 Technical Report, ARC Prize Foundation, March 25, 2026; ARC Prize 2025 Results Analysis; Decrypt analysis; The Decoder coverage; DEV Community technical breakdown; ARC Prize 2026 Kaggle competition page.

Francois Chollet created the original ARC in 2019 alongside his paper “On the Measure of Intelligence,” which argued that intelligence should be measured as skill-acquisition efficiency rather than task-specific performance. Seven years later, ARC-AGI-3 is the most complete implementation of that philosophy: a benchmark where the only way to score well is to learn quickly from scratch. The $2 million prize pool, the open-source requirement, and the Kaggle infrastructure mean that the solutions will be public and reproducible. If someone cracks ARC-AGI-3, the entire research community will know exactly how. That transparency is the benchmark’s most underappreciated feature.
