ARC-AGI-3 Drops Frontier AI Models Below 1%: The First Benchmark That Tests Whether AI Can Actually Learn


The ARC Prize Foundation launched ARC-AGI-3 on March 25, 2026, at Y Combinator in San Francisco. The results are blunt: Gemini 3.1 Pro, the top-scoring frontier model, hit 0.37%. GPT-5.4 and Claude Opus 4.6 scored between 0% and 0.37%. The best purpose-built agent managed 12.58%. Humans cleared the benchmark at 100%.

This is not a marginal gap. It is a structural failure. And it reveals something specific about what current AI systems cannot do, no matter how large they get.

Why ARC-AGI-3 Exists

The benchmark’s creator, François Chollet, has argued since 2019 that the AI field measures intelligence incorrectly. Popular benchmarks like MMLU, GPQA, and HumanEval reward a model’s ability to recall patterns from massive training corpora. They measure crystallized intelligence (what you know) rather than fluid intelligence (how quickly you learn new things from scratch).

ARC-AGI-1, released in 2019, tested this with static grid puzzles: show the model a few input-output examples, ask it to infer the transformation rule, and apply it to a new case. For five years, AI systems struggled. Then test-time training and large reasoning models pushed scores past 50% in 2024, and by early 2026, Gemini 3.1 Pro was scoring 98% on version 1.

But there was a problem. The ARC Prize team found evidence that frontier models had effectively memorized the benchmark. During verification, Gemini 3’s reasoning chain correctly referenced the integer-to-color mapping used in ARC tasks without being told what it was. The benchmark’s format was well-represented in the training data. Whether the contamination was incidental or intentional, the ARC team could not tell. Either way, versions 1 and 2 were approaching saturation.

ARC-AGI-3 is the response. It abandons static puzzles entirely.

How ARC-AGI-3 Works

Each ARC-AGI-3 environment is a turn-based game rendered on a 64×64 pixel grid with 16 colors. An agent is dropped in with zero instructions. No rules. No goals. No text descriptions. The agent sees a visual state, takes an action (arrow keys or pixel clicks depending on the game), observes the result, and must figure out everything from there: what it can do, what the win condition is, and how to reach it efficiently.
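The interaction contract can be sketched as a bare observe-act loop. This is a toy illustration, not the real ARC-AGI-3 API: the `GridEnv` class, its win condition, and its action handling are invented here to show what "zero instructions" means in code — the agent gets states and an action set, nothing else.

```python
import random

# Hypothetical minimal interface for a turn-based grid environment.
# GridEnv, reset, and step are illustrative names, not the ARC-AGI-3 API.
class GridEnv:
    """A toy 64x64 environment with 16 colors and arrow-key actions."""
    ACTIONS = ["up", "down", "left", "right"]

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # 64x64 grid of color indices 0-15; agent starts mid-grid
        self.grid = [[self.rng.randrange(16) for _ in range(64)] for _ in range(64)]
        self.pos = [32, 32]
        self.done = False
        return self.grid

    def step(self, action):
        # Move the agent; "winning" here is simply reaching the top row,
        # a stand-in for whatever hidden goal a real environment defines
        dy, dx = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}[action]
        self.pos[0] = max(0, min(63, self.pos[0] + dy))
        self.pos[1] = max(0, min(63, self.pos[1] + dx))
        self.done = self.pos[0] == 0
        return self.grid, self.done

# The agent receives no rules: it only sees states and picks actions.
env = GridEnv()
state = env.reset()
actions_taken = 0
while not env.done and actions_taken < 1000:
    action = random.choice(GridEnv.ACTIONS)  # a blind policy, for illustration
    state, done = env.step(action)
    actions_taken += 1
```

A blind policy like this is exactly what RHAE (described below) punishes: it may eventually win, but only after burning hundreds of uninformative actions.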

The benchmark includes hundreds of handcrafted environments, each containing 8 to 10 levels of increasing mechanical complexity. Each new level introduces rules while retaining earlier ones. An agent that memorized level 1 will face a different problem on level 3.

The design tests four capabilities that static benchmarks cannot measure. Exploration: can the agent efficiently gather information from the environment through its own choices? Modeling: can it build a working mental map of how the environment behaves? Goal-setting: can it infer the objective without being told? Planning: can it execute a multi-step strategy and course-correct when new information arrives?

All environments use only Core Knowledge priors: objectness, basic spatial reasoning, simple physics. No language. No cultural symbols. No domain expertise. If a typical eight-year-old can figure it out in minutes, it counts. This constraint isolates raw learning ability from accumulated knowledge.

The Scoring Mechanism: RHAE

ARC-AGI-3 introduces a metric called Relative Human Action Efficiency (RHAE). It does not simply ask whether the agent finished the task. It measures how many actions the agent needed compared to the second-best human performance on the same level.

This distinction matters. An agent that stumbles through a game with 500 random actions and eventually wins by luck gets a near-zero score. An agent that observes the environment, forms a hypothesis, tests it with a few targeted moves, and completes the level in 12 steps gets a high score. RHAE captures intelligence as efficiency: how much useful behavior does the agent extract per unit of information gathered?
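A plausible per-level scoring rule along these lines can be written in a few lines. The exact published RHAE formula is not given in this article, so the sketch below assumes a simple form: the human reference action count divided by the agent's, capped at 1.0, with unsolved levels scoring zero.

```python
def rhae(agent_actions, human_actions, solved):
    """Illustrative Relative Human Action Efficiency for one level.

    Assumed formula, not the official one: efficiency is the
    (second-best) human action count divided by the agent's,
    capped at 1.0; unsolved levels score 0.
    """
    if not solved or agent_actions <= 0:
        return 0.0
    return min(1.0, human_actions / agent_actions)

# A lucky 500-action stumble scores low; a targeted 12-action solve scores high.
reference = 12  # second-best human solve on this level
print(rhae(500, reference, solved=True))   # -> 0.024
print(rhae(12, reference, solved=True))    # -> 1.0
print(rhae(40, reference, solved=False))   # -> 0.0
```

Under any formula of this shape, random flailing is indistinguishable from failure, which is what separates RHAE from plain completion-rate metrics.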

This definition traces directly to Chollet’s 2019 paper “On the Measure of Intelligence,” which defined intelligence as skill-acquisition efficiency over a scope of tasks. ARC-AGI-3 is the first benchmark to operationalize that definition in an interactive setting.

The human baseline comes from 1,200+ players who completed 3,900+ game sessions during a 30-day preview period. Every environment included in the final benchmark was independently solved by at least two participants from the general public. Median human solve time: 7.4 minutes per session.

Why Frontier Models Fail

Large reasoning models like GPT-5.4 and Claude Opus 4.6 are powerful at reasoning within domains they were trained on. When the problem resembles something in the training distribution, they perform well, often better than humans on domain-specific benchmarks like GPQA Diamond or SWE-Bench Verified.

ARC-AGI-3 environments are genuinely novel. They are not variants of Atari games. They are not reskinned puzzles from existing training data. Each one was hand-designed to have no public analog. The agent cannot rely on pattern matching against prior experience because there is no prior experience.

The specific failure mode is instructive. LLMs process environments as sequences of visual observations, but they lack the ability to maintain persistent state across interaction steps. They struggle to form hypotheses and revise them based on feedback. Most critically, they cannot execute the exploratory loop that humans perform instinctively: try something, observe what happens, update your model, try something better.

The best-performing AI system in the 30-day preview, scoring 12.58%, was not a bare LLM. It was a purpose-built agent with graph-based state tracking, visual salience heuristics, and frontier-driven exploration. A team from the University of Helsinki published the method as a training-free approach that maintains a directed graph of explored states and action transitions. It outperformed every LLM significantly, not because it was smarter, but because it had the infrastructure to remember, plan, and explore systematically.

What This Means for AGI Claims

The AI industry spent 2025 and early 2026 declaring that models were approaching or exceeding human-level performance. GPT-5.4 scored 75% on OSWorld, above the 72.4% human baseline. Claude Opus 4.6 autonomously discovered adversarial attacks that outperformed every known method. Benchmarks kept falling.

ARC-AGI-3 resets the conversation. A gap of more than two orders of magnitude between human and frontier-model performance, on a task designed to be easy for eight-year-olds, is not a minor calibration issue. It points to a missing capability class: the ability to learn from interaction in real time, without training data, without pre-existing patterns to match against.

Chollet’s thesis has always been that current AI excels at crystallized intelligence (applying stored knowledge) but lacks fluid intelligence (acquiring new skills from scratch). ARC-AGI-3 is the strongest evidence yet that this distinction is real and measurable.

The benchmark also addresses the contamination problem directly. Because environments are interactive and turn-based, pre-training on static data provides zero advantage. The agent’s performance depends entirely on its ability to learn within the episode. There is nothing to memorize in advance.

Limitations and Open Questions

ARC-AGI-3 is not a perfect measure of intelligence. The environments are 2D grids with discrete actions. They do not test language understanding, social reasoning, tool use, or physical manipulation. An agent that masters ARC-AGI-3 has demonstrated rapid in-context learning but not anything approaching the full scope of human cognition.

The benchmark’s reliance on Core Knowledge priors also raises questions about cultural fairness. While the designers deliberately excluded language and cultural symbols, the assumption that “objectness” and “basic physics” are culturally neutral is debatable.

The 12.58% top score also shows that purpose-built systems can make partial progress. If the scoring metric (RHAE) correlates with meaningful capability improvements, then the gap may close faster than the raw numbers suggest. Test-time training methods that pushed ARC-AGI-1 from 20% to 53% in a single year could plausibly do something similar here, especially as agentic architectures mature.

The ARC Prize 2026 competition runs three parallel tracks with a $2 million prize pool. The ARC-AGI-3 track has milestone checkpoints on June 30 and September 30, with submissions closing November 2 and results announced December 4. All winning solutions must be open-sourced under MIT or CC0 licenses, and Kaggle evaluation runs without internet access, preventing API calls to external inference endpoints.

Where This Goes From Here

ARC-AGI-3 is the first benchmark that formally measures human versus AI learning efficiency rather than task completion. That distinction changes what progress looks like. Beating this benchmark does not require a bigger model or more training data. It requires a different kind of system: one that can explore, hypothesize, and adapt in real time.

The irony is that the capabilities ARC-AGI-3 tests (exploration, memory, hypothesis formation, goal inference) are precisely the ones the AI industry is trying to build with agentic systems. Companies investing billions in AI agents that can operate autonomously in enterprise environments should pay attention to this benchmark. If an agent cannot figure out how a simple grid game works without instructions, deploying it to manage complex software workflows is premature.

The ARC Prize Foundation calls ARC-AGI-3 the only unsaturated general agentic intelligence benchmark as of March 2026. Based on the results, it has plenty of runway left.

