ARC-AGI-3 Is Live. Here’s Why Current Models Score in the Low Double Digits.


ARC-AGI-3 launched on Kaggle in April 2026 with a $1 million grand prize for the first submission that scores 100% on the evaluation. No team has come close. The current milestone leaders are scoring in the low double digits, on a benchmark whose predecessors were once considered the hard part. That gap is not a failure of the competitors. It is the benchmark doing what François Chollet designed it to do: resist the techniques that solved prior versions.

Understanding what ARC-AGI-3 is actually testing requires a precise account of what its predecessors tested, what the winning solutions did, and why Chollet and the ARC Prize team concluded that those solutions, however impressive, were not measuring what they set out to measure. The resulting redesign changes the task at a fundamental level.

What ARC-AGI-1 and ARC-AGI-2 Were Testing

The original ARC benchmark, published by Chollet in 2019, presented a simple surface structure: small grid patterns with input-output examples, and a test input requiring the solution to identify the underlying transformation and apply it to produce the correct output. The grids are small (typically under 30×30), the transformations are human-intuitive (rotations, color substitutions, pattern completions, reflections), and the correct answer can be verified in milliseconds.
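To make the task format concrete, here is a minimal sketch of a static ARC-style puzzle in Python. The dictionary layout, the toy grids, and the helper names (`flip_horizontal`, `fits_training`, `verify`) are illustrative assumptions, not the official dataset schema or tooling.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]  # each cell holds a color index

def flip_horizontal(grid: Grid) -> Grid:
    """Candidate transformation: mirror every row left to right."""
    return [list(reversed(row)) for row in grid]

# Hypothetical task: the training pairs demonstrate the rule,
# and the test pair checks whether the solver applied it correctly.
task: Dict = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
    ],
    "test": {"input": [[0, 5], [5, 5]], "output": [[5, 0], [5, 5]]},
}

def fits_training(rule: Callable[[Grid], Grid], task: Dict) -> bool:
    """True if the rule reproduces every training output."""
    return all(rule(p["input"]) == p["output"] for p in task["train"])

def verify(rule: Callable[[Grid], Grid], task: Dict) -> bool:
    """Verification is a plain equality check on the test grid."""
    return rule(task["test"]["input"]) == task["test"]["output"]

print(fits_training(flip_horizontal, task), verify(flip_horizontal, task))
```

Everything the solver needs is in the `task` object up front, and checking an answer is a single grid comparison. Both properties matter for what follows.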

For the first several years, this benchmark resisted automated AI solutions in ways that felt meaningful. GPT-3 scored near zero. GPT-4 scored low single digits. Claude 2 and Gemini 1.0 were similarly limited. The benchmark appeared to measure genuine fluid reasoning rather than pattern matching against training data.

ARC-AGI-2, launched in 2024 as a harder version, produced similar resistance initially. Then OpenAI's o1 model family and its reasoning-chain descendants began cracking it. By late 2025, leading solutions on ARC-AGI-2 were scoring above 60% using test-time compute scaling: running many reasoning attempts per puzzle and selecting the most consistent output. The winning ARC Prize 2025 solution scored 87.5% on the public leaderboard.

Chollet’s analysis was direct. The solutions that achieved high scores on ARC-AGI-2 were not solving the reasoning problem the benchmark was designed to measure. They were exploiting test-time compute scaling, program synthesis with extended search, and in some cases training on augmented datasets that included ARC-style transformations. The 87.5% score looked like a success on the benchmark while representing, in Chollet’s framing, a failure of the benchmark to measure what it claimed to measure.

How ARC-AGI-3 Changes the Task Structure

ARC-AGI-3 adds three required capabilities that the prior versions did not test: Exploration, Modeling, and Planning and Execution.

Exploration means the agent must actively gather information by interacting with an environment rather than receiving all relevant information passively in the prompt. An ARC-AGI-1 puzzle presents everything the solver needs: the input-output examples, the test input, nothing hidden. An ARC-AGI-3 puzzle may require the agent to probe the environment, observe the results of its actions, and build understanding of the transformation rules through interaction before attempting to produce the answer. The information is not given. It must be discovered.

Modeling is the ability to build a world model that represents how the environment works and can predict the results of unseen actions. An agent that genuinely understands a transformation should be able to predict what the output would be for an input it has never seen, not by pattern matching against examples but by having internalized the generative rule. ARC-AGI-3 tasks probe this capability by testing the agent’s predictions on novel inputs after it has explored a limited number of examples. Surface-level pattern extraction produces wrong predictions. Genuine rule induction produces correct ones.
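The difference between fitting the examples and inducing the rule can be sketched with a toy hypothesis set. The four candidate rules and the grids below are made up for illustration; real ARC-AGI-3 tasks are not limited to such a menu.

```python
# Hypothetical menu of candidate rules over small integer grids.
RULES = {
    "identity": lambda g: [row[:] for row in g],
    "flip_h": lambda g: [row[::-1] for row in g],
    "flip_v": lambda g: [row[:] for row in g[::-1]],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}

def consistent_rules(observations):
    """Names of all rules that reproduce every observed (input, output) pair."""
    return [
        name for name, rule in RULES.items()
        if all(rule(inp) == out for inp, out in observations)
    ]

def predictions(rule_names, novel_input):
    """What each surviving rule predicts for an unseen input."""
    return {name: RULES[name](novel_input) for name in rule_names}

symmetric = [[1, 1], [1, 1]]
asymmetric = [[1, 2], [3, 4]]

# A symmetric grid is explained by every rule, so it rules nothing out.
after_one = consistent_rules([(symmetric, symmetric)])

# An asymmetric observation separates the hypotheses.
after_two = consistent_rules([
    (symmetric, symmetric),
    (asymmetric, [[2, 1], [4, 3]]),
])

print(after_one)   # all four rules still fit
print(after_two)   # only flip_h survives
print(predictions(after_one, asymmetric))  # the four rules disagree here
```

An agent that stops after the first observation can match the evidence perfectly and still predict the wrong output on a novel input. Which observations it collects determines whether the rule is actually pinned down.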

Planning and Execution requires the agent to devise a multi-step action path from the current state to a target state and execute that plan with the ability to adjust when the environment responds unexpectedly. This is the capability that makes ARC-AGI-3 closer to real-world problem solving than its predecessors: in real settings, solutions unfold over time, require iterative correction, and depend on feedback from the environment rather than being computed once from a static input.
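A small sketch of plan, execute, and adjust, assuming a toy grid world invented for this example: the agent plans with breadth-first search over the walls it currently believes in, and replans whenever a move is blocked by a wall it did not know about. The names `plan` and `run` are hypothetical.

```python
from collections import deque

MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def plan(start, goal, walls, size):
    """Shortest list of move names under the believed walls, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        pos, path = queue.popleft()
        if pos == goal:
            return path
        for name, (dr, dc) in MOVES.items():
            nxt = (pos[0] + dr, pos[1] + dc)
            in_bounds = 0 <= nxt[0] < size and 0 <= nxt[1] < size
            if in_bounds and nxt not in walls and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [name]))
    return None

def run(start, goal, true_walls, size):
    """Execute the plan step by step, replanning when the world disagrees."""
    believed = set()          # the agent starts with no known walls
    pos, replans = start, 0
    path = plan(pos, goal, believed, size)
    while path is not None and pos != goal:
        dr, dc = MOVES[path[0]]
        nxt = (pos[0] + dr, pos[1] + dc)
        if nxt in true_walls:
            believed.add(nxt)  # unexpected feedback: update the model
            replans += 1
            path = plan(pos, goal, believed, size)
        else:
            pos, path = nxt, path[1:]
    return pos, replans, believed

final, replans, believed = run((0, 0), (0, 2), {(0, 1)}, size=3)
print(final, replans, believed)
```

In this run the first plan goes straight through the hidden wall at `(0, 1)`. The agent discovers the wall on its first move, records it, and routes around it.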

Why Test-Time Compute Scaling Cannot Solve ARC-AGI-3

The technique that broke ARC-AGI-2 was extended search: generate many candidate outputs using a reasoning model, score their consistency, and select the most frequent or highest-confidence answer. This approach works when all information needed to solve the problem is present in the static prompt and when the scoring function can evaluate candidate correctness reliably.
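The selection step can be sketched as a simple majority vote. The candidate grids below are hard-coded stand-ins for model samples, and `select_by_consistency` is a hypothetical helper, not any team's published pipeline.

```python
from collections import Counter

def grid_key(grid):
    """Convert a nested list into a hashable key so grids can be counted."""
    return tuple(tuple(row) for row in grid)

def select_by_consistency(candidates):
    """Return the most frequent candidate grid and its share of the votes."""
    counts = Counter(grid_key(g) for g in candidates)
    winner, votes = counts.most_common(1)[0]
    return [list(row) for row in winner], votes / len(candidates)

# Stand-ins for five independent reasoning attempts on one puzzle.
samples = [
    [[0, 1], [0, 2]],
    [[0, 1], [0, 2]],
    [[1, 0], [2, 0]],
    [[0, 1], [0, 2]],
    [[0, 1], [2, 0]],
]

answer, agreement = select_by_consistency(samples)
print(answer, agreement)
```

The vote measures agreement among samples and nothing else. It has no access to ground truth, so it helps only when the samples were generated from enough information to be right more often than wrong.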

ARC-AGI-3 breaks this approach in two ways. First, the exploration requirement means information is not present in the initial prompt. An agent that generates many candidate outputs based on incomplete information will produce many confident wrong answers. The search budget used for scaling test-time compute gets consumed exploring a hypothesis space built on insufficient information, and the most consistent answer in that space is often a consistent wrong answer.

Second, the multi-step execution requirement means the agent must commit to and execute actions in a sequence, observing feedback between steps. A search-over-outputs approach that generates complete solutions from scratch cannot incorporate the feedback from partial execution. The agent needs to act, observe, update its model, and act again, which requires a fundamentally different architecture than token generation with extended sampling.
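The contrast between computing a full solution once and acting with feedback can be shown on a deliberately tiny environment. `HiddenStepEnv` and both agents are assumptions made up for this sketch; the environment hides how far each action moves a counter.

```python
class HiddenStepEnv:
    """Toy environment: each action moves the state by a hidden step size."""

    def __init__(self, step, target):
        self._step = step      # hidden from the agent
        self.state = 0
        self.target = target

    def act(self, action):
        """Apply action (+1 or -1) and return the observed new state."""
        self.state += action * self._step
        return self.state

    def solved(self):
        return self.state == self.target

def one_shot(env, assumed_step=1):
    """Commit to a full action sequence up front, with no feedback."""
    for _ in range(env.target // assumed_step):
        env.act(+1)
    return env.solved()

def interactive(env, max_steps=50):
    """Act once, observe the effect, update the model, then keep acting."""
    before = env.state
    after = env.act(+1)
    step = after - before      # learned from observation, not assumed
    for _ in range(max_steps):
        if env.solved():
            return True
        toward_target = (env.target - env.state) * step > 0
        env.act(+1 if toward_target else -1)
    return env.solved()

print(one_shot(HiddenStepEnv(step=3, target=12)))     # overshoots
print(interactive(HiddenStepEnv(step=3, target=12)))  # reaches the target
```

Sampling the one-shot agent many times would not help, because every sample shares the same wrong assumption about the step size. The interactive agent needs only one observation to correct it.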

Program synthesis approaches, another technique that performed well on ARC-AGI-2, face similar limitations. Synthesizing a program that maps input to output works when the transformation rule is fully specified by the examples. When the agent must explore to discover the transformation rule, the program synthesis search space is not well-defined until exploration is complete. The interaction between exploration and synthesis is the hard part, and current synthesis approaches do not handle it well.
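For reference, a minimal sketch of enumerative program synthesis over a tiny, made-up DSL of grid primitives. The primitive set and the `synthesize` function are illustrative assumptions and far simpler than competitive ARC solvers.

```python
from itertools import product

# Hypothetical DSL: four grid primitives that can be chained.
PRIMITIVES = {
    "identity": lambda g: [row[:] for row in g],
    "flip_h": lambda g: [row[::-1] for row in g],
    "flip_v": lambda g: [row[:] for row in g[::-1]],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}

def run_program(names, grid):
    """Apply the named primitives in order."""
    for name in names:
        grid = PRIMITIVES[name](grid)
    return grid

def synthesize(examples, max_len=2):
    """Return the first primitive sequence that fits every example, or None."""
    for length in range(1, max_len + 1):
        for names in product(PRIMITIVES, repeat=length):
            if all(run_program(names, inp) == out for inp, out in examples):
                return names
    return None

# The examples encode a 90-degree clockwise rotation.
examples = [
    ([[1, 2], [3, 4]], [[3, 1], [4, 2]]),
    ([[1, 2, 3]], [[1], [2], [3]]),
]
program = synthesize(examples)
print(program, run_program(program, [[5, 6], [7, 8]]))
```

The search is only as good as `examples`. In a static puzzle that list is handed over complete; in an interactive task the agent has to decide which examples to collect before the search has anything to check candidates against.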

What the Current Leaderboard Shows

As of April 2026, the Milestone 1 deadline is June 30, 2026, with prizes for the top three scores at that point ($25K, $10K, $2.5K). Published solutions are scoring in the low double digits on the evaluation. The top public solutions combine reasoning-chain models for the Modeling component with shallow exploration strategies, probing the environment through random or grid-search action sequences rather than adaptive, model-guided exploration.

The architectures that have outperformed these baselines in early experimentation share one property: they use separate modules for exploration policy and world model construction rather than asking a single language model to perform both functions in its context window. An exploration policy that selects actions to maximize information gain about the transformation rule, feeding observations to a world model that maintains and updates a structured representation of the rule, outperforms a monolithic language model attempting to track all of this in a single generation. This modular architecture connects to the research on agent memory design, where external structured state consistently outperforms in-context memory for complex long-horizon tasks.
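A minimal sketch of that split, on a toy problem invented for this example: the hidden rule is a number from 0 to 7, and each probe asks whether it is at least some threshold. `WorldModel` holds the structured state, `choose_informative` is the exploration policy, and `make_sweep` is a grid-search baseline. None of these names come from a published solution.

```python
import math
from collections import Counter

def predict(hidden, probe):
    """What the environment would report for this probe under a hypothesis."""
    return hidden >= probe

class WorldModel:
    """External structured state: the hypotheses not yet ruled out."""

    def __init__(self, candidates):
        self.candidates = list(candidates)

    def update(self, probe, observation):
        self.candidates = [
            h for h in self.candidates if predict(h, probe) == observation
        ]

def info_gain(probe, candidates):
    """Entropy in bits of the predicted outcome, with hypotheses equally likely."""
    counts = Counter(predict(h, probe) for h in candidates)
    total = len(candidates)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def choose_informative(probes, candidates):
    """Exploration policy: pick the probe with the highest expected gain."""
    return max(probes, key=lambda p: info_gain(p, candidates))

def make_sweep(probes):
    """Baseline policy: try probes in fixed order, ignoring the model."""
    order = iter(probes)
    return lambda _probes, _candidates: next(order)

def explore(true_hidden, probes, choose):
    model = WorldModel(range(8))
    used = 0
    while len(model.candidates) > 1:
        probe = choose(probes, model.candidates)
        model.update(probe, predict(true_hidden, probe))
        used += 1
    return model.candidates[0], used

probes = list(range(1, 8))
print(explore(6, probes, choose_informative))
print(explore(6, probes, make_sweep(probes)))
```

Both policies identify the hidden value, but the model-guided one needs fewer probes because each action is chosen to split the remaining hypotheses as evenly as possible. With deterministic hypotheses and a uniform belief, the entropy of the predicted outcome equals the expected information gain, which is why `info_gain` can be that simple here.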

What ARC-AGI-3 Reveals About Current Models

The low scores on ARC-AGI-3 are informative about specific capability gaps in current frontier models. The exploration failure mode is the most instructive. Models with strong performance on static reasoning tasks, including Claude Opus 4.6 and GPT-5.4, produce significantly worse results on exploration-required ARC-AGI-3 tasks even when the underlying transformation would be simple to identify given sufficient exploration data. The models can reason about the transformation once they have the data. They cannot efficiently gather the data through interactive exploration.

This gap has a direct analogue in production agent failure research. The context collapse failure mode that accounts for 31% of enterprise agent pilot failures is partly a manifestation of the same limitation: the agent’s model of the task degrades as the task unfolds, and it lacks the adaptive information-gathering behavior needed to maintain an accurate working model over time. ARC-AGI-3 benchmarks this limitation in a controlled, measurable environment. The ICLR 2026 outstanding paper on LLMs getting lost in multi-turn conversation measures the same underlying issue from a conversational angle.

The Grand Prize and the Timeline

The $1 million grand prize goes to the first team that scores 100% on the ARC-AGI-3 evaluation and open-sources their solution. The prize structure includes interim milestones at June 30 and September 30, 2026, with $25,000 for the top milestone scorer. The intent is to incentivize open publication of partial progress rather than waiting for a complete solution before disclosure.

Chollet has been explicit that he does not expect ARC-AGI-3 to be solved quickly. ARC-AGI-1 took several years before solutions began exceeding 60%. ARC-AGI-2 took roughly two years before test-time compute scaling pushed scores above that threshold. ARC-AGI-3 targets capabilities that current architectures lack at a more fundamental level than the prior versions. The Exploration, Modeling, and Planning capabilities it requires are areas of active architectural research rather than capabilities that can be unlocked through better prompting or more compute at inference time.

The competition is live on Kaggle with a public dataset and a private evaluation set. The gap between 87.5% on ARC-AGI-2 and low double digits on ARC-AGI-3 is the gap between what current models can do with extended search and what genuine adaptive reasoning requires. That gap is where the interesting research is being done in 2026.
