The Darwin Gödel Machine Rewrites Its Own Code to Get Better at Coding. Here Is What That Actually Means.

SWE-bench: 20% to 50%. Polyglot: 14.2% to 30.7%. What changes: agent code. What stays fixed: the LLM.

Sakana AI, the University of British Columbia, and the Vector Institute presented a paper at ICLR 2026 describing the Darwin Gödel Machine (DGM), an AI system that rewrites its own source code to become better at programming tasks. On SWE-bench, a benchmark requiring agents to resolve real-world GitHub issues, DGM improved its own score from 20.0% to 50.0%. On Polyglot, a multi-language coding benchmark, it jumped from 14.2% to 30.7%. These are real performance gains produced by automated self-modification. They are not what the headline “self-improving AI” implies.

What the System Actually Modifies

The DGM does not modify the underlying foundation model. It does not rewrite neural network weights, and it does not retrain itself. The system modifies its own Python codebase: the tools, workflows, prompts, and control logic that surround a frozen pretrained language model. The foundation model (Claude 3.5 Sonnet in the primary experiments) stays exactly the same throughout the entire process. The “self” in “self-improving” is the agent scaffolding, not the neural network.

This distinction matters. A system that can rewrite its own scaffolding code to become better at coding tasks is interesting and useful. A system that can rewrite its own neural architecture to become smarter at everything is something else entirely. The DGM is the former, not the latter. The paper’s authors are clear about this. Their framework “envisions agents that can rewrite their own training scripts (including training a new foundation model),” but they explicitly state that retraining models is computationally intensive and left as future work.

How the Self-Modification Loop Works

The DGM alternates between two phases: self-modification and evaluation. During self-modification, the system reads its own Python codebase and proposes changes. Examples include adding a new tool (like a patch validator), improving file viewing capabilities, building better editing commands, implementing a system that generates multiple solutions and ranks them, or adding a memory of what has been tried before and why it failed.

During evaluation, the modified agent is tested on coding benchmarks. If the modified version scores better, it gets added to an archive of agents. If it scores worse, it may still be kept if it represents an interesting variation that could lead to future improvements. This is the “Darwin” part: inspired by biological evolution, the system maintains a growing population of diverse agents rather than keeping only the single best performer.
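To make the two-phase loop concrete, here is a minimal toy sketch in Python. It is not the paper's implementation: the `Agent` record, `make_toy_modify`, and `run_loop` are hypothetical names, the "self-modification" is simulated by adding random noise to a score instead of editing code, and every child is archived, which simplifies whatever admission rule the real system uses.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Agent:
    """Hypothetical record for one version of the agent's codebase."""
    agent_id: int
    parent_id: Optional[int]  # None for the initial agent
    score: float              # fraction of benchmark tasks solved, 0.0 to 1.0


def make_toy_modify(rng: random.Random) -> Callable[[Agent, int], Agent]:
    """Stand-in for 'rewrite own code, then run the benchmark'.

    A real system would edit Python files and call a language model.
    Here the child's score is the parent's score plus random noise.
    """
    def modify(parent: Agent, new_id: int) -> Agent:
        delta = rng.uniform(-0.05, 0.10)
        score = min(1.0, max(0.0, parent.score + delta))
        return Agent(agent_id=new_id, parent_id=parent.agent_id, score=score)
    return modify


def run_loop(
    seed_agent: Agent,
    modify: Callable[[Agent, int], Agent],
    iterations: int,
    rng: random.Random,
) -> List[Agent]:
    """Alternate self-modification and evaluation, growing an archive."""
    archive = [seed_agent]
    for new_id in range(1, iterations + 1):
        parent = rng.choice(archive)    # any archived agent can be a parent
        child = modify(parent, new_id)  # modification + evaluation (simulated)
        archive.append(child)           # kept even if it scores below its parent
    return archive


rng = random.Random(0)
archive = run_loop(Agent(0, None, 0.20), make_toy_modify(rng), iterations=30, rng=rng)
best = max(archive, key=lambda a: a.score)
print(len(archive), round(best.score, 3))
```

The point of the sketch is the shape of the loop: the archive only grows, and the parent for each step is drawn from the whole archive, not just from the current best.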

The evolutionary archive is the key innovation. Traditional optimization would keep only the highest-scoring agent and modify from there, risking getting stuck in local optima. The DGM maintains an archive of diverse agents and can branch new modifications from any of them. The paper shows that some low-scoring “ancestor” agents produced descendants that eventually outperformed the best agents found through greedy optimization. The branching exploration, not just the self-modification, drives the results.
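The difference between greedy selection and archive-based selection can be shown in a few lines. This is an illustrative sketch only: the weighting rule (score plus a small `floor`) is an assumption chosen for simplicity, not the selection rule from the paper, and the agent names and scores are made up.

```python
import random
from typing import Dict


def greedy_parent(scores: Dict[str, float]) -> str:
    """Greedy optimization: always branch from the single best agent."""
    return max(scores, key=lambda name: scores[name])


def archive_parent(scores: Dict[str, float], rng: random.Random, floor: float = 0.1) -> str:
    """Archive-style selection: sample a parent, favoring high scorers.

    The floor (an assumed parameter) keeps low-scoring agents selectable,
    so an unpromising ancestor can still produce descendants.
    """
    names = list(scores)
    weights = [scores[name] + floor for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


# Made-up archive: names and scores are for illustration only.
scores = {"early_low_scorer": 0.20, "mid": 0.35, "current_best": 0.50}
rng = random.Random(0)
draws = [archive_parent(scores, rng) for _ in range(1000)]
print(greedy_parent(scores), sorted(set(draws)))
```

Under greedy selection, `early_low_scorer` is never chosen again. Under sampled selection it is chosen less often than the leader, but often enough that a useful lineage branching from it can still be found.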

The Results Transfer Across Models and Languages

The improvements discovered by the DGM generalize beyond the specific setup used during self-modification. An agent optimized using Claude 3.5 Sonnet also showed improved performance when run with o3-mini or Claude 3.7 Sonnet as the underlying model. A DGM whose self-improvement was guided exclusively by Python tasks showed significant gains on Rust, C++, Go, and other languages in the Polyglot benchmark.

This transferability suggests the DGM is discovering general agent design improvements (better tools, smarter workflows, more effective prompting strategies) rather than model-specific tricks or task-specific overfitting. The improvements work because they change how the agent approaches problems, not because they exploit quirks of a particular model or language.

What the DGM Discovered

The paper documents specific innovations the DGM invented for itself. Early in the SWE-bench run, it developed improved file viewing and editing tools. Later, it discovered a patch generation strategy that creates multiple candidate patches and ranks them by quality before applying the best one. It built a memory system tracking which approaches failed on similar problems. These are the same types of improvements that human developers make when building coding agents by hand, but the DGM found them through automated search rather than human engineering. The four memory patterns used in production agent systems map directly onto the capabilities DGM discovers for itself: in-context memory, session-level tracking, cross-task knowledge retrieval, and action logging for behavioral improvement.
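The multi-candidate patch strategy described above is a form of generate-then-rank. A minimal sketch, assuming a hypothetical scoring function such as "fraction of tests passed" (the patch names and scores below are invented, and the DGM's actual ranking criteria are not reproduced here):

```python
from typing import Callable, List, Tuple


def rank_patches(
    candidates: List[str],
    score: Callable[[str], float],
) -> List[Tuple[float, str]]:
    """Score every candidate patch and return them best-first."""
    if not candidates:
        raise ValueError("need at least one candidate patch")
    scored = [(score(patch), patch) for patch in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Invented scores standing in for a real evaluation of each patch.
toy_results = {"patch_a": 0.4, "patch_b": 0.9, "patch_c": 0.7}
ranked = rank_patches(list(toy_results), lambda patch: toy_results[patch])
best_score, best_patch = ranked[0]
print(best_patch, best_score)
```

In a real agent the expensive parts are generating the candidates and scoring them; the ranking step itself is trivial, which is why this pattern is easy to bolt onto an existing workflow.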

What the DGM Cannot Do

The system requires substantial computational resources. Each self-modification cycle involves running the modified agent on benchmark problems, which means hundreds of API calls to the underlying foundation model per evaluation. The process scales with the number of agents explored and benchmark problems evaluated.
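A back-of-the-envelope estimate shows how the cost scales. The function below is a sketch under a simple multiplicative assumption, and the input numbers are placeholders chosen for illustration, not figures from the paper.

```python
def estimate_model_calls(
    agents_evaluated: int,
    problems_per_eval: int,
    calls_per_problem: int,
) -> int:
    """Rough total of foundation-model API calls for one search run.

    Assumes every agent is evaluated on the same number of problems and
    every problem takes the same number of model calls, which real runs
    will not satisfy exactly.
    """
    for value in (agents_evaluated, problems_per_eval, calls_per_problem):
        if value < 0:
            raise ValueError("inputs must be non-negative")
    return agents_evaluated * problems_per_eval * calls_per_problem


# Placeholder inputs, not measurements.
total = estimate_model_calls(agents_evaluated=80, problems_per_eval=60, calls_per_problem=5)
print(total)
```

Because the terms multiply, doubling either the number of agents explored or the size of the evaluation set doubles the bill.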

The DGM’s exploration process and archive management are fixed algorithms that the system cannot modify. The agent can rewrite its coding tools, workflows, and prompts, but not the meta-algorithm that governs how self-modification happens. This is a deliberate safety constraint but also a fundamental limitation: the system cannot improve the way it improves. True recursive self-improvement would require the meta-algorithm itself to be subject to modification, which the authors leave as future work.

All experiments ran in sandboxed environments with human oversight. The safety considerations around self-modifying AI are not hypothetical. The DGM’s modifications are constrained to Python code changes evaluated on benchmarks, not arbitrary system-level access. But as these systems become more capable, the gap between “can modify its own coding tools” and “can modify anything” narrows, and the sandboxing requirements become more demanding. The Firecracker-backed microVM isolation model addresses exactly this production sandboxing requirement for deployed coding agents.

Where This Fits in the Research Trajectory

The Gödel Machine concept dates to Jürgen Schmidhuber’s theoretical proposal about two decades ago: an AI that proves its own modifications are beneficial before applying them. The DGM drops the requirement for formal proof and replaces it with empirical testing, trading theoretical guarantees for practical applicability. Concurrent work by Robeyns et al. (2025) explores a similar concept (a single agent recursively modifying itself) but without the DGM’s open-ended archive, which the paper shows is necessary to avoid stagnation.

The practical implication is that automated agent design may soon match hand-designed agents. If the pattern holds, teams building AI coding agents will shift from manually engineering tools and workflows to running DGM-style search over agent designs. The architectural gap between Codex’s cloud loop and Claude Code’s local execution model illustrates how different design philosophies produce measurably different performance profiles. DGM’s automated search is converging on a similar design space through a different path: evolution rather than engineering.

The DGM’s 50% on SWE-bench is not state-of-the-art (hand-designed agents score higher), but the rate of improvement suggests automated search could close that gap as compute budgets and foundation model capabilities increase.

The DGM is not self-improving AI in the science fiction sense. It is automated engineering of AI agent scaffolding, validated by benchmarks, constrained by sandboxes, and limited to the capabilities of its frozen foundation model. That is a more boring description. It is also a more accurate one, and the results it produces are real.

Sources: Zhang et al., arXiv:2505.22954 (v3, March 2026). Sakana AI official page. GitHub: jennyzzt/dgm. ICLR 2026 poster. Schmidhuber, Gödel Machine (2007). SWE-bench original benchmark.
