One Developer Improved 15 LLMs at Coding by Changing the Edit Tool. Grok Went From 6.7% to 68.3%.

Abstract visualization of code editing tools and benchmark data flowing between multiple AI model nodes on a dark background

In February 2026, security researcher Can Boluk changed a single variable in his open-source coding agent and re-ran a benchmark across 16 language models. Grok Code Fast 1 jumped from 6.7% to 68.3% success rate. Grok 4 Fast cut its output tokens by 61%. Gemini 3 Flash gained 5 percentage points over Google’s own best result. No model weights were modified. No prompts were rewritten. The only thing that changed was how the agent told the model to edit a file.

The result exposes a problem the AI coding industry would rather not talk about. The conversation around tools like Claude Code, GitHub Copilot, and Cursor focuses almost entirely on which model is smartest. Boluk’s benchmark shows that the infrastructure between the model’s output and the actual file change is where most failures happen. Models are not flaky at understanding code. They are flaky at expressing edits in the format the tool demands.

Three Edit Formats, Three Failure Modes

Every AI coding tool needs to solve a deceptively simple problem: the model decides what code to change, and the tool applies that change to a file. The industry has converged on three approaches, and each one breaks in a different way.

apply_patch (OpenAI Codex): The model outputs an OpenAI-flavored diff as a raw string. OpenAI likely biases the token selection process to fit this structure for Codex-variant models. But hand this format to any model that was not specifically trained on it and patch failures spike. In Boluk’s benchmark, Grok 4 had a 50.7% patch failure rate. GLM-4.7 hit 46.2%. These are capable models producing broken output because they do not speak the format.

str_replace (Claude Code and most others): The model finds exact old text and swaps in new text. Conceptually simple. But the model must reproduce every character of the old string perfectly, including whitespace and indentation. If the old string appears more than once, the edit is rejected. The “String to replace not found in file” error is so common in Claude Code that it has its own GitHub megathread with 27 linked issues. Gemini’s implementation adds some fuzzy whitespace matching, but the core problem persists: the model is burning tokens to reproduce content it already saw, and any recall error kills the edit. For the full mechanical breakdown of why str_replace fails in Claude Code specifically, MWW published a companion piece on the three root causes of the “String to replace not found in file” error and the 30-second diagnostic protocol that maps each cause to its fix.
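The exact-match and uniqueness requirements are easy to see in a few lines. This is a minimal sketch of a str_replace-style tool, not any vendor's actual implementation, but the failure mode it demonstrates is the one described above:

```python
def str_replace(source: str, old: str, new: str) -> str:
    # Minimal sketch of a str_replace-style edit tool; real implementations
    # differ, but the exact-match and uniqueness requirements are the point.
    count = source.count(old)
    if count == 0:
        raise ValueError("String to replace not found in file")
    if count > 1:
        raise ValueError(f"String matches {count} locations; edit rejected")
    return source.replace(old, new)

code = "def greet():\n    return 'world'\n"

# The model mis-recalls the quote style it saw -- one character off, edit dies:
try:
    str_replace(code, 'return "world"', 'return "there"')
except ValueError as e:
    print(e)  # String to replace not found in file
```

The same rejection fires when the recalled string appears twice in the file, which is why even a perfectly remembered snippet can fail on repetitive code.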

Neural merge (Cursor): Cursor deployed a separate fine-tuned 70B-parameter model whose only job is to take a draft edit and merge it into the file correctly. The fact that one of the best-funded AI coding companies threw an entire large model at this problem tells you how hard it is. Even then, Cursor’s own blog post acknowledges that fully rewriting the entire file outperforms their diff approach for files under 400 lines.

Prior research confirmed the pattern. Aider’s benchmarks showed that format choice alone swung GPT-4 Turbo’s success rate from 26% to 59%. JetBrains’ Diff-XYZ benchmark found that no single edit format dominates across models. EDIT-Bench found that only one model achieves over 60% pass@1 on realistic editing tasks. The common thread: the bottleneck is not intelligence. It is the mechanical act of expressing a change.

How Hashline Works

Boluk’s solution, Hashline, attacks the root cause. When a model reads a file in the Hashline format, every line comes back tagged with a 2-3 character content hash:

1:a3|function hello() {
2:f1|  return "world";
3:0e|}

When the model edits, it references those tags: “replace line 2:f1” or “replace range 1:a3 through 3:0e, insert after 3:0e.” The model does not need to reproduce the old content. It does not need to match whitespace. It points at lines using a verifiable identifier, specifies the new content, and the tool handles the rest.
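Producing that tagged read view takes only a few lines. This is a sketch of the idea; the hash algorithm, tag length, and separator here are assumptions, not necessarily what Boluk's agent uses:

```python
import hashlib

def hashline_read(source: str) -> str:
    # Render a file in a Hashline-style view: each line prefixed with its
    # number and a short content hash. Algorithm and 2-char tag length are
    # illustrative assumptions.
    out = []
    for i, line in enumerate(source.splitlines(), start=1):
        tag = hashlib.sha1(line.encode()).hexdigest()[:2]
        out.append(f"{i}:{tag}|{line}")
    return "\n".join(out)

print(hashline_read('function hello() {\n  return "world";\n}'))
```

The tags are cheap to compute on every read, so the view is always fresh relative to the file on disk.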

If the file changed since the last read, the hashes will not match, and the edit is rejected before anything gets corrupted. This is a concurrency safety mechanism that neither apply_patch nor str_replace provides. The model proves it knows what it is editing by recalling the hash, not by reproducing the entire old string.
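The stale-read rejection can be sketched the same way. Again an illustrative assumption about conventions, not the agent's actual code: the tool recomputes the hash of the referenced line and refuses to touch the file if it no longer matches.

```python
import hashlib

def _tag(line: str) -> str:
    # Short per-line content hash; length and algorithm are assumptions.
    return hashlib.sha1(line.encode()).hexdigest()[:2]

def hashline_replace(source: str, ref: str, new_line: str) -> str:
    # Apply a 'replace line N:hh' edit, rejecting stale references.
    num, want = ref.split(":")
    idx = int(num) - 1
    lines = source.splitlines()
    if idx < 0 or idx >= len(lines) or _tag(lines[idx]) != want:
        raise ValueError(f"Stale reference {ref}: file changed since last read")
    lines[idx] = new_line
    return "\n".join(lines)

src = 'function hello() {\n  return "world";\n}'
old = '  return "world";'
print(hashline_replace(src, f"2:{_tag(old)}", '  return "there";'))
```

If another process rewrites line 2 between the read and the edit, the recomputed tag differs, the edit is rejected, and nothing is corrupted.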

The technique eliminates two failure modes at once. It removes the perfect-recall requirement that causes str_replace failures, and it removes the format-specific training requirement that causes apply_patch failures on non-OpenAI models. The hash is model-agnostic. Any model that can recall a short alphanumeric tag can use it.

The Benchmark Numbers

Boluk ran 180 tasks per model, 3 runs each, across 16 models and 3 edit formats (apply_patch, str_replace, Hashline). Tasks were generated by introducing mechanical bugs into real files from the React codebase: operator swaps, boolean flips, off-by-one errors, removed guard clauses. Each task was a fresh agent session with four tools: read, edit, write, and a description of the bug in plain English.
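A toy sketch of one task-generation category, the operator swap, shows how mechanical these mutations are. The match list and its longest-first ordering (so "<=" is not clobbered by "<") are illustrative assumptions, not the benchmark's actual generator:

```python
def swap_operator(line: str) -> str:
    # Swap the first comparison operator found in a JavaScript line.
    # Longer operators are checked first so "<=" is matched before "<".
    swaps = [("===", "!=="), ("!==", "==="), ("<=", ">="),
             (">=", "<="), ("<", ">"), (">", "<")]
    for op, replacement in swaps:
        if op in line:
            return line.replace(op, replacement, 1)
    return line  # no operator found; line left unchanged

print(swap_operator("if (i < items.length) {"))  # if (i > items.length) {
```

Each mutated file, plus a plain-English bug description, becomes one task; the agent must locate the swapped operator and express the one-line fix through the edit tool under test.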

The results across models:

Grok Code Fast 1: 6.7% → 68.3% success rate (a 10x improvement)
Grok 4 Fast: output tokens down 61%
Gemini 3 Flash: 78.3% success rate (+5pp over Google's own best result)
MiniMax: success rate more than doubled

The pattern is consistent: the weakest models gained the most from the format change because their failures were overwhelmingly mechanical, not cognitive. They understood the bug. They knew the fix. They could not express the edit in a format that the tool would accept. Hashline removed that barrier.

A replication attempt by another developer on DEV Community tested Hashline against str_replace across Python, TypeScript, and Rust with different models. The results were mixed: Python penalized Hashline slightly, TypeScript was neutral, Rust was a toss-up. The replicator noted that Boluk’s benchmark used JavaScript files from the React codebase with an LSP feedback loop, which provides type errors for retry. This interaction between edit format and feedback loop likely confounded some gains. The replication confirms that edit format matters, but the magnitude of improvement depends on language, model, and feedback mechanisms.

The Vendor Lock-In Problem

Boluk’s research was not just a benchmark. It was a policy argument. While running the experiments, two things happened. Anthropic blocked OpenCode, a popular open-source coding agent, from accessing Claude through Claude Code subscriptions. And Google disabled Boluk’s Gemini account entirely for running the benchmark that showed their own model improving by 5 points.

MWW has reported on Anthropic’s subscription pricing changes that separated first-party and third-party usage. The technical reason is a real cost asymmetry: prompt caching makes first-party usage roughly 90% cheaper. But the effect is the same: third-party tools face higher costs and restricted access.

The incentive problem is structural. No vendor will optimize their edit tool for competing models. Anthropic will not tune str_replace for Grok. xAI will not tune apply_patch for Gemini. OpenAI will not tune for Claude. But an open-source agent, maintained by contributors who use different models, optimizes for all of them because each contributor fixes the failures they personally encounter.

When Perplexity launched Computer as a 19-model orchestration system, it acknowledged this reality implicitly: the best system is model-agnostic. Boluk’s work shows that model-agnostic engineering is not just a business strategy. It is where the highest-return performance improvements live.

An 8% improvement in Gemini’s success rate from changing the edit tool is larger than most model upgrades deliver. It cost $300 in API calls and zero training compute. As Boluk put it: “You’re blaming the pilot for the landing gear.”

What This Means for Developers

The practical takeaway is that before upgrading your model subscription or switching providers, measure your current tool’s edit failure rate. The “String to replace not found” error, the malformed diff rejection, the retry loop that burns tokens and time: these are infrastructure failures, not intelligence failures. A cheaper model with a better edit tool may outperform an expensive model with a broken one.

The data supports this at scale. LangChain’s team separately achieved a 13.7-point improvement on Terminal Bench 2.0, jumping from 30th to 5th on the leaderboard by optimizing only their agent infrastructure without changing models. They used three techniques: better system prompts emphasizing self-verification, improved tool definitions, and smarter context management. Meta Research published a paper on Meta-Harness, an automated system that evolves agent infrastructure using execution traces; Meta-Harness found a 7.7-point improvement over baseline while using 4x fewer context tokens.

The open benchmark code lets anyone reproduce Boluk’s results. The feature request to add Hashline to Claude Code (issue #25775) is open and actively discussed. The issue thread reveals that users have already built third-party MCP servers implementing Hashline as a workaround, but the “two tools” problem (the model must be explicitly told to prefer the MCP tool over the built-in str_replace) makes this fragile.

The edit tool problem will be solved. The question is whether it gets solved by one company, in private, for one model, or by a community, in the open, for all of them. Given that Claude Code’s 512,000-line source revealed sub-agent output leaking raw JSONL and wasting hundreds of thousands of tokens, the closed-source approach has not solved it yet either.

Boluk spent $300 on API calls. The result improved 15 models across the board without touching a single weight. Meanwhile, the companies building these tools are spending billions on the next model release. At some point, the industry will notice where the returns actually are.
