
Z.ai released GLM-5.1 as an open-weight model under the MIT license on April 7, 2026. The 744-billion-parameter Mixture-of-Experts model scored 58.4 on SWE-Bench Pro, beating Anthropic’s Claude Opus 4.6 at 57.3 and OpenAI’s GPT-5.4 at 57.7. On a separate test, it ran 655 iterations of autonomous optimization against VectorDBBench, executed more than 6,000 tool calls without human intervention, and finished at 21,500 queries per second. That number is six times the best single-session result from any other model, Claude Opus 4.6 included.
On SWE-Bench Verified, the older and more widely cited coding benchmark, GLM-5.1 scored 77.8. Claude Sonnet 4.6 scored 79.6. Claude Opus 4.6 scored 80.8. Same model, opposite ranking. The contradiction is not a bug in either benchmark. It is a feature of how Z.ai optimized the post-training pipeline and a warning that leaderboard numbers in April 2026 depend almost entirely on which test rig you pick.
Here is what actually happened during the 8-hour autonomous run, why the two benchmarks disagree, and what developers should do about it.
The 8-hour autonomous run
VectorDBBench is one of the stress tests Z.ai built into its GLM-5.1 evaluation suite. The methodology is specific. The model receives a Rust skeleton for a vector database and empty implementation stubs. It then uses tool-call-based agents to edit code, compile, run benchmarks, and profile the results. Each iteration represents one autonomous cycle of decision, action, and observation.
GLM-5, the base model released on February 11, 2026, plateaued at 3,547 queries per second. GLM-5.1 kept going.
At iteration 90, the model autonomously shifted strategy. It moved from full-corpus scanning to IVF cluster probing with f16 vector compression. That single decision reduced per-vector bandwidth from 512 bytes to 256 bytes and jumped performance to 6,400 QPS. At iteration 240, the model introduced a two-stage pipeline of u8 prescoring and f16 reranking, reaching 13,400 QPS. By iteration 655, the system had settled at 21,500 QPS. Every optimization was independently audited to confirm it worked on arbitrary new inputs and did not exploit benchmark-specific quirks.
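The shape of that two-stage trick can be sketched outside the Rust codebase. The NumPy toy below assumes 128-dimensional f32 vectors (128 × 4 bytes = 512 bytes per vector, halved to 256 in f16); the corpus size and the top-1% cutoff are illustrative choices, not Z.ai’s actual parameters:

```python
import numpy as np

DIM = 128  # assumed: 128 f32 components = 512 bytes/vector; f16 = 256 bytes

rng = np.random.default_rng(0)
corpus = rng.standard_normal((10_000, DIM)).astype(np.float32)
query = rng.standard_normal(DIM).astype(np.float32)

# Stage 1: cheap u8 prescoring. Quantize everything to 8 bits and score
# with integer math to rank candidates at a quarter of the f32 bandwidth.
lo, hi = corpus.min(), corpus.max()
scale = 255.0 / (hi - lo)
corpus_u8 = ((corpus - lo) * scale).astype(np.uint8)
query_u8 = np.clip((query - lo) * scale, 0, 255).astype(np.uint8)
coarse = corpus_u8.astype(np.int32) @ query_u8.astype(np.int32)

# Keep only the top 1% of candidates from the coarse pass.
candidates = np.argsort(coarse)[-100:]

# Stage 2: f16 reranking over the survivors (256 bytes/vector vs 512).
fine = corpus[candidates].astype(np.float16) @ query.astype(np.float16)
best = candidates[np.argmax(fine)]
```

The economics mirror the run: the expensive pass touches 1% of the data, and the cheap pass that touches everything moves half as many bytes per vector as a full-precision scan.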
KernelBench tells the same story in a different domain. GLM-5.1 delivered a 3.6x geometric mean speedup across 50 GPU kernel problems, continuing to make progress past 1,000 tool-use turns. Claude Opus 4.6 leads this benchmark at 4.2x, but its improvement plateaus earlier. The gap between the two narrows as session length increases. For an 8-hour run, the productive horizon is what matters, and GLM-5.1 extended it further than any previously measured open model. Z.ai’s technical report, “GLM-5.1: Towards Long-Horizon Tasks,” describes the pattern as an autonomous experiment, analyze, and optimize loop in which the model proactively runs benchmarks, identifies bottlenecks, adjusts strategies, and improves iteratively.
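Geometric mean is the right aggregate for per-problem speedups because it multiplies ratios instead of averaging them, so one outlier kernel cannot dominate the score. A quick sketch with made-up per-problem numbers, not KernelBench data:

```python
import math

# Hypothetical per-problem speedup ratios (KernelBench uses 50 problems;
# five made-up values are enough to show the aggregation).
speedups = [1.2, 2.5, 4.0, 8.0, 3.1]

# Geometric mean: exp of the mean of the logs, equivalent to the n-th
# root of the product of the ratios.
geomean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
```

The arithmetic mean of those five values would be 3.76; the geometric mean lands near 3.12, because the single 8.0x result counts for less.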
MWW covered the related finding that a single edit-tool change improved 15 LLMs at coding by up to 60 percentage points. The GLM-5.1 result suggests the evaluation environment and the post-training pipeline interact: the model was optimized for long-horizon stability, and the evaluation measured long-horizon stability.
Why long-horizon stability is the harder problem
A 30-second code completion lives or dies on a single forward pass. An 8-hour autonomous run lives or dies on the cumulative probability of not losing the plot across thousands of decisions. The failure modes are different. Short sessions fail on knowledge gaps, hallucination, or tool-call syntax errors. Long sessions fail in three distinct ways. The first is goal drift, where the model forgets the original objective. The second is strategy oscillation, where the model switches between incompatible approaches. The third is error accumulation, where small mistakes compound until the state is unrecoverable.
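A toy independence model makes the scaling concrete: hold per-step reliability fixed and a session’s survival probability decays exponentially with its length. The 99.9% figure below is illustrative, not a measured number, and real failures are not independent:

```python
# If each autonomous step succeeds independently with probability p, the
# chance of finishing n steps without an unrecoverable error is p**n.
p = 0.999               # assumed 99.9% per-step reliability

short_run = p ** 30     # a 30-step session: still ~97% likely to survive
long_run = p ** 6_000   # ~6,000 tool calls, as in the VectorDBBench run
```

At 6,000 steps, 99.9% per-step reliability leaves well under a 1% chance of an error-free session, which is why long-horizon work punishes error accumulation far harder than any single forward pass.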
Z.ai’s technical report attributes GLM-5.1’s extended horizon to post-training decisions aimed at three targets. First, goal alignment is reinforced explicitly during post-training rather than being inherited from pretraining. Second, scratchpad state is managed across tool calls rather than regenerated each time, which reduces the cost of remembering prior decisions. Third, the model is trained to evaluate its own intermediate progress against the original objective, which creates a built-in checkpoint mechanism. None of these are architectural changes from GLM-5. They are post-training behavior shifts layered on the same 744B-parameter MoE base.
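Those three behaviors amount to a control loop: persist the scratchpad, act through tools, and periodically re-check progress against the original objective. A minimal sketch, with every function a hypothetical stand-in rather than Z.ai’s actual interface:

```python
# Stand-in implementations; none of these are real Z.ai APIs.
def propose_action(pad):
    return f"step-{len(pad['history'])}"

def run_tool(action):
    return f"ok:{action}"

def on_track(pad):
    # Trivial goal check: is the original objective still reflected in state?
    return pad["objective"] in str(pad)

def agent_loop(objective, max_iters=200):
    # Scratchpad persists across tool calls instead of being regenerated.
    pad = {"objective": objective, "history": []}
    for i in range(max_iters):
        action = propose_action(pad)             # decide
        result = run_tool(action)                # act
        pad["history"].append((action, result))  # observe
        # Built-in checkpoint: periodically re-evaluate against the goal.
        if i % 50 == 0 and not on_track(pad):
            pad["history"].append(("realign", objective))
    return pad

pad = agent_loop("maximize QPS")
```

The checkpoint branch is the drift guard: when the state stops reflecting the objective, the loop re-injects it rather than continuing on a stale trajectory.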
The practical consequence: for an agent operator building workflows that run unattended, model selection is now a function of how long the task is expected to run. Under 30 minutes, Claude Opus 4.6’s raw reasoning quality wins. Over 4 hours, GLM-5.1’s drift resistance starts to matter more than raw capability.
Why SWE-Bench Pro and SWE-Bench Verified disagree
The two benchmarks measure different things. SWE-Bench Verified is a curated set of GitHub issues where the problem statement, test cases, and acceptance criteria were validated by human reviewers to be unambiguous. The evaluation uses a fixed instruction prompt. Models get one shot at each issue, with no iteration. The benchmark rewards tight, correct, single-pass problem solving.
SWE-Bench Pro is the newer benchmark Z.ai cites for its top-line score. It uses a 200,000-token context window, allows tailored instruction prompts, and tests real-world industrial code repair on larger repositories. It rewards extended context use, prompt engineering, and iterative repair within a session. GLM-5.1 optimized its post-training for this profile. Claude Opus 4.6 optimized for the Verified profile.
The evaluation framework matters as much as the model. On Terminal-Bench 2.0, GLM-5.1 scores 63.5 when measured with the Terminus-2 framework. It scores 66.5 with the Claude Code framework. Three-point swing, same task, same model, different test environment. Claude Code is tuned to Claude’s tool-call patterns, and GLM-5.1 inherits the lift because its tool-call format is compatible. Developers reading benchmark numbers in April 2026 need to ask three questions: which framework, which prompt, which context length. Any of those three variables alone can produce a multi-point swing.
Z.ai reports an internal coding score of 45.3 against Claude Opus 4.6 at 47.9 on its own proprietary benchmark. The methodology uses Claude Code as the framework, which favors Claude’s tool-call conventions. That GLM-5.1 reached 94.6 percent of the Opus score on an away-game setup is either a sign the model is genuinely close or a sign the benchmark needs an independent replication. Both readings are open.
The hardware story nobody is calling out
GLM-5 and GLM-5.1 were trained on 100,000 Huawei Ascend 910B chips using the MindSpore training framework. Zero NVIDIA GPUs. Z.ai was placed on the US Entity List in January 2025, which restricted the company’s access to American silicon.
A model trained entirely on non-NVIDIA hardware scoring within 1.1 points of Claude Opus 4.6 on SWE-Bench Pro contradicts a load-bearing assumption in Western AI discourse: that frontier model training requires NVIDIA. The assumption was reasonable twelve months ago. It is no longer reasonable. Chinese labs have now demonstrated a validated post-training pipeline on domestic silicon, and the result is a model that open-weight US competitors cannot match on the Pro benchmark. The geopolitical implication extends beyond Z.ai. Any future US export control aimed at restricting Chinese AI capabilities must account for the fact that the restricted path has already produced a competitive model.
What developers should actually do with this
GLM-5.1 costs $1.40 per million input tokens and $4.40 per million output tokens via the Z.ai API. A cache discount brings repeated input down to $0.26 per million. An off-peak promotion running through April 2026 discounts usage during Beijing off-peak hours. The GLM Coding Plan subscription starts at $3 per month at promotional pricing and $10 standard. Compare that to Claude Opus 4.6 at $15 per million input tokens and $75 per million output: GLM-5.1 is roughly 10x cheaper on input and 17x cheaper on output.
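The ratios follow directly from the list prices above. The 40M-input / 5M-output token split below is a hypothetical long-horizon session, not a quoted workload:

```python
# List prices quoted above, USD per million tokens.
glm_in, glm_out = 1.40, 4.40
opus_in, opus_out = 15.00, 75.00

def job_cost(in_price, out_price, in_tokens_m, out_tokens_m):
    """Cost of a job given token counts in millions."""
    return in_price * in_tokens_m + out_price * out_tokens_m

# Hypothetical 8-hour agent session: 40M input tokens, 5M output tokens.
glm = job_cost(glm_in, glm_out, 40, 5)     # 56 + 22  = 78
opus = job_cost(opus_in, opus_out, 40, 5)  # 600 + 375 = 975
```

For that workload the absolute gap is what matters: roughly $78 versus $975 for the same session, before any cache or off-peak discount.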
Compatibility is already broad. GLM-5.1 plugs into Claude Code, OpenCode, Kilo Code, Roo Code, Cline, and Droid as a drop-in model via the GLM Coding Plan. The API is OpenAI-compatible, which means existing routing layers work without modification. Perplexity’s Computer product already routes across 19 models, and a GLM-5.1 addition is trivial. Grok 4.20’s multi-agent architecture offers another orchestration pattern for teams combining open and closed models.
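OpenAI compatibility means an existing routing layer only needs a base URL and model ID swapped in; the request shape stays the same. The model ID and endpoint below are placeholders, not documented values:

```python
# Standard chat-completions request body; only the routing changes.
payload = {
    "model": "glm-5.1",  # placeholder model ID, not a documented name
    "messages": [
        {"role": "system", "content": "You are a coding agent."},
        {"role": "user", "content": "Profile this Rust hot loop."},
    ],
    "max_tokens": 1024,
}

# With the official openai client, only base_url and api_key would change
# (endpoint URL is a placeholder):
# client = OpenAI(base_url="https://<z.ai-endpoint>/v1", api_key=key)
# resp = client.chat.completions.create(**payload)
```

This is why the "drop-in" claim is credible: no tool-call translation layer is needed, only credentials and a model name.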
Self-hosting requires 8 H100 GPUs or equivalent at minimum. The FP8 quantized version roughly halves memory requirements. Local inference frameworks vLLM and SGLang both support GLM-5.1 natively.
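A self-hosted launch under those constraints might look like the following. The model ID is a placeholder, not a published repository name; the flags are standard vLLM engine arguments:

```shell
# Hypothetical launch command; "zai/glm-5.1" is a placeholder model ID.
# Flags: tensor-parallel sharding across the 8-GPU minimum, FP8 weight
# quantization to roughly halve memory, and a 200K-token context length.
vllm serve zai/glm-5.1 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --max-model-len 200000
```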
The practical use case is long-horizon iterative work: database tuning, kernel optimization, large refactors, and any task where drift over 1,000+ tool calls destroys the session. For reasoning-heavy single-shot tasks, Claude Opus 4.6 still leads: the GPQA-Diamond gap is 8 points in Claude’s favor, and the BrowseComp gap is 16 points. For fast single-shot code completion, GLM-5.1 is the slowest model in the comparison at 44.3 tokens per second.
Limitations
GLM-5.1’s headline SWE-Bench Pro number is externally validated. The internal 45.3-versus-47.9 comparison is self-reported and not independently replicated as of April 13, 2026. Z.ai has a track record of internal numbers holding up under scrutiny, since GLM-5.1’s SWE-Bench Verified score of 77.8 was externally confirmed as the highest open-source score on that benchmark, but treat the 94.6 percent figure as a preliminary claim until third-party labs publish.
Context window is 200,000 to 256,000 tokens depending on configuration, compared to 1 million on Claude Opus 4.6. Multimodal input support is absent. During Beijing afternoon peak hours, Coding Plan quota is consumed at three times the standard rate, which turns a $3-per-month plan into a much steeper effective cost for developers in incompatible time zones.
The MIT license is real and enforceable, but Chinese regulatory overlay on foundation-model deployment creates a separate risk axis for production users outside China. US enterprise legal teams will treat a Chinese-trained, Chinese-hosted model differently from a US-trained alternative, regardless of license terms. Self-hosting bypasses the regulatory question but does not address provenance concerns about training data.
What happens next
Anthropic’s unreleased Claude Mythos Preview reportedly scores 77.8 on SWE-Bench Pro. That is 19.4 points ahead of GLM-5.1. If the cadence of recent releases holds, that gap closes in months, not years. Z.ai shipped GLM-5 on February 11, Turbo on March 15, the GLM-5.1 API on March 27, and the open weights on April 7. Four releases in two months. GPT-5.4 and Gemini 3.1 Pro both have coding-specific responses planned for the second quarter of 2026.
The benchmark contradiction at the heart of this story foreshadows the rest of 2026. Leaderboard rankings will fragment by framework, prompt, and context length. Vendors will ship self-scored benchmarks on their own test rigs. Developers will need their own evaluation pipelines on their own code to decide which model to deploy. A single authoritative benchmark number is becoming less useful by the month. Both of GLM-5.1’s headline numbers, 58.4 on Pro and 77.8 on Verified, are correct. They just answer different questions.