
The single most repeated fact about xAI’s Grok 4.20 multi-agent release is false. Every major outlet covering the February 17, 2026 launch, and most of the follow-up coverage through March, describes four specialized AI agents named Grok, Harper, Benjamin, and Lucas that think in parallel and debate each other before synthesizing a response. The names come from an early speculation post on X. They are nowhere in xAI’s official documentation, nowhere in the xAI SDK, nowhere in the API schema, and nowhere in the model card.
What xAI actually shipped is architecturally different from the parliament-of-four story. The model ID is grok-4.20-multi-agent. It is configurable at either 4 or 16 agents via a single parameter. One leader agent orchestrates the rest. Sub-agent intermediate state is encrypted and not returned to the caller by default. The model does not support the OpenAI Chat Completions API. It does not accept client-side function calling or custom tools. It ignores max_tokens. These are real production constraints that determine whether you can drop this into an existing agent stack, and almost nobody covering the launch has mentioned them.
This article reads the documentation the way a developer would. It corrects the agent-name error, explains the leader-orchestration mechanism, walks through the 4-versus-16 configuration, covers the pricing math, and ends with the benchmarks and limits that actually matter.
The architecture xAI published
The grok-4.20-multi-agent model is available through xAI’s Responses API and via the xAI SDK. The documentation describes it as Realtime Multi-agent Research and frames it as an orchestration pattern rather than a new base model. In the docs’ own words: when you send a request to the multi-agent model, multiple agents are launched to discuss and collaborate on your query. Each agent contributes its own perspective, reasoning, and findings. A designated leader agent is responsible for synthesizing the discussion and presenting the final answer back to you.
That is the entire described mechanism. There is no list of named personas. There are no fixed specializations. There is a leader and some sub-agents, and the number of sub-agents is a configuration parameter.
xAI exposes the agent count through two compatible spellings. Callers using the xAI SDK set agent_count directly to 4 or 16. Callers using the OpenAI-compatible Responses API or the Vercel AI SDK set reasoning.effort to "low" or "medium" for 4 agents, or "high" or "xhigh" for 16. Every other value is rejected.
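The two spellings reduce to one small lookup. The sketch below is a hypothetical helper, not part of the xAI SDK: the function and constant names are invented for illustration, and only the four effort values and the two agent counts come from the documentation as described above.

```python
# Hypothetical helper, not part of the xAI SDK. It only encodes the
# effort-to-agent-count mapping described in the text above.
EFFORT_TO_AGENT_COUNT = {
    "low": 4,
    "medium": 4,
    "high": 16,
    "xhigh": 16,
}

VALID_AGENT_COUNTS = frozenset(EFFORT_TO_AGENT_COUNT.values())


def agent_count_for_effort(effort: str) -> int:
    """Return the agent count implied by a reasoning.effort value."""
    try:
        return EFFORT_TO_AGENT_COUNT[effort]
    except KeyError:
        raise ValueError(f"unsupported reasoning.effort: {effort!r}") from None


def validate_agent_count(count: int) -> int:
    """Accept only the two documented agent counts, 4 and 16."""
    if count not in VALID_AGENT_COUNTS:
        raise ValueError(f"agent_count must be 4 or 16, got {count}")
    return count


if __name__ == "__main__":
    for effort in ("low", "medium", "high", "xhigh"):
        print(effort, "->", agent_count_for_effort(effort))
```

Validating on the client before sending is a design choice, not a requirement: it surfaces a bad value as a local error instead of a rejected request.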
The 4-agent setup is positioned for focused queries. The 16-agent setup is positioned for multi-faceted research. xAI’s own documentation flags the trade directly: more agents means deeper analysis at the cost of higher token usage and latency.
Encrypted scratchpad state
The output behavior matters because it determines what you can audit and what you pay for.
By default, only two things come back from a multi-agent request: the leader agent’s final text and any server-side tool calls the leader made. Everything the sub-agents thought, searched, cited, or debated is encrypted and withheld from the visible response. The docs are explicit: all sub-agent state, including their intermediate reasoning, tool calls, and outputs, is encrypted and included in the response only when use_encrypted_content is set to True in the xAI SDK.
Setting use_encrypted_content=True returns an opaque blob that you cannot read but that you can pass back into the next turn of a multi-turn conversation. The blob preserves the full deliberation context so the agents can continue their work on a follow-up query. If you do not pass it back, the next turn starts cold.
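The carry-forward pattern can be sketched without touching the real SDK. In the example below everything is an assumption made for illustration: TurnResult, MultiAgentSession, and the injected send callable are hypothetical names, and fake_send stands in for whatever actual API call a caller would make. The only behavior taken from the documentation is that the blob is opaque and must be handed back on the next turn.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TurnResult:
    text: str                        # the leader agent's final answer
    encrypted_state: Optional[str]   # opaque blob, or None if not requested


# Caller-supplied transport (hypothetical): (prompt, previous_blob) -> TurnResult.
SendFn = Callable[[str, Optional[str]], TurnResult]


class MultiAgentSession:
    """Carries the opaque blob from one turn into the next."""

    def __init__(self, send: SendFn) -> None:
        self._send = send
        self._state: Optional[str] = None  # no blob yet: first turn starts cold

    def ask(self, prompt: str) -> str:
        result = self._send(prompt, self._state)
        # Store the blob unread; it is only ever handed back to the server.
        self._state = result.encrypted_state
        return result.text


def fake_send(prompt: str, previous_blob: Optional[str]) -> TurnResult:
    """Stand-in for a real API call, used only to exercise the session."""
    mode = "warm" if previous_blob is not None else "cold"
    return TurnResult(text=f"[{mode}] {prompt}", encrypted_state="opaque-blob")


if __name__ == "__main__":
    session = MultiAgentSession(fake_send)
    print(session.ask("first question"))
    print(session.ask("follow-up"))
```

Keeping the blob inside a session object means application code never has to treat it as anything other than a token to return.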
This is an unusual trust model. A developer running a sub-agent debate on a production task cannot see what the sub-agents actually said. They get the leader’s synthesis and a bill for all the reasoning tokens spent underneath. If the leader hallucinates something that one sub-agent correctly flagged, there is no straightforward way to catch it from the outside. The encrypted blob gives xAI plausible forward compatibility but gives the caller zero inspection.
Server-side tool loop
The multi-agent variant runs its tools on xAI’s servers. When you enable a tool like web_search, x_search, code_execution, or collections_search, the server performs the full agent loop without returning control to the client until the final answer is generated. This is the opposite of the client-side function calling pattern that most OpenAI-compatible integrations assume.
The consequences for developers are concrete. Client-side function calling is not supported on grok-4.20-multi-agent. Custom tools defined by the caller are not supported. The only tools the agents can use are the ones xAI hosts. Remote MCP tools are supported because they live on a server the model can reach over HTTP. Local Python functions exposed through OpenAI-style tool schemas are not.
Two additional constraints make production integration trickier than the Grok API docs for single-agent Grok 4.1 would suggest. The Chat Completions API is not supported. You must use the Responses API or the xAI SDK. And max_tokens is silently ignored. There is no way to cap output length from the client side. If you need a short answer, you ask for one in the prompt and hope the leader complies.
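A pre-flight check makes these constraints easy to test against an existing stack. The sketch below is illustrative only: IntegrationProfile and multi_agent_blockers are hypothetical names, and the api_style labels are invented for this example. The three rules it encodes are the ones stated above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Labels invented for this sketch; they are not xAI identifiers.
SUPPORTED_API_STYLES = ("responses", "xai_sdk")


@dataclass
class IntegrationProfile:
    """Hypothetical description of how an existing stack calls a model."""
    api_style: str
    client_side_tools: List[str] = field(default_factory=list)
    max_tokens: Optional[int] = None


def multi_agent_blockers(profile: IntegrationProfile) -> List[str]:
    """List the reasons this profile cannot use the multi-agent model as-is."""
    issues: List[str] = []
    if profile.api_style not in SUPPORTED_API_STYLES:
        issues.append(
            f"API style {profile.api_style!r} is not supported; "
            "use the Responses API or the xAI SDK"
        )
    if profile.client_side_tools:
        names = ", ".join(profile.client_side_tools)
        issues.append(f"client-side tools are not supported: {names}")
    if profile.max_tokens is not None:
        issues.append("max_tokens is ignored; output length cannot be capped")
    return issues


if __name__ == "__main__":
    legacy = IntegrationProfile(
        api_style="chat_completions",
        client_side_tools=["lookup_order"],  # made-up tool name
        max_tokens=256,
    )
    for issue in multi_agent_blockers(legacy):
        print("-", issue)
```

An empty list means none of the three documented blockers apply; it does not mean the migration is otherwise free.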
The pricing math the debate narrative hides
xAI’s base Grok 4.20 pricing is competitive at $2 per million input tokens and $6 per million output tokens. The multi-agent variant is listed at $10 per million input and $50 per million output on OpenRouter and other third-party resellers. That is 5 times the base input price and a little over 8 times the base output price.
The reason is that every token consumed by both the leader agent and the sub-agents is billed. Server-side tool calls made by any agent are billed at the same tool-use rates as a standard request. A single 16-agent query that does deep web search and code execution can legitimately consume tens of thousands of tokens across 17 model instances, plus tool-use surcharges. xAI’s documentation says so directly: because multiple agents may run in parallel and each can independently invoke tools, a single multi-agent request may use significantly more tokens and tool calls than a standard single-agent request.
The debate narrative, where four named agents peer-review each other for free, obscures the cost reality. This is closer to paying for 17 instances of a frontier model on every hard query.
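The arithmetic is simple enough to sketch. The rates below are the list prices quoted above; the token counts in the example are made-up illustrative inputs, not measurements, and tool fees are left as a caller-supplied number because the per-tool rates are not given here.

```python
# List prices quoted in the text, in USD per million tokens.
MULTI_AGENT_INPUT = 10.0
MULTI_AGENT_OUTPUT = 50.0
BASE_INPUT = 2.0
BASE_OUTPUT = 6.0


def request_cost_usd(input_tokens: int, output_tokens: int,
                     tool_fees_usd: float = 0.0) -> float:
    """Cost of one multi-agent request.

    Token counts are totals across the leader and every sub-agent,
    since all of them are billed. tool_fees_usd is an assumption:
    pass in whatever the tool-use charges come to.
    """
    token_cost = (input_tokens * MULTI_AGENT_INPUT
                  + output_tokens * MULTI_AGENT_OUTPUT) / 1_000_000
    return token_cost + tool_fees_usd


if __name__ == "__main__":
    print(f"input price ratio:  {MULTI_AGENT_INPUT / BASE_INPUT:.2f}x")
    print(f"output price ratio: {MULTI_AGENT_OUTPUT / BASE_OUTPUT:.2f}x")
    # Illustrative totals only: 40,000 input and 20,000 output tokens.
    print(f"example request: ${request_cost_usd(40_000, 20_000):.2f}")
```

With those illustrative totals the token cost alone comes to $1.40 for one request, before any tool-use charges.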
What the benchmarks actually show
xAI’s Alpha Arena result from January 2026, covered by Next Big Future, put a pre-release Grok 4.20 configuration at the top of a live stock-trading competition. The model turned $10,000 into between $11,000 and $13,500 across runs, with optimized configurations pushing to 34 to 47 percent returns. This is genuine and interesting, though it also reflects a specific task type that rewards fast iteration over real-time data, which is exactly what the multi-agent architecture with x_search is built for.
The publicized benchmark numbers are strong but uneven. Grok 4.20 hit 93.3 percent on AIME, a mathematical reasoning test. On Artificial Analysis’s AA-Omniscience hallucination benchmark, it posted a 78 percent non-hallucination rate, the highest any model has scored on that test. GPQA Diamond at 78.5 percent and MATH-500 at 87.3 percent put it in the top tier. The 2 million token context window matches or beats Claude Opus 4.6 for long-horizon tasks.
The Artificial Analysis hallucination result turned out to matter more than the headline framing suggested. Grok 4.20 reasoning variants now hold the lowest hallucination rate on the current AA-Omniscience leaderboard, at 17 percent. Gemini 3.1 Pro’s widely reported 38-point reduction left it at 50 percent, still 33 points higher than Grok’s reasoning variant. If the thing you care about is how often a frontier model confidently states something false, Grok 4.20 is the measurable leader, not Gemini.
Where Grok 4.20 lags is on the enterprise-task benchmarks that Claude Sonnet 4.6 dominates. SmartScope and Artificial Analysis both noted that GDPval-style Elo evaluations for financial, legal, and expert-professional tasks do not show Grok 4.20 competing at the top, which tracks with a training data mix heavy on X and light on regulated-industry corpora.
For readers comparing it to the current frontier, a three-way comparison of the architecture differences between GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro gives the honest positioning. Grok 4.20 is now a credible fourth in the race, with an orchestration trick that the other three have not productized in the same way.
The limits that matter
The limits xAI declared in the beta documentation are not minor. They are architectural, and most of them are not going away with the next point release.
Only the leader agent output is exposed. Sub-agent reasoning is encrypted and inaccessible even to the developer paying for it. This makes auditing the model’s reasoning for a production deployment harder than auditing a single-model request.
No client-side function calling. No custom tools. If your agent stack depends on calling local Python functions or proprietary internal APIs through tool schemas, you cannot use grok-4.20-multi-agent for those tasks. You can fall back to single-agent Grok 4.1 Fast for them.
No Chat Completions API. This breaks a large class of existing integrations that assume the OpenAI chat interface. Migrations to the Responses API are not trivial for codebases with complex conversation state handling.
No max_tokens. There is no mechanical way to bound cost or output length from the client. Budget guardrails have to happen at the billing layer.
And because the benchmark spread is uneven, the model’s real strengths are on tasks that benefit from parallel web research and debate-style synthesis. It is not a drop-in upgrade for coding agents that need tight tool loops over local code, and it is not an obvious fit for regulated-industry deployments where the encrypted-state trust model is itself a compliance question.
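Since max_tokens cannot bound a request, one way to approximate a billing-layer guardrail is a client-side spend tracker that refuses to start new requests once a budget is used up. The sketch below is a minimal illustration with hypothetical names (SpendGuard, BudgetExceeded); it assumes the caller can compute each request's cost from billed usage after the response arrives.

```python
class BudgetExceeded(RuntimeError):
    """Raised when no further requests should be sent."""


class SpendGuard:
    """Tracks cumulative spend and blocks new requests past a limit."""

    def __init__(self, limit_usd: float) -> None:
        if limit_usd <= 0:
            raise ValueError("limit_usd must be positive")
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def check(self) -> None:
        """Call before each request; raises once the budget is used up."""
        if self.spent_usd >= self.limit_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} of ${self.limit_usd:.2f}"
            )

    def record(self, cost_usd: float) -> None:
        """Call after each response with the cost computed from billed usage."""
        if cost_usd < 0:
            raise ValueError("cost_usd must be non-negative")
        self.spent_usd += cost_usd


if __name__ == "__main__":
    guard = SpendGuard(limit_usd=5.00)
    for cost in (1.40, 2.10, 1.80):   # made-up per-request costs
        guard.check()
        guard.record(cost)
    try:
        guard.check()
    except BudgetExceeded as exc:
        print("blocked:", exc)
```

A guard like this can only stop the next request. It cannot stop a single request from overshooting the limit, which is exactly the gap the missing output cap leaves open.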
What this sets up
The interesting thing about Grok 4.20 multi-agent is not that it invented multi-agent orchestration. Research labs have been publishing on multi-agent debate, Mixture of Agents, and verifier-augmented decoding for over a year. What xAI did was ship the first productized, priced, server-side multi-agent endpoint from a frontier lab. Anthropic’s Claude sub-agents, OpenAI’s parallel function calling, and Google’s Gemini Deep Research each hint at similar patterns, but none of them expose a single model ID with a configurable agent count and a published 4-versus-16 knob.
Meta shipped the competing bet about seven weeks later. Its Muse Spark Contemplating mode, released April 8, 2026, spawns parallel subagents inside a single model rather than across replicas of the same model. The choice between in-model parallelism and replica parallelism is now one of the live architectural debates among frontier labs. Grok 4.20 is the first commercial endpoint to ship the replica variant at scale, and Perplexity Computer ships a third variant that orchestrates across entirely different models from different labs. Three architectures, shipped within weeks of one another, solving the same problem with fundamentally different mechanisms.
If this pattern works commercially, the next round of frontier models from other labs will likely ship something that looks similar. The real question for developers is whether the encrypted-scratchpad trust model becomes the norm. For Anthropic’s Claude, where the .claude folder protocol exposes every sub-agent’s memory as inspectable files, the answer is probably no. For xAI, it already is.
The four named debating agents of the Grok 4.20 launch coverage were a story that wrote itself and wrote itself wrong. The architecture underneath is less charming and more constrained, and the production trade-offs are exactly the ones you would expect from the first lab to ship this pattern behind a paywall. The documentation has been public since the beta launched. It is still the only place to read what was actually shipped.