Gemini 3.1 Pro Cut Hallucinations 38 Points Without Learning Anything New. Its Accuracy Actually Went Down.

Google’s Gemini 3.1 Pro Preview posted a 38 percentage point reduction in hallucination rate on Artificial Analysis’s AA-Omniscience benchmark in February 2026, dropping from 88 percent on Gemini 3 Pro Preview to 50 percent on the new release. Every outlet covering the launch framed this as the most important change in the model. Towards AI called it the most underappreciated improvement. The benchmark itself confirmed the number.

Almost nobody looked at the accuracy column. Over the same three-month window, raw accuracy on the same benchmark went from 56 percent on Gemini 3 Pro Preview to 55 percent on Gemini 3.1 Pro Preview. One point lower. The model knows slightly less than it did in November. It hallucinates dramatically less because it refuses more. The entire improvement is a calibration change.

That distinction matters because the AA-Omniscience Index is designed to reward exactly this behavior. Artificial Analysis built the benchmark to penalize wrong answers as much as right ones and to charge zero penalty for refusals. A model that learns to say I don’t know when it is uncertain wins on the Index without learning anything new. Gemini 3.1 Pro won the Index this way. And on the pure hallucination rate metric, which measures how often the model guesses wrong instead of refusing, it is not even the leader. Grok 4.20 is.

This article reads the primary benchmark data, explains what the Omniscience Index actually measures, walks through the accuracy-versus-calibration distinction, and shows why the “Gemini fixed hallucinations” headline is closer to “Gemini learned to decline questions it would have answered wrong.”

What the benchmark measures

AA-Omniscience was published in November 2025 by a team at Artificial Analysis led by Declan Jackson, William Keating, George Cameron, and Micah Hill-Smith. The paper is available on arXiv as 2511.13029 and the public portion of the dataset is hosted on Hugging Face as ArtificialAnalysis/AA-Omniscience-Public. The benchmark consists of 6,000 questions spanning 42 topics across six domains: business, humanities and social sciences, health, law, software engineering, and science and math. The questions were generated by an AI agent against authoritative academic and industry sources, then filtered for unambiguity.

The scoring rule is what sets it apart from most knowledge benchmarks. Each question admits three outcomes: a correct answer, an incorrect answer, or a refusal. The Omniscience Index rewards correct answers with a point, penalizes incorrect answers with a point, and assigns zero to refusals. The raw index ranges from minus 100 to plus 100. A model that guesses randomly on every question lands near zero. A model that only attempts questions it is sure about can post a strongly positive score without knowing much.
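The rule above can be sketched in a few lines. This is an illustration of the published scoring scheme (plus one, minus one, zero, normalized to a minus 100 to plus 100 scale), not Artificial Analysis’s actual grading code; the function name and normalization by total question count are assumptions.

```python
def omniscience_index(correct: int, incorrect: int, refused: int) -> float:
    """Score on a -100..+100 scale: +1 per correct answer,
    -1 per incorrect answer, 0 per refusal, averaged over all questions."""
    total = correct + incorrect + refused
    return 100 * (correct - incorrect) / total

# A model that guesses randomly and splits 50/50 lands at zero:
print(omniscience_index(3000, 3000, 0))  # 0.0

# A cautious model that attempts only the 10% of the 6,000 questions
# it is sure about posts a positive score despite knowing little:
print(omniscience_index(600, 0, 5400))   # 10.0
```

The second call is the property the article turns on: refusing everything you are unsure about raises the Index without any new knowledge.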

That last property is the one that makes the metric interesting and the one that makes the headline number about Gemini 3.1 Pro misleading. A model can improve its Omniscience Index in two different ways. It can learn more facts, which raises accuracy. Or it can get better at knowing when it does not know, which cuts hallucination rate without changing accuracy. The metric does not distinguish between the two. The Artificial Analysis team was explicit about this in their original write-up of the Gemini 3 Pro release in November: accuracy and hallucination rate have little correlation, and the Gemini 3 Pro launch was an accuracy story with no hallucination improvement whatsoever.

The two variables, separately

Here is what the current AA-Omniscience leaderboard shows, pulling directly from Artificial Analysis’s public data as of March 2026.

On Accuracy, the ranking is Gemini 3 Pro Preview (high) at 56 percent, Gemini 3.1 Pro Preview at 55 percent, and Gemini 3 Flash Preview (Reasoning) at 54 percent. The 3.1 release is one point behind its predecessor on raw knowledge. The three Google models cluster tightly, and none of them improved.

On Hallucination Rate, the ranking is Grok 4.20 0309 v2 (Reasoning) at 17 percent, Grok 4.20 0309 (Reasoning) at 22 percent, and Claude 4.5 Haiku (Non-reasoning) at 25 percent. Gemini 3.1 Pro is not in the top three. It sits at 50 percent, 33 points higher than xAI’s reasoning variant.

On the combined Omniscience Index, Gemini 3.1 Pro leads with 33, followed by Gemini 3 Pro Preview (high) at 16, and Grok 4.20 0309 v2 (Reasoning) at 15. The Index favors Gemini because Gemini has the highest accuracy and a reasonable hallucination rate. It is the weighted combination that puts Gemini on top, not either individual metric.

Two things fall out of this. First, Gemini 3.1 Pro’s Index gain of 17 points over Gemini 3 Pro Preview (high) is entirely a calibration story. Accuracy barely moved. Hallucination rate dropped from 88 to 50. The model learned to refuse. Second, if you care specifically about the question how often does this model confidently state something false, Grok 4.20 is the model you want, not Gemini 3.1 Pro. Almost none of the coverage of either model landed on that fact.
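The arithmetic is worth checking. Assuming hallucination rate is measured as wrong guesses as a share of non-correct responses (so incorrect answers equal the rate times one minus accuracy; this definition is inferred from the published numbers, not quoted from the paper), the leaderboard’s accuracy and hallucination figures reproduce both Index scores to within rounding:

```python
def index_from_rates(accuracy: float, hallucination_rate: float) -> float:
    """Omniscience Index (on a -100..+100 scale) from two fractions:
    accuracy = correct / total, and hallucination rate = the incorrect
    share of non-correct responses (the remainder are refusals)."""
    incorrect = hallucination_rate * (1 - accuracy)
    return 100 * (accuracy - incorrect)

# Gemini 3 Pro Preview: 56% accuracy, 88% hallucination rate
print(index_from_rates(0.56, 0.88))  # ~17.3, vs the published 16

# Gemini 3.1 Pro Preview: 55% accuracy, 50% hallucination rate
print(index_from_rates(0.55, 0.50))  # ~32.5, vs the published 33
```

The 17-point Index gain falls almost entirely out of the second argument. Hold accuracy at 55 percent and the hallucination drop from 88 to 50 accounts for the whole move.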

What Google likely changed

Google has not published training details for Gemini 3.1 Pro beyond the DeepMind model card, which notes that 3.1 Pro is based on Gemini 3 Pro. The public signal, given the accuracy-versus-calibration split, strongly suggests two specific interventions.

First, calibration-focused post-training. RLHF and constitutional AI style reward models can be tuned to penalize confident wrong answers more than they penalize appropriate refusals. This is a post-training technique that does not require the model to learn new facts. It requires the reward model to punish hallucination differently. The Anthropic line of work on honesty-tuned reward models and the separate literature on I don’t know supervised examples both produce exactly this signature: accuracy flat, refusal up, hallucination rate down.
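A minimal sketch of why asymmetric penalties produce this signature: with reward plus one for a correct answer, minus lambda for an incorrect one, and zero for a refusal, a model maximizing expected reward should attempt a question only when its confidence p satisfies p minus lambda times (1 minus p) greater than zero, i.e. p greater than lambda over (1 plus lambda). The penalty weight here is hypothetical, not a published Google training parameter.

```python
def attempt_threshold(penalty: float) -> float:
    """Minimum confidence at which answering beats refusing, given
    reward +1 (correct), -penalty (incorrect), 0 (refusal).
    Solves p - (1 - p) * penalty > 0 for p."""
    return penalty / (1 + penalty)

# Symmetric scoring, as in the AA-Omniscience rule itself:
# answering is worthwhile above 50% confidence.
print(attempt_threshold(1.0))  # 0.5

# A reward model that punishes confident errors three times as hard
# pushes the policy to refuse anything below 75% confidence.
print(attempt_threshold(3.0))  # 0.75
```

Raising the penalty moves only the attempt-versus-refuse line, which is exactly the pattern in the data: accuracy flat, refusals up, hallucination rate down.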

Second, reasoning-mode abstention. Artificial Analysis separately tests Gemini 3.1 Pro’s thinking mode against its non-thinking mode. The granular thinking parameter added in 3.1 (low, medium, high) lets the model spend more tokens on a question before committing. A model that spends more inference-time compute on a hard question can recognize its own uncertainty better than a model that must answer in one pass. The compounding returns on abstract reasoning tasks that Artificial Analysis’s team flagged in the ARC-AGI-2 trajectory apply to calibration for the same reason: more internal deliberation produces better uncertainty estimates.

Neither of these interventions teaches the model new facts. Both improve it on the AA-Omniscience Index without changing what it knows. That is the mechanism the headline numbers hide.

Why Grok 4.20 wins on hallucination rate

xAI’s Grok 4.20 reasoning variants show the complementary pattern. On accuracy, Grok lags Gemini. On pure hallucination rate, Grok is at the top of the leaderboard. The explanation is similar in structure: the multi-agent reasoning loop gives the model multiple internal perspectives to compare before committing, which is what makes calibration work. A leader agent that receives conflicting sub-agent answers on a factual question has more signal to decide we don’t actually know this than a single model producing a single pass.

The full mechanism behind the multi-agent architecture, including the 4-versus-16 agent knob, the encrypted scratchpad state, and the production constraints that make it difficult to drop into existing stacks, is covered separately in the Grok 4.20 multi-agent architecture piece. The relevant fact for this article is that xAI’s reasoning variants achieve hallucination rates more than 30 points lower than Gemini’s, on the same benchmark, with a different calibration mechanism.

This is awkward for the Gemini solved hallucination narrative. If hallucination is the thing you care about and you are willing to accept a lower accuracy ceiling in exchange for a model that reliably declines what it does not know, Grok 4.20 is measurably better. If you want a model that knows more things and refuses appropriately, Gemini 3.1 Pro is the Index leader. The two models solve different parts of the same problem.

What the three Google models tell us

The accuracy column contains a second detail that matters. Gemini 3 Pro Preview, Gemini 3.1 Pro Preview, and Gemini 3 Flash Preview (Reasoning) span just two points on raw knowledge: 56, 55, and 54 percent. Gemini 3 Flash is a smaller model. Gemini 3.1 Pro is a later release. The cluster suggests that Google’s knowledge ceiling has plateaued across the current Gemini 3 series. The architecture’s factual recall is bounded, the scaling gains from adding parameters or training compute are small, and the visible improvements in the Gemini 3.1 Pro release are concentrated in calibration, reasoning depth, and tool use rather than in new facts learned.

Artificial Analysis noted in November that factual recall correlates closely with model size on AA-Omniscience but hallucination rate does not. The corollary is that cutting hallucination is a training-procedure problem, not a scale problem. Any lab can do it if they are willing to trade attempt rate for precision. Google did. xAI did. Anthropic did it for Claude 4.1 Opus before anyone else, which is why Claude 4.1 Opus held the top of the Omniscience Index before Gemini 3 Pro arrived.

What this changes for practitioners

If you are evaluating frontier models for a production deployment where factual reliability matters, the takeaway is that the AA-Omniscience Index is not a single ranking. It is two rankings combined with a weighting rule. You should pull the separate accuracy and hallucination columns before choosing a model.

For knowledge-heavy tasks where you can tolerate a refusal, Gemini 3.1 Pro is the strongest choice. Its refusals mean your application has to handle I don’t know responses gracefully, and if you cannot, the 55 percent accuracy ceiling becomes your real ceiling. For tasks where a confident wrong answer is worse than a refusal, Grok 4.20 reasoning variants are the strongest choice. For tasks where cost is the binding constraint, Claude 4.5 Haiku at a 25 percent hallucination rate and much lower cost is worth measuring against both.

The broader context, covered separately in the three-way architecture comparison between GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro, is that frontier models are now differentiating on specialized axes rather than converging on a single ranking. Gemini 3.1 Pro’s calibration win is real. It is also narrower than the headlines suggest. Knowing what the benchmark rewards is the only way to read the result honestly. The answer is not that Gemini learned more. The answer is that Gemini learned to stop guessing.
