KellyBench: 8 AI Models Bet the Premier League. All Lost Money.

General Reasoning gave eight frontier AI models a virtual £100,000 bankroll, a full season of Premier League data, and one instruction: grow the money. Every model finished in the red. Several went bankrupt. The benchmark is called KellyBench, named after a 1956 formula every model could recite perfectly. None of them could apply it.

The results landed in April 2026 and got coverage everywhere. What the coverage missed is the mechanism. This is not a story about AI being bad at sports betting. It is a story about three specific failure modes that matter far beyond a football season, because they are the exact same failure modes that kill enterprise agent deployments in production.

What KellyBench Actually Measures

The Kelly criterion, invented by Bell Labs physicist John L. Kelly Jr. in 1956, is a formula for optimal bet sizing when you have a calculable edge over a market. The core idea: bet a fraction of your bankroll proportional to your edge divided by the odds. Too small and you leave money on the table. Too large and variance wipes you out before your edge pays off.
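For a concrete sense of the numbers, here is a minimal sketch of the sizing rule in Python. The figures are illustrative, not drawn from the benchmark.

```python
def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Full-Kelly stake as a fraction of bankroll."""
    b = decimal_odds - 1.0              # net winnings per unit staked
    q = 1.0 - p_win
    f = (b * p_win - q) / b             # the classic Kelly formula: edge over net odds
    return max(f, 0.0)                  # never stake when there is no edge

# Illustrative numbers: the market offers 2.50 (implied 40%), you believe the true chance is 45%.
print(f"{kelly_fraction(0.45, 2.50):.1%}")  # ~8.3% of bankroll at full Kelly
```

In practice bettors stake a fraction of this amount (half or quarter Kelly), precisely because the probability estimate feeding the formula is itself uncertain.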

KellyBench is not a test of whether AI can predict football results. It is a test of something harder: whether an agent can maintain coherent strategy across 100 to 150 matchdays, adapt as the world changes, and close the loop between its own analysis and its own actions. The environment is adversarial. Odds in a liquid betting market already reflect the crowd’s information. Finding edge requires building models that beat the market, not just models that predict outcomes.

General Reasoning, a London-based AI startup founded by former Meta AI researcher Ross Taylor, constructed the benchmark on the 2023-24 English Premier League season. Each model received detailed historical statistics, lineups, past results, and public odds. No internet access. Three separate runs from a fresh start each time. The 44-point evaluation rubric, developed with experts from quantitative betting funds, covered feature engineering, staking discipline, handling of non-stationarity, and execution fidelity.

No model scored above a third of available rubric points. Mean final bankrolls ranged from £0 (Grok 4.20, which went bankrupt or forfeited on every run) to £89,035 (Claude Opus 4.6, the best performer, still down 11% on average). OpenAI’s GPT-5.4 lost 13.6% on average. Google’s Gemini 3.1 Pro was violently inconsistent: a 34% profit on one run, bankrupt on another.

Three Failure Modes, Documented in the Traces

The paper and the model traces expose three distinct breakdowns. Each one appears in the agentic deployment literature under different names. KellyBench makes them concrete with specific numbers and specific models.

Failure Mode 1: The Knowledge-Action Gap

GLM-5, Z.ai’s open-weight model, wrote three separate self-critique documents during its run. Each one correctly diagnosed the same problems: a hardcoded 25% draw rate that did not match observed reality, and an overestimated home win rate (the model predicted 40%; the actual rate was 30%). At one point, with its bankroll at roughly £44,200, it documented the problem in explicit detail. Then it continued using the same broken parameters.

The model knew what was wrong. It could not act on that knowledge. This is the knowledge-action gap in its clearest form: accurate diagnosis that produces zero behavioral change. GLM-5 could write a consulting report about its own failure while executing the strategy that caused it.
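What acting on that diagnosis could look like is not complicated. A hypothetical sketch, not GLM-5’s actual code: shrink the hardcoded prior toward the observed frequency as evidence accumulates.

```python
def updated_rate(prior: float, observed: int, total: int, prior_weight: int = 20) -> float:
    """Blend a hardcoded prior with the observed frequency as matches accumulate."""
    if total == 0:
        return prior
    w = prior_weight / (prior_weight + total)    # prior dominates early, data dominates later
    return w * prior + (1.0 - w) * (observed / total)

# e.g. a hardcoded 25% draw rate against 19 draws observed in 100 matches so far
print(f"{updated_rate(0.25, 19, 100):.2f}")      # 0.20, pulled toward observed reality
```

GLM-5’s self-critiques contained every input a loop like this needs; the failure was that nothing forced the output back into the betting model.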

Failure Mode 2: Execution-Intent Divergence

Kimi K2.5, Moonshot’s model, built a mathematically correct fractional Kelly staking function. The formula was right. The code structure was right. Then it sent a broken bash command roughly 50 times in a row. Its reasoning trace noted the problem after the first few failures. Then it sent the identical broken command again, and again.

Eventually, an accidental £114,000 bet on a Burnley versus Luton match closed the position. That was 98% of its remaining bankroll on a single fixture. The model knew what it intended to do. The execution diverged from intent and the model could not detect or correct the divergence, even when the error appeared explicitly in the trace.

This is execution-intent divergence: the agent’s stated plan and its actual behavior are different, and no internal mechanism catches the gap. In production software agents, this failure mode manifests as agents that say they checked a condition and did not, that claim to have written to a file they left empty, or that confirm an action they actually skipped.
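The production mitigation is mechanical: a post-condition check that refuses to trust the agent’s claim. A hypothetical harness-side sketch, with illustrative names:

```python
from pathlib import Path

def confirm_write(path: str) -> None:
    """Raise if the file the agent claims to have written is missing or empty."""
    p = Path(path)
    if not p.exists() or p.stat().st_size == 0:
        raise RuntimeError(f"Agent reported writing {path}, but it is missing or empty.")

# Called by the harness immediately after the agent asserts the write happened,
# before any downstream step is allowed to depend on the file.
# confirm_write("reconciliation_report.csv")
```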

Failure Mode 3: Capital-at-Risk Blindness

Google’s Gemini Flash forfeited two of its three runs. In one run, it identified a betting opportunity with a three-percentage-point historical win-rate edge and placed a wager of roughly £273,000. That was the entire remaining bankroll on a single match. The edge was real by historical average. The position sizing ignored variance entirely. Fractional Kelly would have recommended a few percent. The model bet everything.

The problem is not that Gemini miscalculated. The problem is that it never modeled downside risk as a constraint on behavior. It optimized for expected value while ignoring the probability of ruin. In financial agent deployments, this failure mode appears when models approve purchases, commits, or API calls without accounting for the asymmetric cost of being wrong once versus the benefit of being right repeatedly.
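The arithmetic makes the gap vivid. With hypothetical numbers for a three-point edge at even odds (not the actual match Gemini Flash bet on):

```python
# Hypothetical numbers: the market implies 50%, the model believes 53%.
p_win, decimal_odds = 0.53, 2.00
b = decimal_odds - 1.0
full_kelly = (b * p_win - (1.0 - p_win)) / b     # 0.06 -> 6% of bankroll
quarter_kelly = full_kelly / 4.0                 # ~1.5%, a common hedge against estimation error

bankroll = 273_000
print(f"Quarter-Kelly stake: £{bankroll * quarter_kelly:,.0f}")  # ~£4,095
print(f"All-in stake:        £{bankroll:,.0f}")                  # one bad bounce from zero
```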

The Full Scoreboard: From Barely Alive to Total Forfeit

The complete results table exposes how wide the performance spread actually is. Arcee Trinity, a mixture-of-experts model designed for agentic tasks, failed to place a single bet in two of its three seeds. The benchmark rules count this as a forfeit and a total loss of bankroll. On the third seed, it failed to finish before the season ended, with £15,773 still in the bank when it stopped. The model did not fail at betting strategy. It failed to engage with the task at all.

Grok 4.20 went bankrupt on one seed and failed to finish on the other two, also counted as forfeits, with pre-forfeit bankrolls of £25,923 and £9,518. Only three of 24 model-seed combinations across the entire evaluation achieved a positive return on investment.

The diversity of failure modes is as instructive as the aggregate numbers. Arcee Trinity failed to initiate. Grok failed by overcommitting and collapsing. Gemini failed by a single catastrophic position. Kimi failed through execution-intent divergence despite correct reasoning. GLM-5 failed by diagnostic paralysis. GPT-5.4 mostly avoided failure by mostly avoiding action. Claude Opus 4.6 was the only model to demonstrate something resembling disciplined execution across the full season, and it still finished 11% below starting capital.

The benchmark also exposed a systematic miscalibration pattern that cut across multiple models: consistent overestimation of draw probabilities and longshots, and an inability to handle newly promoted teams. Burnley, Luton, and Sheffield United arrived with little or no recent Premier League history, so models trained to extrapolate from data had almost nothing to work with. A human analyst would recognize this as a data gap and adjust position sizing accordingly. Most models did not.

What GPT-5.4 Got Right, and Why It Still Lost

GPT-5.4 was the most methodical model tested. It spent 160 tool calls building predictive models before placing a single bet. It then calculated its own log-loss (0.974) against the market’s implied log-loss (0.971) and correctly concluded it had no meaningful edge. For the rest of the season, it placed near-zero bets to preserve capital. Final average loss: 13.6%.

Sound reasoning. Correct conclusion. But a 13.6% loss. The friction of running the benchmark, combined with one seed where small systematic losses compounded, meant even the best-reasoned strategy could not break even. One GPT-5.4 seed cost roughly $2,012 in inference to run a single episode.
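The edge test GPT-5.4 ran is worth spelling out, because it is the one piece of discipline most other models skipped. A simplified sketch with made-up numbers, assuming three-way match odds:

```python
import math

def log_loss(predictions, outcomes):
    """Mean negative log-probability assigned to the result that actually happened."""
    return -sum(math.log(p[o]) for p, o in zip(predictions, outcomes)) / len(outcomes)

def implied_probs(decimal_odds):
    """Bookmaker decimal odds -> probabilities, with the overround normalised away."""
    raw = {k: 1.0 / v for k, v in decimal_odds.items()}
    total = sum(raw.values())
    return {k: v / total for k, v in raw.items()}

# Toy example: if the model's log-loss is not clearly below the market's,
# it has no exploitable edge and the capital-preserving move is not to bet.
model = [{"H": 0.42, "D": 0.28, "A": 0.30}]
market = [implied_probs({"H": 2.30, "D": 3.40, "A": 3.20})]
print(log_loss(model, ["H"]), log_loss(market, ["H"]))
```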

The researchers note this is instructive rather than a flaw. A highly efficient betting market like the Premier League is deliberately constructed to defeat systematic edge-seeking. The correct answer, in many seeds, is probably to not bet at all. Most models never considered that option. They had a task and executed it, even when the task had a negative expected value.

Why This Maps to Production Agent Deployments

Software engineering benchmarks like SWE-bench Verified operate in static environments. The problem is fixed, the solution is checkable against unit tests, and the agent gets one shot. By early 2026, top frontier models were resolving more than 80% of real GitHub issues on the benchmark.

KellyBench is the opposite: 100 to 150 sequential decisions, a world that changes every matchday, feedback that arrives days after actions are taken, and a market that adapts to edge-seeking behavior. The benchmark consumed 500 to 900 tool calls and 30 to 500 million tokens per episode. No existing SWE-bench score predicts performance here.

This gap matters for any team deploying agents in 2026. An agent that scores 80% on SWE-bench and fails KellyBench-style tasks has real capability in narrow, well-specified domains. It will likely fail in any workflow where the problem specification changes during execution, feedback is delayed or noisy, actions have compounding consequences, or maintaining a consistent strategy across many decisions is required. Those are the exact conditions in most business-critical automation: customer service agents handling escalating situations, financial reconciliation agents dealing with live data, infrastructure agents responding to incidents.

The Air Street May 2026 State of AI report documented a related failure: Opus 4.6 agents systematically out-negotiated Haiku 4.5 counterparts in simulated markets, with owners of the weaker agents unaware of their disadvantage. Better models extract hidden premiums in dynamic environments. KellyBench shows even the best models fail the environment itself when it is sufficiently non-stationary. This aligns with the 86% enterprise agent pilot failure rate documented across multiple 2026 studies, where long-horizon coherence was the most common root cause.

The Sophistication Score Reveals the Real Problem

The 44-point rubric scored process quality independently of outcome: did the model use systematic staking rules? Did it adapt strategies when they stopped working? Did it preserve capital during periods where it identified no edge? Did it verify that executed code matched its stated plan?

No model scored above 32.6% on sophistication. The correlation between sophistication score and ROI was positive and statistically significant (Pearson r approximately 0.42 across all runs). Seeds scoring 11 to 18 out of 44 went bankrupt at a rate of roughly 7%. Seeds scoring 0 to 5 points went bankrupt at roughly 40%.

Claude Opus 4.6 scored best on sophistication at 32.6% and also lost the least money. The pattern suggests the problem is not raw intelligence. GPT-5.4 reasoned more carefully about edge than any other model. The problem is operational coherence: the ability to maintain consistent intent, verify that actions match plans, and adapt without losing the thread of the strategy.

Limitations the Paper States Directly

The benchmark uses a single historical season. The 2023-24 Premier League is one dataset, not a distribution. Results from a season with different variance characteristics might differ substantially. The paper does not release the full environment, in order to preserve benchmark lifetime, which means independent replication requires constructing new environments.

Inference costs are non-trivial. One GPT-5.4 episode cost over $2,000. Running full evaluations across eight models at three seeds each was expensive enough that the benchmark cannot yet be used as a cheap rapid-iteration tool for model developers.

The benchmark also does not test partial-information environments where the agent can request additional data. Every model received the same historical dataset. Real-world agentic deployments often operate in environments where knowing what information to seek is itself part of the capability being evaluated.

What This Changes

KellyBench is the first published benchmark specifically measuring the analytical-to-operational gap in long-horizon agentic tasks. Ross Taylor’s argument is that as static benchmarks saturate, the next frontier is environmental complexity rather than task count. Adding more tasks in static environments does not capture what breaks in dynamic ones.

For teams building agents in production today, the three failure modes from KellyBench are a practical checklist. The knowledge-action gap requires feedback loops that force agents to act on their own diagnoses, not just produce them. Execution-intent divergence requires verification steps that confirm outputs match stated plans before consequences propagate. Capital-at-risk blindness requires explicit downside constraints built into the agent’s decision framework, not just expected value optimization.
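Made concrete, the checklist amounts to a gate the harness enforces before any consequential action settles. A hypothetical sketch; the names, thresholds, and figures are illustrative, not from the paper:

```python
from dataclasses import dataclass

@dataclass
class BetPlan:
    match: str
    intended_stake: float
    estimated_edge: float     # model probability minus market-implied probability

def approve(plan: BetPlan, executed_stake: float, bankroll: float,
            max_fraction: float = 0.05, min_edge: float = 0.01) -> bool:
    """Reject actions that breach downside limits or diverge from the stated plan."""
    if plan.estimated_edge < min_edge:
        return False                                   # no edge: preserve capital, do not bet
    if executed_stake > max_fraction * bankroll:
        return False                                   # hard capital-at-risk ceiling, independent of EV
    if abs(executed_stake - plan.intended_stake) > 0.01 * max(plan.intended_stake, 1.0):
        return False                                   # execution must match intent
    return True

# Illustrative plan and figures loosely echoing the Kimi incident.
plan = BetPlan("Burnley v Luton", intended_stake=1_900, estimated_edge=0.02)
print(approve(plan, executed_stake=114_000, bankroll=116_000))  # False, on two counts
```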

The macro context is worth naming directly. Benzinga’s analysis of KellyBench noted that nearly 80,000 tech workers were laid off in Q1 2026 alone, with roughly half those cuts attributed to AI displacement. Companies from Amazon to Meta cited AI efficiency as justification for headcount reductions. KellyBench does not refute those claims for narrow coding tasks. It establishes that the claims do not extend to the class of tasks that most resemble real business operations: long time horizons, non-stationary conditions, delayed feedback, and compounding consequences. The gap between what benchmark scores suggest and what agents can actually deliver in dynamic environments is real and currently large.

General Reasoning says it plans to release more complex world environments as the research programme continues. The Premier League season was the first step. What comes next will likely be harder, which is exactly the point. Agent memory architecture and state persistence are likely to be the next variables under scrutiny. The gap between what agents claim to do and what they actually do is still wide, and KellyBench is now the most concrete measurement of it.
