OpenAI spent $8.67 billion on inference through September 2025, more than quadrupling the company’s entire 2024 inference spend in under nine months. By February 2026, OpenAI reports $25 billion in annualized revenue and still projects $25 billion in cash burn for the year. Running frontier AI at scale is one of the most capital-intensive operations in technology history, and the economics are not improving as fast as the revenue numbers suggest.
Anthropic tells a different version of the same story. The company hit $19 billion in annualized run-rate revenue by early March 2026, up from $9 billion at the end of 2025. CEO Dario Amodei confirmed the number at a Morgan Stanley TMT conference, noting that $6 billion in annualized revenue was added in February alone. Yet in a March 2026 legal filing, Anthropic’s CFO stated the company had generated over $5 billion in cumulative lifetime revenue while spending over $10 billion on inference and training combined. Every dollar of revenue has cost two dollars to generate.
This is the defining economic tension in AI right now. Revenue is growing fast. Spending is growing faster. The open question is whether unit economics improve before the capital committed in 2025 and 2026 runs out.
The Inference Cost Mechanism
Every token a user generates consumes compute. That cost is not fixed. It scales with model size, context length, and request volume, and all three have increased simultaneously since 2024.
OpenAI’s adjusted gross margin dropped from 40% in 2024 to 33% in 2025, as inference costs more than quadrupled in a single year. According to Epoch AI analysis and leaked financial documents covered by Ed Zitron’s newsletter, OpenAI spent $1.8 billion on inference in all of 2024; by September 2025, year-to-date spending had already reached $8.67 billion. Sacra projects a $14.1 billion inference bill for 2026, before accounting for the GPT-5.x generation models that are more expensive to serve than their predecessors.
The mechanism behind this growth is structural. Transformer attention scales quadratically with context length: each newly generated token attends to every token already in the window, so per-step cost grows with the context and the total cost of a session grows quadratically. As OpenAI pushed toward 128,000-token contexts and longer, and as users moved from single-turn queries to multi-turn conversations and agentic workflows, the compute per interaction multiplied. One agentic loop that hits the model 10 to 20 times per user task costs as much as 10 to 20 separate queries. Enterprise customers running code generation agents or document processing pipelines routinely send sessions with hundreds of thousands of tokens. By 2026, inference accounts for roughly 85% of enterprise AI budgets, according to FinOps for AI research published in Q1.
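The scaling is easy to see in a short sketch. The token counts below are illustrative assumptions, not any provider’s actual workload data:

```python
# Back-of-the-envelope sketch of how context length and agentic
# loops multiply inference cost. All numbers are illustrative.

def attention_work(context_tokens: int) -> float:
    """Total attention work to generate a full sequence. Token t
    attends to all t earlier tokens, so summing over the sequence
    gives O(n^2) growth."""
    return context_tokens * (context_tokens + 1) / 2

short_query = attention_work(2_000)      # single-turn chat message
long_session = attention_work(128_000)   # long-context session
print(long_session / short_query)        # ~4,100x the attention work

# An agentic loop that calls the model 15 times per user task costs
# roughly 15 separate queries, before any context growth at all.
print(15 * attention_work(20_000) / short_query)  # ~1,500x
```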
OpenAI’s own internal projections show losses of $14 billion in 2026 against approximately $13 billion in sales, with total spending at roughly $22 billion. HSBC analysts concluded that OpenAI is unlikely to be profitable by 2030 and still faces a $207 billion funding shortfall to reach its stated growth targets; the company does not expect to be cash-flow positive before 2030. Anthropic targets that milestone in 2028, contingent on revenue growth continuing.
The GPU Supply Chain
The capital requirements trace directly to NVIDIA’s hardware dominance. OpenAI, Anthropic, Google, and Meta all compete for the same constrained supply of H100s, H200s, and now Blackwell B200 GPUs. No other vendor matches NVIDIA’s production volumes for AI inference chips at scale.
OpenAI has moved to reduce this dependency. The company is co-designing a custom inference chip with Broadcom, manufactured by TSMC, targeting 10 gigawatts of capacity between 2026 and 2029. Separately, a multi-year deal with Cerebras Systems valued at over $10 billion gives OpenAI 750 megawatts of ultra-low latency compute. GPT-5.3-Codex-Spark already runs on Cerebras’s Wafer Scale Engine 3 as a latency-first serving tier. Amazon deployed Cerebras CS-3 systems through AWS Bedrock alongside Trainium chips, claiming 5x token throughput over standard H100 deployments for the same workload.
These custom chips target a specific hardware bottleneck. During the decode phase of text generation, GPUs are memory-bandwidth-limited, not compute-limited. The raw FLOP capacity sits idle waiting for data to move between HBM memory and compute cores. Cerebras’s wafer-scale design integrates large on-chip SRAM directly adjacent to compute, eliminating the bottleneck. The 5x throughput improvement reflects this architectural difference.
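A rough roofline calculation makes the imbalance concrete. The hardware figures below are published H100 SXM specifications; the 70-billion-parameter dense model at batch size 1 is an illustrative assumption:

```python
# Why decode is memory-bandwidth-bound: per generated token, every
# weight streams from HBM once but does only ~2 FLOPs of work.
# Hardware figures are public H100 SXM specs; the 70B dense model
# at batch size 1 is an illustrative assumption.

PEAK_FLOPS = 989e12       # H100 dense BF16 throughput, FLOP/s
HBM_BANDWIDTH = 3.35e12   # H100 HBM3 bandwidth, bytes/s

params = 70e9             # hypothetical dense model
bytes_per_param = 2       # bf16 weights

flops_per_token = 2 * params                # multiply + add per weight
bytes_per_token = bytes_per_param * params  # full weight read per token

compute_time = flops_per_token / PEAK_FLOPS    # ~0.14 ms
memory_time = bytes_per_token / HBM_BANDWIDTH  # ~42 ms

print(memory_time / compute_time)  # ~300x: compute waits on memory
```

Batching amortizes each weight read across concurrent requests, which is why serving economics hinge on utilization; wafer-scale SRAM attacks the same ratio from the hardware side instead.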
The 5 gigawatts of compute OpenAI has committed to (3 GW inference, 2 GW training) requires new data center construction, power grid connections, and cooling systems across multiple geographies. Five gigawatts of continuous draw is roughly the electricity consumption of a mid-sized European country. The Stargate project, a joint venture with SoftBank announced in January 2025, commits up to $500 billion to AI data center construction over four years. That infrastructure does not materialize in months; meaningful custom silicon deployment runs through 2027 at the earliest.
The Circular Capital Problem
OpenAI closed a $110 billion fundraise in early 2026 at an $840 billion valuation, backed by Amazon, NVIDIA, and SoftBank. Anthropic closed a $30 billion Series G in February 2026 at a $380 billion valuation. Both rounds create a structurally awkward dynamic that financial analysts have flagged repeatedly.
OpenAI raises capital from Microsoft, then spends much of it at Microsoft Azure for inference. SoftBank invests in Stargate, whose proceeds partly flow back to Stargate’s corporate partners. Amazon invests in OpenAI, which then commits to 2 gigawatts of AWS Trainium compute. Investors are, in part, funding their own future revenues. The circularity is not a scandal, but it means that investors should think carefully about whether they are backing an AI model company or the infrastructure companies those models run on.
Neither company prices at cost. API prices are set to subsidize developer adoption and market capture, funded by venture capital and hyperscaler cross-subsidies. AI inference costs fell roughly 78% through 2025 for some provider tiers. The current pricing floor is artificial. Enterprises building AI-dependent workflows in 2026 are doing so at pricing that will normalize upward as capital discipline tightens. A Turing Award-winning Google researcher published a paper in early 2026 identifying inference costs as the primary economic bottleneck preventing AI companies from reaching profitability.
Why Efficiency Gains Are Not Lowering the Total Bill
This is the counterintuitive part. Google’s TurboQuant technique, covered in depth on this site, demonstrated that 3-bit KV cache quantization can compress memory requirements by 6x with minimal accuracy loss (see: Google TurboQuant Compresses LLM Memory by 6x). Meta, Google DeepMind, and multiple academic groups converged in early 2026 on sparse attention architectures reducing per-token compute cost on long contexts by 40 to 60%. Moonshot AI published Attention Residuals achieving equivalent pre-training loss with 1.25x less compute. A technique called IndexCache reuses token-level attention indices across transformer layers and requests.
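To make the memory stakes concrete, here is an illustrative KV-cache sizing. The model shape below is a hypothetical assumption; the 6x factor is the TurboQuant figure cited above:

```python
# Illustrative KV-cache sizing: why ~6x compression matters at long
# context. The model shape is hypothetical, not any specific product.

layers = 80
kv_heads = 8              # grouped-query attention
head_dim = 128
context = 128_000         # tokens held in the cache
bytes_fp16 = 2

# Keys and values, per token, across all layers:
kv_bytes_per_token = 2 * kv_heads * head_dim * bytes_fp16 * layers

fp16_cache_gb = kv_bytes_per_token * context / 1e9
print(fp16_cache_gb)      # ~41.9 GB per long-context sequence

print(fp16_cache_gb / 6)  # ~7 GB after ~6x KV quantization
```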
Each of these techniques makes individual inference cheaper. None of them has reduced the total inference bill, because usage scales faster than efficiency improves. As tokens get cheaper, developers build applications that consume more tokens. Agentic workflows that previously cost too much to run become standard. The total compute bill grows even as the unit cost falls.
This is the dynamic William Stanley Jevons documented in 19th-century coal consumption: improved steam engine efficiency made steam power economically attractive across more applications, which increased total coal use rather than decreasing it. The same loop runs in AI. OpenAI’s API price cuts of 2025 expanded developer adoption and total token volume, which expanded inference spending faster than the cuts improved per-token margins.
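The loop fits in a few lines of arithmetic, with purely illustrative numbers:

```python
# The Jevons loop in miniature: per-token cost falls, the volume it
# unlocks rises faster, and the total bill grows. Illustrative only.
unit_cost = 1.0
token_volume = 1.0

unit_cost *= 0.5       # efficiency gain: per-token cost halves
token_volume *= 4.0    # agentic apps become viable, volume quadruples

print(unit_cost * token_volume)  # 2.0 -- the total bill still doubles
```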
The Sora shutdown in March 2026 illustrated the sharpest consequence of this dynamic. OpenAI shut down the Sora video generation app as the company realigned spending toward enterprise contracts ahead of its planned IPO (covered here: Why OpenAI Killed Sora). The estimated $15 million daily inference cost made Sora unsustainable against a consumer product that did not generate equivalent revenue. Applications that burn compute without proportional revenue are the first to be cut as labs optimize their cost structure for investor scrutiny.
Anthropic’s Structural Advantage
Anthropic’s 40% gross margin versus OpenAI’s 33% reflects real differences in both model design and business mix. Claude models were designed with inference efficiency as a primary concern from the start of post-training. The company’s B2B focus means its customer base skews toward API access and enterprise contracts rather than free consumer usage, which reduces the cross-subsidy drag. Eight of the Fortune 10 are now Claude customers. Over 500 customers spend more than $1 million annually, up from a dozen two years ago.
Claude Code is the primary revenue driver for the company’s recent growth. The product hit $2.5 billion in annualized revenue by February 2026, more than doubling since the beginning of the year. A developer on a $200 monthly Max plan can generate estimated compute costs of over $100 per day at heavy usage, a ratio that Anthropic has acknowledged and has begun addressing through rate limits and tiered access. The economics of code generation agents are better than free consumer chat but still require significant capital to sustain at scale.
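The figures quoted above imply a stark ratio. The 22 heavy-usage days per month is an assumption for the sketch:

```python
# Rough unit economics of a heavy Claude Code user, from the figures
# quoted above. The 22 heavy-usage days/month is an assumption.
plan_price = 200            # $/month, Max plan
est_daily_compute = 100     # $/day at heavy usage (estimate cited above)
heavy_days = 22

monthly_compute = est_daily_compute * heavy_days
print(monthly_compute / plan_price)  # ~11x: compute cost vs. plan revenue
```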
Epoch AI’s revenue trajectory analysis projects the crossover point, when Anthropic’s annualized revenue surpasses OpenAI’s, around August 2026 at approximately $43 billion each. Both companies have internally signaled they expect slower growth in the second half of 2026. At $43 billion in revenue with current cost structures, neither company is yet profitable. The crossover matters for competitive signaling. It does not change the fundamental math: revenue is growing but spend is growing faster.
What Happens to Pricing
The current API pricing is a bet that usage volume, once established, justifies the acquisition cost. The prices are subsidized. The question is when the subsidy ends.
HSBC’s conclusion, that OpenAI faces a $207 billion funding shortfall, implies that even the current fundraise trajectory does not close the gap without dramatic margin improvement. OpenAI’s internal forecasts project gross margins improving to 52 to 67% in coming years, but falling short of the prior 70% goal by 2029. Getting from 33% to 52% requires either meaningfully lower inference costs, higher prices, or both.
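The arithmetic of that margin move is unforgiving, holding one lever fixed at a time:

```python
# What moving gross margin from 33% to 52% implies, one lever at a
# time. Pure arithmetic on the figures quoted above.
m0, m1 = 0.33, 0.52

# Prices held fixed: cost per revenue dollar must fall from 0.67 to 0.48.
cost_cut = 1 - (1 - m1) / (1 - m0)
print(f"costs fall {cost_cut:.0%}")     # ~28%

# Costs held fixed: prices must rise enough to dilute the cost share.
price_rise = (1 - m0) / (1 - m1) - 1
print(f"prices rise {price_rise:.0%}")  # ~40%
```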
Enterprises signing multi-year AI contracts in 2026 should model two scenarios. In the first, compute efficiency gains from custom silicon and architectural improvements outpace demand growth, and prices hold or fall further. In the second, demand growth continues to outpace efficiency gains, capital discipline eventually reasserts itself, and API prices rise by 50 to 100% over the 2027 to 2029 window. The second scenario is more consistent with the math as it currently stands. Designing for model agnosticism now, the ability to swap providers as pricing shifts, is the most durable architectural decision available to developers building on top of frontier AI.
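A minimal sketch of what that abstraction looks like, assuming nothing about any vendor’s actual SDK surface (the ChatClient protocol and provider stubs are illustrative):

```python
# Provider-agnostic model access: application code depends only on
# an interface, so a pricing shift becomes a call-site change. The
# protocol and stubs below are illustrative, not any vendor's SDK.
from typing import Protocol

class ChatClient(Protocol):
    def complete(self, prompt: str, max_tokens: int) -> str: ...

class ProviderA:
    """Stub standing in for one vendor's SDK call."""
    def complete(self, prompt: str, max_tokens: int) -> str:
        return f"[provider-a response to: {prompt[:40]}]"

class ProviderB:
    """Stub standing in for a competing vendor's SDK call."""
    def complete(self, prompt: str, max_tokens: int) -> str:
        return f"[provider-b response to: {prompt[:40]}]"

def run_task(client: ChatClient, prompt: str) -> str:
    # Only the interface is referenced here; swapping providers as
    # pricing shifts never touches application logic.
    return client.complete(prompt, max_tokens=1024)

print(run_task(ProviderA(), "Summarize this contract."))
```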
The efficiency path and the capital path cannot both win on the same timeline. Every technique paper published in Q1 2026 is a bet on efficiency winning before the capital runway ends. Whether that bet pays off determines whether the current pricing is a permanent floor or a temporary promotion.