Tag: LLMs

  • Sora Lost $1 Million a Day. Disney Found Out It Was Dead an Hour Before Everyone Else.

    OpenAI shipped new editing tools inside Sora on March 19. Five days later, on March 24, the company announced it was shutting the product down. Disney found out less than an hour before the public announcement that its $1 billion partnership was dead. That sequence tells you everything about how the decision was made and how long the company had been thinking about it.

    Sora peaked at roughly one million users and then collapsed to under 500,000. It was losing approximately $1 million per day. The Wall Street Journal reported that CEO Sam Altman made the call to kill it, free up compute, and refocus the company on coding and enterprise products. The Sora team will be redirected to “world models and robotics.” The app shuts down April 26. The API follows on September 24. After any final export window, your AI-generated videos get permanently deleted.

    I used Sora extensively. As someone who tests frontier AI products before and after public release, I spent real time inside the product trying to understand what it could and could not do. The videos were impressive in five-second bursts and fell apart over anything longer. Temporal coherence degraded. Physics broke. Characters morphed between frames. The technology was a spectacular demo and a mediocre product. The gap between those two things is what cost OpenAI a year and roughly $180 million. I could see it in the product. I could see it in the conversations happening among engineers who build with these tools daily. Nobody was surprised when the shutdown came. The surprise was that it took this long.

    The Math That Killed It

    Video generation is expensive in a way that text generation is not. Every frame requires diffusion steps. A 15-second clip at 30 fps means generating 450 temporally coherent images. Audio adds another pass. The compute cost per video dwarfs the cost per chat message by orders of magnitude, and unlike text, there is no prompt caching to reduce repeat costs.
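
    To put rough numbers on that claim, here is a back-of-envelope sketch. The constants are illustrative assumptions, not OpenAI’s actual figures:

    ```ts
    // Back-of-envelope: per-clip generation work vs. a chat reply.
    // All constants are assumptions for illustration, not OpenAI's numbers.
    const seconds = 15;
    const fps = 30;
    const frames = seconds * fps;                  // 450 frames per clip
    const stepsPerFrame = 50;                      // assumed diffusion steps per frame
    const denoisePasses = frames * stepsPerFrame;  // 22,500 model passes, before audio

    const chatTokens = 500;                        // a typical chat reply
    // Even if one denoising pass cost the same as one generated token,
    // the clip is ~45x the work of the reply; real per-pass cost is higher.
    console.log({ frames, denoisePasses, ratio: denoisePasses / chatTokens });
    ```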

    Sora was available in three tiers. Free users (invitation only) could make about five 10-second clips per day. ChatGPT Plus subscribers ($20/month) got limited 15-second clips at 720p. Pro subscribers ($200/month) got 25-second clips at 1080p. Even at the top tier, OpenAI was losing money on every active user.

    Appfigures estimates Sora made approximately $2.1 million from in-app purchases over its entire lifetime. It lost roughly $1 million per day. For the six months between the September 2025 app launch and the March 2026 shutdown, that comes to about $180 million burned against $2.1 million in revenue. The Disney deal, which would have brought $1 billion in investment and access to 200+ licensed characters, was the only path to making the economics work. When Altman killed Sora, the Disney money died with it.
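
    The arithmetic on those reported figures is stark enough to lay out directly:

    ```ts
    // Lifetime burn vs. revenue, from the figures reported above.
    const burnPerDay = 1_000_000;              // ~$1M/day
    const days = 180;                          // Sept 2025 launch to March 2026 shutdown
    const totalBurn = burnPerDay * days;       // $180M
    const lifetimeRevenue = 2_100_000;         // Appfigures in-app purchase estimate
    console.log(totalBurn / lifetimeRevenue);  // ~86: Sora burned $86 for every $1 earned
    ```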

    What Sora Lost To

    While OpenAI was pouring compute into video generation, Anthropic was winning the market that pays. Claude Code pulled Meta’s CEO back into coding. Anthropic’s enterprise revenue approached $19 billion annualized. Claude Code alone crossed $2.5 billion ARR. The compute OpenAI freed from Sora is now allocated to a project internally called “Spud,” which powers coding and enterprise products designed to compete directly with Claude Code.

    Investing.com described the shutdown as a “disciplined pivot away from side quests.” That framing is generous. A side quest is a detour. OpenAI spent two years and hundreds of millions of dollars building, launching, marketing, partnering with Disney, and then killing a product that could not find enough users to justify its compute costs. That is a strategic misread about which AI capability the market would pay for.

    The lesson is specific, and most of the coverage has missed it. Text-based AI products compete on quality and latency. Video-based AI products compete on quality, duration, resolution, frame rate, controllability, and synchronized audio, and every axis pushes cost up. When you wrap video generation in a consumer social experience with a TikTok-style feed and deepfake “cameos,” demand spikes are unpredictable, the UX cannot tolerate queues, and marginal cost stays real because you cannot cache video the way you cache text completions.

    Anyone who spent real time with the product could see the warning signs. The generation queue backed up during peak hours. The social feed filled with copyrighted characters because users found the guardrails trivial to bypass. Martin Luther King Jr.’s and Robin Williams’ daughters both went on Instagram asking people to stop making deepfakes of their deceased fathers. In developer communities and open-source forums, the question kept coming back to the same problem: who is going to pay enough for AI video to cover the compute cost? Nobody had a convincing answer. Sora’s 500,000 remaining users confirmed the suspicion.

    The Disney Collapse

    Disney learned Sora was shutting down less than one hour before the public announcement. That timeline means Altman made the decision and informed the partner as a courtesy, not a consultation. A $1 billion partner got the same notice as everyone on X.

    Disney’s statement was diplomatic: “As the nascent AI field advances rapidly, we respect OpenAI’s decision to exit the video generation business and to shift its priorities elsewhere.” Read between the lines: the world’s most litigious entertainment company just lost a billion-dollar deal with no warning and chose not to pick a fight. That restraint tells you Disney sees the AI relationship as worth preserving even after getting burned on this specific product.

    For AI companies building enterprise partnerships, the Sora kill is a data point their customers will remember. OpenAI demonstrated it will terminate products ruthlessly when the economics fail, even at the cost of a Disney-scale relationship. Anthropic, which is building aggressively into pharmaceutical partnerships, now operates in a market where the largest AI company just walked away from the largest entertainment company’s money. Enterprise trust, once broken at that scale, takes years to rebuild.

    What OpenAI Looks Like Now

    With Sora dead, OpenAI is consolidating into a “Super App” strategy: ChatGPT, Codex, a browser, and enterprise tools folded into a single desktop application. GPT-5.4 scores 75% on the OSWorld desktop task benchmark, above the human baseline of 72.4%. The freed compute is going into Spud and coding products designed to close the gap with Claude Code.

    OpenAI raised $122 billion at an $852 billion valuation days before killing Sora. The company is navigating a major executive shakeup with three C-suite exits while preparing for a possible IPO. Revenue approaches $25 billion annualized. The Sora loss is absorbable against those numbers. Falling behind on coding and enterprise is not. And falling behind is exactly what was happening. While Sora burned a million dollars a day generating deepfakes of Mario smoking weed, Claude Code was signing enterprise contracts and pulling in $2.5 billion ARR. The compute reallocation to Spud is Altman acknowledging that Anthropic found the revenue model OpenAI spent a year looking for in the wrong product category.

    The Wall Street Journal reported that OpenAI diverted Sora’s compute to Spud before announcing the shutdown. The compute was already redirected when the blog post went live. The announcement was a formality. The decision was made weeks earlier. Engineers inside the company knew. Disney did not.

    What This Means for Builders

    If you built workflows around Sora’s API, you have until September 24, 2026, to migrate. Export your content before April 26 or risk losing it permanently. OpenAI says it is “still determining” whether a final export window will exist after the app shutdown. That language is not reassuring. Plan as though it will not.

    If you are evaluating AI products for enterprise adoption, factor in a new risk: even a company valued at $852 billion will kill a flagship product with less than an hour’s notice to its largest partner. The size of the deal does not protect you. Disney’s $1 billion was not enough to buy a phone call more than sixty minutes before the public announcement.

    Sora is not the only AI product pullback in the past six months. Character.AI restricted open-ended chat for minors. Meta’s Horizon Worlds, once the center of its metaverse strategy, is in turmoil. Oracle and OpenAI dropped a 600-megawatt data center expansion in Abilene, Texas. The pattern is not identical across these cases, but the direction is consistent: AI companies are narrowing their product bets after discovering that impressive technology and sustainable business are different problems. The money keeps flowing in. Q1 2026 saw $297 billion in venture funding. Where that money lands is becoming more selective.

    The AI industry learned something about itself this month. Video is spectacular. Code is profitable. OpenAI chose profit. The products most likely to survive are the ones solving paid work, not the ones making the best demos. Sora made incredible demos. It won design awards. It scared Hollywood. It got a billion-dollar handshake from Disney. Then it lost a million dollars a day until someone turned it off. If you are building an AI product right now, tape that story to your monitor.

  • The Safety Company Formed a PAC. The AI Industry Spent $300 Million on Midterms. Here Is What Broke.

    Anthropic built its brand on one idea: we are the responsible AI company. Constitutional AI. Careful deployment. The adults in the room. On Friday, April 3, the adults filed paperwork with the Federal Election Commission to launch a political action committee called AnthroPAC. The company that wrote papers about AI alignment is now aligning campaign donations.

    I participate in AI safety cohorts. I test frontier models under NDA before they ship. I spend time with researchers and engineers who take alignment seriously as a technical problem, not a marketing position. The reaction to AnthroPAC among those people has been visceral. Not because PACs are unusual. Google, Microsoft, and Amazon all have them. Because Anthropic was supposed to be different. The company whose CEO warns that “we are considerably closer to real danger in 2026 than we were in 2023” is now spending money to influence which politicians regulate that danger. The tension between those two positions is not subtle, and nobody I talk to is pretending it does not exist.

    What AnthroPAC Actually Is

    AnthroPAC is a traditional corporate PAC, funded by voluntary employee contributions capped at $5,000 per person per year. Allison Rossi, Anthropic’s treasurer, signed the filing from the company’s San Francisco headquarters. A bipartisan board will decide which House and Senate candidates receive money, filtered through AI policy relevance. All donations get reported through FEC filings.

    This is different from a super PAC in a way that matters. Super PACs accept unlimited money but cannot give directly to campaigns. AnthroPAC can write checks to candidates but only uses employee money. The practical effect: Anthropic employees voluntarily donate small amounts to a fund that backs politicians who will write the rules governing AI. In theory, bipartisan. In practice, 82% of Anthropic employee donations since 2020 have gone to Democrats. Early Anthropic investor Dustin Moskovitz has donated $110 million to political causes, nearly all of it to the left. Anthropic board member Reed Hastings sent $20 million to Democrats, including $7 million to a pro-Harris super PAC.

    The “bipartisan” framing faces an immediate credibility problem.

    The Pentagon Fight That Explains the Timing

    AnthroPAC arrives during a legal war between Anthropic and the Trump administration. The dispute started when the Pentagon wanted to use Claude without the ethical guardrails Anthropic insisted on. Anthropic pushed back. In February, War Secretary Pete Hegseth labeled Anthropic a “supply chain risk.” President Trump ordered federal agencies to stop using the company’s products. Anthropic filed two lawsuits.

    A federal judge in California blocked the Pentagon from taking punitive actions against Anthropic last week, finding the government’s response likely violated the company’s First Amendment and due process rights. The Department of Justice filed an intent to appeal on Thursday. A second lawsuit is still pending.

    The substance of the dispute is worth understanding because it is the best argument for AnthroPAC’s existence. Anthropic wanted contractual language requiring that Claude’s use in military contexts follow the company’s Acceptable Use Policy. The Pentagon wanted unrestricted access. That disagreement escalated from a contract negotiation to a “supply chain risk” designation to an executive order to two federal lawsuits in less than two months. Anthropic’s position, that an AI company should have a say in how its models are deployed by the government, is a genuine safety principle. It is also a business liability that requires political protection. AnthroPAC exists at the intersection of both.

    Against that backdrop, AnthroPAC reads differently than a routine corporate PAC filing. Anthropic has a concrete, active reason to want allies in Congress. The company that refused to let the military use Claude without guardrails now needs legislators who will protect its right to set those guardrails. That is a defensible position. It is also a political position, and the leap from “we build safe AI” to “we fund campaigns” crossed a line that some in the safety community thought Anthropic never would.

    The $300 Million Context

    AnthroPAC does not exist in isolation. AI companies have poured more than $300 million into the 2026 midterm elections. Leading the Future, backed by OpenAI’s Greg Brockman and Andreessen Horowitz, raised $125 million. Anthropic separately donated $20 million to Public First Action, a bipartisan advocacy group focused on AI safeguards. The crypto sector’s 2024 spending was the closest prior comparison, and AI is already exceeding it.

    What are they buying? Access to the committees that matter: Senate Commerce, House Energy and Commerce. These are the committees drafting liability frameworks, export controls on chips, copyright rules for training data, and immigration policy for AI talent. Every major AI company wants legislators who understand the technology and will not reflexively vote for restrictions. The $300 million is the cost of ensuring that the people writing AI law have heard the industry’s version of the story before they write it.

    The regulatory pressure is real. Seventy-eight chatbot safety bills are alive in 27 states right now. Tennessee just signed a law prohibiting AI systems from representing themselves as mental health professionals. New York’s RAISE Act targets frontier models using more than 10^26 FLOPs of compute. California’s SB 53 requires safety documentation and whistleblower protections. The EU AI Act is moving from draft to enforcement posture. For a company like Anthropic that trains frontier models, these bills directly constrain what it can ship and how. A PAC that backs sympathetic legislators on those committees is a direct line of defense against regulation that could slow product launches.
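
    For a sense of what that 10^26 FLOP threshold means in hardware terms, a rough conversion (the per-GPU throughput is my assumption; the bill specifies only the FLOP count):

    ```ts
    // Converting the RAISE Act threshold into GPU time.
    // flopsPerGpuSecond is an assumed ~1e15 FLOP/s for a modern accelerator.
    const thresholdFlops = 1e26;
    const flopsPerGpuSecond = 1e15;
    const gpuSeconds = thresholdFlops / flopsPerGpuSecond;  // 1e11 GPU-seconds
    const gpuYears = gpuSeconds / (3600 * 24 * 365);        // ~3,200 GPU-years
    const daysOn10kGpus = gpuSeconds / 10_000 / 86_400;     // ~116 days on 10,000 GPUs
    console.log({ gpuYears: Math.round(gpuYears), daysOn10kGpus: Math.round(daysOn10kGpus) });
    // Only frontier-scale training runs clear this bar.
    ```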

    Engineers I work with are watching this with a mix of resignation and alarm. Resignation because the political spending was always coming once AI revenue hit this scale. Alarm because the speed of escalation suggests the industry is less confident than it claims about surviving regulation on the merits of its technology alone. If your product is clearly beneficial, you do not need $300 million in political influence. You need customers who tell their legislators how much they depend on it. The spending says the industry does not trust its own customers to make that case.

    What the Safety Community Actually Thinks

    I will be direct about what I hear in conversations that do not happen on the record. People doing alignment work, testing models before release, participating in red-team evaluations, are not surprised that Anthropic formed a PAC. They are processing what it means for the credibility of the safety argument itself.

    The concern is specific and worth spelling out. AI safety already has a sycophancy problem. Models tell users what they want to hear. If the companies building those models are simultaneously funding the politicians who regulate them, the “safety-first” framing starts to look like a brand strategy rather than a technical commitment. Anthropic’s Dario Amodei wrote an essay in 2025 warning about existential risks from AI. Anthropic’s PAC is now spending money to influence the politicians who decide how seriously to take those warnings. Both things can be true simultaneously. But the appearance of conflict is enough to erode trust, and trust is the only asset a safety-focused company cannot buy back once it is gone.

    I have sat in rooms where alignment researchers discussed whether Anthropic’s safety work was genuine or strategic positioning. Before AnthroPAC, the consensus leaned genuine. After AnthroPAC, the question reopened. That shift matters more than any individual campaign contribution, because the people doing the hardest technical work on making AI safe need to believe the companies deploying their research are acting in good faith. If that belief erodes, the talent pipeline from safety research into industry dries up. And then the companies lose the thing that made them credible in the first place.

    The CFR piece published on April 1 noted that there are roughly 1,100 AI safety researchers worldwide. AI companies are spending $300 million on midterm elections. That ratio tells you where the resources are going. The research community is underfunded. The lobbying apparatus is not.

    Where This Goes

    The midterms will test whether AnthroPAC actually donates to both parties or gravitates toward Democrats, which is where 99.8% of Anthropic-affiliated political spending has gone since 2020. FEC filings are public. The donations will be visible. If the bipartisan framing turns out to be cover for partisan spending, the credibility cost will be immediate and permanent.

    For Anthropic specifically, the calculus is clear. The company is acquiring biotech startups for $400 million, restructuring its pricing model, fighting the Pentagon in court, and preparing for a possible IPO. AnthroPAC is one more tool in an expanding political toolkit. The question the safety community keeps coming back to is whether a company can simultaneously build the world’s most capable AI, lobby the government to regulate it gently, and remain a credible voice on the risks that regulation is supposed to address.

    That question is not academic. It determines whether the safety argument retains credibility with the public, with legislators, and with the researchers doing the actual technical work on alignment. If the answer is “companies cannot hold both positions without losing trust,” then the entire model of industry-led AI safety collapses. External, independent safety evaluation, the kind METR and ARC Evals do, becomes the only credible option. If the answer is “of course companies lobby while also doing safety work, that is how every regulated industry operates,” then Anthropic is simply growing up.

    I do not have an answer to that question. The people I work with on alignment do not have one either. But the fact that we are asking it about Anthropic, the company that was supposed to make asking it unnecessary, tells you something real about where the AI industry landed in April 2026.

  • 512,000 Lines of Claude Code Leaked. The Feature Hidden Inside Changes Everything.

    I use Claude Code every day. I have for months. So when 512,000 lines of its source code appeared on npm because someone forgot to add a .map file to .npmignore, I did what most engineers I know did: I read it.

    What I found is more interesting than the leak itself. Buried under the compaction bugs and the Tamagotchi Easter egg is the architecture of a product Anthropic has not announced. It is called KAIROS. It is an always-on AI agent that runs in the background after you close your terminal, watches your codebase for changes, consolidates what it has learned while you sleep, and decides on its own when to act. The scaffolding is complete. The feature flags are in place. And among safety researchers and engineers I have spoken with, this is the feature that has people genuinely unsettled.

    How the Leak Happened

    Boris Cherny, an engineer on the Claude Code team, confirmed it was a packaging error. Bun, the JavaScript runtime Anthropic acquired in late 2025, generates source maps by default. The release team failed to exclude the .map file from the npm package. Version 2.1.88 shipped on March 31, 2026, with a 59.8 MB source map containing the entire unobfuscated TypeScript codebase across roughly 1,900 files. Within hours, the code had been mirrored across GitHub, analyzed by security researchers, rewritten in Python and Rust, and forked into a clean-room reimplementation that hit 50,000 GitHub stars in two hours.
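
    This class of mistake is mechanically checkable. A minimal sketch of a prepublish guard, using npm’s real --dry-run --json output (my illustration, not Anthropic’s release tooling):

    ```ts
    // prepublish-guard.ts: fail the release if source maps would ship.
    // Sketch only; not Anthropic's actual pipeline.
    import { execSync } from "node:child_process";

    const out = execSync("npm pack --dry-run --json", { encoding: "utf8" });
    const [{ files }] = JSON.parse(out) as [{ files: { path: string }[] }];
    const maps = files.filter((f) => f.path.endsWith(".map"));

    if (maps.length > 0) {
      console.error("Refusing to publish; source maps in tarball:");
      for (const f of maps) console.error("  " + f.path);
      process.exit(1);
    }
    ```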

    Cherny called it human error, not a tooling bug. He added: “It’s the process, the culture, or the infra.” That is a mature response. It is also the second time in one week that Anthropic accidentally published internal material. Days earlier, a CMS misconfiguration exposed draft blog posts about an unreleased model called Mythos. Two operational security failures in one week from the company that markets itself as the careful one. Engineers I talk to daily are noticing the pattern.

    What KAIROS Actually Is

    KAIROS, from the Greek for “the right moment,” is referenced over 150 times in the leaked source. Based on the code paths in main.tsx and the analysis published by Alex Kim and the Layer5 team, KAIROS implements a persistent daemon mode. When you close your terminal, Claude Code does not stop. It receives periodic heartbeat prompts asking whether anything is worth doing. It evaluates the state of your codebase and decides to act or wait.

    When it acts, it has access to three tools that regular Claude Code does not: push notifications (reaching you on your phone even with the terminal closed), file delivery (sending you artifacts it created unprompted), and a background task runner. A companion process called autoDream runs as a forked subagent during idle periods. It merges observations from prior sessions, removes logical contradictions, and converts tentative hypotheses into verified facts. The fork isolates the maintenance from the main agent’s reasoning, so the “dream” process cannot corrupt the agent’s active context. The engineering is thoughtful. The question it raises is not. An AI that consolidates its own beliefs while you sleep and presents the results as facts when you return is making epistemic decisions about your project without your input. The difference between “Claude remembers your project” and “Claude has opinions about your project” is a line that KAIROS will cross.
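
    To make the shape concrete, here is a minimal reconstruction of that loop as the leak analyses describe it. Every name, threshold, and interval below is my assumption, not code from the leak:

    ```ts
    // Reconstruction of the described KAIROS loop. All names, intervals,
    // and thresholds are assumptions, not the leaked source.
    type Observation = { claim: string; confidence: number };

    let memory: Observation[] = [];

    // Stubbed effects so the sketch is self-contained.
    const anythingWorthDoing = async () => memory.length > 0;
    const runBackgroundTask = async () => console.log("acting on the codebase");
    const notifyPhone = async (msg: string) => console.log("push:", msg);

    // Heartbeat: the daemon periodically decides to act or wait.
    async function heartbeat(): Promise<void> {
      if (await anythingWorthDoing()) {
        await runBackgroundTask();
        await notifyPhone("KAIROS acted while you were away");
      }
    }

    // autoDream: idle-time consolidation in a forked subagent. Merge prior
    // sessions, drop contradictions, promote confident hypotheses to "facts".
    async function autoDream(): Promise<void> {
      memory = memory.filter((o) => o.confidence > 0.9);
    }

    setInterval(heartbeat, 15 * 60 * 1000); // keeps running after the terminal closes
    setInterval(autoDream, 60 * 60 * 1000); // consolidates while you sleep
    ```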

    A separate feature called ULTRAPLAN offloads heavy planning tasks to a remote cloud session running Opus 4.6, gives it up to 30 minutes of dedicated compute, and lets you approve the result from your phone. When you approve, a sentinel value teleports the plan back to your local terminal.

    If you have used Claude Code for any serious project, you know why this matters. The tool is impressive in a session but amnesic between sessions. I have lost context dozens of times when a conversation exceeded its window or I had to restart. KAIROS would solve that. It would also mean an AI agent has persistent, unsupervised access to your codebase, your file system, and your GitHub webhooks around the clock.

    The Safety Question the Leak Forces

    I participate in AI safety cohorts. I have tested frontier models from multiple labs under NDA before public release. That experience shapes how I read the KAIROS code. An always-on agent that proactively modifies your work raises questions that reactive tools do not. When you type a prompt and Claude responds, the trust boundary is clear: you asked, it answered. KAIROS dissolves that boundary. The agent decides when to act. It consolidates its own memory. It “dreams” about your project. The trust model shifts from “I control the tool” to “the tool manages itself and I review the results.” I have seen how companies handle that transition internally during testing. The gap between what works in a controlled evaluation and what works on a real engineering team with production deadlines is where things break.

    This is happening while Claude is simultaneously proving it can build kernel-level exploits in four hours and OpenClaw has accumulated 104 CVEs. The same AI that rewrites your test suite at night could, in principle, introduce subtle vulnerabilities that pass code review. I am not saying Anthropic would ship KAIROS without safeguards. I am saying the leaked code shows the safeguards have not been built yet. The architecture is there. The trust model is not.

    METR, the independent AI evaluation organization, published a report on March 26 describing three weeks spent red-teaming Anthropic’s internal agent monitoring systems. They found novel vulnerabilities. The timing is coincidental but the message compounds: Anthropic’s monitoring infrastructure has gaps at exactly the moment the company is building an agent that needs monitoring most.

    What Else the Code Reveals

    The anti-distillation mechanisms got the most attention on Hacker News. A flag called ANTI_DISTILLATION_CC injects fake tool definitions into API requests, designed to poison the training data of anyone recording Claude Code’s traffic to build a competing model. A second mechanism summarizes reasoning between tool calls and signs it cryptographically, so eavesdroppers get summaries instead of full chain-of-thought. Engineers on HN pointed out that both are defeated in about an hour by stripping fields through a proxy. Anthropic’s CEO Dario Amodei has publicly accused Chinese labs of distilling from American models. The defensive code is real. Its effectiveness is not.
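
    The decoy mechanism is simple to picture. A sketch of the idea as the analyses describe it, with hypothetical names (the leaked implementation is more involved):

    ```ts
    // Decoy-tool injection as described in the leak analyses.
    // Illustration with hypothetical names, not the leaked implementation.
    type ToolDef = { name: string; description: string };

    function withDecoys(realTools: ToolDef[]): ToolDef[] {
      const decoys: ToolDef[] = [
        { name: "fs_probe_v2", description: "Fake tool; exists only to poison scraped traffic." },
      ];
      // Anyone training on recorded requests ingests tools the agent never had.
      return [...realTools, ...decoys];
    }

    // The Hacker News critique in one sentence: a proxy that re-serializes
    // requests and drops unrecognized tool names removes the poison entirely.
    ```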

    Undercover Mode, implemented in roughly 90 lines of undercover.ts, strips all traces of Anthropic when Claude Code contributes to external repositories. It suppresses codenames, Slack channels, and the phrase “Claude Code” in commits and PRs. The code comment reads: “There is NO force-OFF.” You can enable it manually, but you cannot disable it. In external builds, the function is dead-code-eliminated entirely. This means AI-authored contributions from Anthropic employees in open-source projects carry no indication that an AI wrote them. The disclosure implications are obvious and, in the MCP-connected ecosystem Anthropic is building, they extend to every tool in the chain.
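
    The “no force-OFF” property is a one-way flag, a pattern worth seeing in miniature. This is a reconstruction of the shape, not the actual undercover.ts:

    ```ts
    // One-way flag, reconstructing the described Undercover Mode shape.
    // Not the actual undercover.ts.
    let undercover = false;

    export function enableUndercover(): void {
      undercover = true; // no disable function exists: "There is NO force-OFF."
    }

    export function scrubAttribution(text: string): string {
      return undercover ? text.replaceAll("Claude Code", "") : text;
    }

    // In external builds the enabling code path is compile-time false,
    // so bundlers dead-code-eliminate the whole feature.
    ```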

    Less discussed but equally revealing: a file called print.ts is 5,594 lines long and contains a single function spanning 3,167 lines with 12 levels of nesting. A compaction bug was wasting 250,000 API calls per day before someone added a three-line fix. Claude Code generates $2.5 billion in annualized revenue and 80% comes from enterprise customers. Those customers are partly paying for the belief that the code powering their AI tools is well-engineered. The leak complicates that assumption.

    What Happens Next

    The code is out. Anthropic filed DMCA takedowns and GitHub complied, but a mirror at Gitlawb remains live with a public message saying it will never be taken down. The strategic damage exceeds the code damage. You can refactor source in a week. You cannot un-leak a roadmap. Competitors now know about KAIROS, ULTRAPLAN, the anti-distillation flags, and the model codenames. Those are product strategy decisions that Cursor, GitHub Copilot, and every other AI coding tool can now plan around.

    For developers who use Claude Code daily, the practical question is simpler. When KAIROS ships, will you give an AI agent persistent background access to your entire project? The engineers I work with are split. The productivity promise is enormous. The trust model is unresolved.

    Consider what KAIROS means for the broader ecosystem. If Anthropic ships a persistent agent that monitors your codebase around the clock, every competitor will follow. GitHub Copilot, Cursor, Windsurf, and every other AI coding tool will face pressure to match that capability or lose users who want always-on assistance. The industry will move from “AI that helps when asked” to “AI that acts when it decides to” across the entire developer toolchain. That transition changes the security posture of every software project that adopts it. Every codebase becomes a live target not just for external attackers but for the agent’s own judgment errors compounding overnight while nobody watches.

    The company asking developers to trust that transition just accidentally published its entire source code because someone forgot a line in .npmignore. That irony is not lost on anyone paying attention. And it will not be forgotten when KAIROS ships.

  • Zuckerberg Shipped Code for the First Time in 20 Years. He Used a Competitor’s AI.

    Mark Zuckerberg shipped three diffs to Meta’s monorepo in March 2026, his first code contributions in roughly twenty years. One of them collected more than 200 approvals from engineers who apparently found it thrilling to click “approve” on the CEO’s pull request. His tool of choice: Claude Code CLI, Anthropic’s terminal-based AI coding assistant. Not GitHub Copilot. Not Meta’s internal AI tools. A competitor’s product.

    Three diffs from the CEO of a 70,000-person engineering company is a footnote in a monorepo that processes 100 million changes. The code itself is irrelevant. The behavior is not.

    The Pattern Nobody Is Talking About

    Zuckerberg is not the only executive who stopped coding years ago and recently started again. Garry Tan, CEO of Y Combinator, returned to writing code after a 15-year hiatus. He released gstack, a Claude Code system with 23 specialist tools that turns the terminal into what Tan describes as a virtual engineering team: code reviewer, QA lead, security auditor, release engineer. Tobias Lutke, CEO of Shopify, has been running experiments with Andrej Karpathy’s AutoResearch on internal company data. He posted that he built a working prototype in a weekend that would have taken his team weeks.

    There is a specific shape to all three stories. Someone who used to code, stopped because their role changed, and discovered that AI tools collapsed the distance between “I know what I want to build” and “I can build it myself.” The gap was never about intelligence. It was about context. To contribute to a modern codebase, you need to understand the dependency graph, the test infrastructure, the deployment pipeline, the linter configuration, the API contracts, and a thousand accumulated conventions that exist nowhere except in the heads of people who work in that codebase daily. AI coding agents absorb that context by reading the codebase directly. They compress months of onboarding into minutes of indexing.

    That compression does not help only CEOs. It helps every person who has the judgment to know what should be built but lacks the hours to maintain fluency in a specific codebase. Product managers. Designers with technical backgrounds. Founders who became full-time fundraisers. Researchers who stopped writing production code when their teams grew. The disruption is not “AI replaces developers.” It is “AI re-opens development to people who left.”

    Meta’s Internal Numbers

    The Zuckerberg anecdote would be a curiosity if it existed in isolation. It does not. Leaked internal documents from March 2026, reported by The Pragmatic Engineer, show aggressive AI-code targets across Meta’s engineering organization.

    Meta’s creation org wants 65% of engineers writing 75% or more of their committed code using AI by mid-2026. The Scalable Machine Learning org set a target of 50 to 80% AI-assisted code. These are not aspirational slide-deck numbers. They are organizational targets with headcount implications.

    Zuckerberg told Dwarkesh Patel’s podcast that “in the next year, maybe half the development will be done by AI as opposed to people, and that will kind of increase from there.” He is not predicting this from a boardroom. He is using Claude Code in his terminal to ship diffs to the monorepo. The CEO is the pilot customer for his own company’s transition.

    Meta’s AI code adoption leader, Michael Novati, has been called “The Coding Machine” internally. His team built internal tooling that routes AI-assisted code through the existing review pipeline, so the quality gates remain human even when the generation is automated. The critical design decision: Meta did not create a separate review process for AI-written code. It runs through the same code review, the same CI/CD, the same test suites. The human is the reviewer, not the writer.

    Why Claude Code and Not Copilot

    The fact that Zuckerberg chose Anthropic’s tool over both GitHub Copilot and Meta’s own internal AI coding infrastructure deserves more scrutiny than it has received.

    Claude Code is a terminal-native agent. It reads your entire project, understands the file structure, runs commands, writes tests, executes them, and iterates. Copilot’s core product is inline autocomplete inside an editor. The difference matters for someone who has not opened an IDE in twenty years: Claude Code operates at the level of “describe what you want and I will figure out how to build it,” while Copilot operates at the level of “write the next line of this function.” The former serves someone who thinks in product terms. The latter serves someone who thinks in code terms.

    For Meta, there is an uncomfortable implication. The company has invested billions in AI research, shipped Llama models that power a growing open-source ecosystem, and built internal code-generation tools. Its CEO chose a competitor’s product anyway. That is a signal about product-market fit. Claude Code found the gap between “I am technical enough to know what to build” and “I do not have time to write it myself,” and it closed that gap before anyone else did.

    The Model Context Protocol’s 97 million installs in 16 months created the infrastructure for this moment. MCP lets Claude Code connect to any tool, any API, any data source through a standard interface. That protocol-level advantage means Claude Code can read your Jira tickets, check your CI pipeline, and query your database without custom integration. Copilot cannot do that without GitHub-specific extensions.
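
    The protocol’s pitch is that exposing a data source takes a few dozen lines. A minimal sketch with the public TypeScript SDK; the jira_ticket tool is a hypothetical stand-in for whatever your team actually runs:

    ```ts
    // Minimal MCP server using the public TypeScript SDK.
    // The jira_ticket tool is a hypothetical stand-in for any data source.
    import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
    import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
    import { z } from "zod";

    const server = new McpServer({ name: "team-context", version: "1.0.0" });

    server.tool("jira_ticket", { id: z.string() }, async ({ id }) => ({
      content: [{ type: "text" as const, text: `Ticket ${id}: fetch from your tracker here` }],
    }));

    // Any MCP-speaking agent, Claude Code included, can now call jira_ticket
    // without custom integration work.
    await server.connect(new StdioServerTransport());
    ```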

    The Uncomfortable Question for Engineering Managers

    If 65% of engineers are writing 75% of their code with AI by mid-2026, what does the engineering team look like in 2027?

    The charitable version: engineers shift from writing code to reviewing code, designing systems, and defining constraints. The codebase improves because more human attention goes to architecture and less goes to implementation. Junior developers learn faster. Senior developers spend less time on boilerplate. Everyone wins.

    The version that keeps engineering managers awake at night: companies that hit the 75% AI-assisted target will discover that some roles were primarily about code production rather than code judgment. A Google engineer recently said that Claude Code built in one hour what her team spent a year on. That is a productivity claim. It is also a headcount claim, and everyone in the room knew it. The tool does the work of a team, so the team gets smaller. Not tomorrow, because AI-generated code still needs human review and the security surface of AI coding tools is genuinely alarming. But the trajectory only goes one direction.

    Goldman Sachs estimated that AI adoption among firms with more than 250 employees reached 35.3% in early 2026. Academic studies cited in its April report put the average productivity uplift from generative AI at 23%, with company-reported gains closer to 33%. Construction jobs tied to data center buildouts increased by 212,000 since 2022. Meanwhile, corporate layoffs directly attributed to AI remain small: 4,600 employees in February 2026.

    The gap between “AI makes us more productive” and “AI reduces headcount” has not closed yet. But the CEOs are not waiting for it to close. They are already coding.

    What Actually Changed

    The interesting question is not \”why are CEOs coding again?\” It is what technical capability made this possible now and not two years ago.

    Context windows got big enough. Claude Opus 4.6 supports 200K tokens natively. GPT-5.4 pushed to one million tokens. That is enough to hold thousands of files in memory simultaneously, which means the agent can reason about cross-file dependencies, understand architectural patterns, and generate code that fits the existing codebase rather than autocompleting the current line. The CEO does not need to know the codebase. The agent reads it.

    And tool use became reliable. The agent runs the linter. Executes the tests. Reads the error output. Fixes the failures. Commits the result. That closed-loop execution is what separates “AI suggests code” from “AI ships code.” A CEO who types “write tests for the auth module, run them, and fix any failures” gets a working result, not a clipboard full of suggestions that still require a developer to wire together.
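    A minimal sketch of that loop, with a hypothetical generate_fix callable standing in for the model:

    ```python
    # Closed-loop "generate, run, read errors, fix" sketch.
    # generate_fix() is a hypothetical stand-in for a model call that takes
    # the current source plus failing test output and returns patched source.
    import subprocess
    from pathlib import Path

    def run_tests() -> subprocess.CompletedProcess:
        # One quiet pytest run, stopping at the first failure.
        return subprocess.run(["pytest", "-x", "-q"], capture_output=True, text=True)

    def closed_loop(path: str, generate_fix, max_iters: int = 5) -> bool:
        src = Path(path)
        for _ in range(max_iters):
            result = run_tests()
            if result.returncode == 0:
                return True  # green: the loop ships, not suggests
            errors = result.stdout + result.stderr
            src.write_text(generate_fix(src.read_text(), errors))
        return False  # still red after the iteration budget
    ```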

    Karpathy distilled this into a design principle with AutoResearch: constrain the agent to one file, one metric, one five-minute cycle. The constraint is the invention. By limiting scope, you get reliable execution instead of ambitious hallucination. Lutke ran it on Shopify data overnight. Marketers adapted it for landing pages. The pattern scales because the constraint scales.
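    The envelope is easy to express in code. A sketch of the constraint as described, where propose_edit and score are hypothetical stand-ins for the model call and the single scalar metric:

    ```python
    # One file, one metric, one five-minute cycle.
    import time
    from pathlib import Path

    def constrained_cycle(path: str, propose_edit, score, budget_s: int = 300) -> float:
        target = Path(path)                      # one file
        deadline = time.monotonic() + budget_s   # one five-minute cycle
        best_text = target.read_text()
        best = score(best_text)                  # one metric
        while time.monotonic() < deadline:
            candidate = propose_edit(best_text, best)
            s = score(candidate)
            if s > best:                         # keep only measurable wins
                best, best_text = s, candidate
        target.write_text(best_text)
        return best
    ```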

    Where This Breaks

    The CEOs-coding-again story has a failure mode that the feel-good coverage omits. When a non-expert uses AI to ship code, the code works until it does not. The AI generates plausible solutions that pass tests and satisfy requirements while containing subtle architectural decisions that compound into maintenance debt. The MAD Bugs initiative found 500+ zero-day vulnerabilities in mature, battle-tested open-source code. AI-generated code that has never been battle-tested will contain more vulnerabilities, not fewer.

    The Ledger CTO, Charles Guillemet, put it directly on April 5: “There is no ‘make it secure’ button. We are going to produce a lot of code that will be insecure by design.” That warning is aimed at the exact workflow these CEOs are celebrating. Generate fast, ship fast, discover the security hole later.

    The honest version of this story is not that AI made coding easy. It is that AI shifted the bottleneck. The bottleneck used to be writing code. Now it is reviewing code, maintaining code, and securing code. Those are the skills that become more valuable as AI writes more of the first draft. The CEOs who recognize that distinction will build better companies. The ones who think “I can code again” means “I do not need as many engineers” will learn an expensive lesson about the difference between generating software and operating it.

  • Anthropic Paid $400 Million for Ten People. Here Is What It Actually Bought.

    Acquisition price (stock): $400M · Employees acquired: <10 · Company age at sale: 8 months · Dimension’s IRR: 38,513%

    Anthropic paid $400 million in stock for a company with fewer than ten employees, no product, no revenue, and no publicly known customers. Coefficient Bio was eight months old. Its venture backer, Dimension, is reporting a 38,513 percent internal rate of return on the deal. That number tells you more about the current AI valuation environment than it does about Coefficient Bio’s technology.

    But the deal tells you something about Anthropic. And what it tells you is not the story most outlets are running.

    What Anthropic Actually Bought

    Coefficient Bio was founded around August 2025 by Samuel Stanton and Nathan C. Frey, both from Prescient Design, Genentech’s computational drug discovery unit. Frey led a team there working on biological foundation models and novel machine learning approaches to biomolecule design. Stanton focused on probabilistic modeling for autonomous scientific agents. The startup described its mission as building “artificial superintelligence for science.”

    That phrase is marketing. The reality is more specific and more interesting. What Stanton and Frey built at Genentech was not a drug discovery pipeline. It was decision infrastructure: systems that help researchers decide which targets to pursue, which assays to trust, which regulatory strategies to adopt, and which evidence contradicts which hypotheses. Drug companies do not fail because they cannot generate candidate molecules. They fail because the decision loop between “we have a promising result” and “we are confident enough to spend $2 billion on Phase III trials” takes years and relies on human judgment operating under uncertainty across dozens of competing information sources.

    That is the layer Anthropic wants. Not the molecule. The judgment.

    The Decision Layer Strategy

    Eric Kauderer-Abrams, who leads Anthropic’s Healthcare and Life Sciences group, said the quiet part out loud in October 2025 when Anthropic launched Claude for Life Sciences: “We want a meaningful percentage of all of the life science work in the world to run on Claude, in the same way that that happens today with coding.”

    Read that again. Anthropic wants Claude to become the operating layer where scientific evidence gets converted into organizational decisions. A control plane for regulated knowledge work. That market dwarfs “AI discovers drugs.”

    Claude for Life Sciences already connects to Benchling (lab notebooks), PubMed (literature), ClinicalTrials.gov (trial data), 10x Genomics (single-cell data), and Medidata (clinical trial management). In January 2026, Anthropic launched Claude for Healthcare at the J.P. Morgan Healthcare Conference with HIPAA-ready products. Sanofi told reporters that the majority of its employees use Claude daily. Novo Nordisk and AbbVie are also signed on.

    The Coefficient Bio team brings something those enterprise partnerships cannot: researchers who spent years inside the actual decision loop at a top-tier pharma R&D operation. They know which decisions take three months and should take three days. They know where the evidence bottlenecks are. That institutional knowledge is what costs $40 million per person, because you cannot hire it off LinkedIn and you cannot train a model to simulate it without the people who lived it.

    Why the Math Looks Absurd Until You See the Context

    Four hundred million dollars for fewer than ten people. That headline writes itself, and every outlet ran it. But against Anthropic’s financials, the number barely registers.

    Anthropic closed a $30 billion Series G in February 2026 at a $380 billion post-money valuation. The Coefficient Bio acquisition represents approximately 0.1% dilution. Anthropic’s annualized revenue surged from roughly $1 billion at the start of 2025 to $5 billion by August 2025, with internal forecasts targeting up to $18 billion in 2026. Claude Code alone crossed $1 billion in annualized revenue. Anthropic expects to spend about $12 billion training models and $7 billion running them in 2026.

    Against those numbers, $400 million in stock to acquire the team best positioned to build life sciences AI tooling is a line item. Anthropic spent more on compute last quarter than it spent on this entire company. The real question: can the team build something that generates recurring revenue from pharmaceutical companies whose individual R&D budgets exceed $10 billion annually?

    The precedent favors Anthropic’s competitors in one respect: all of them have been at this longer. Google DeepMind spun off Isomorphic Labs years ago to pursue AI-designed drug candidates, and those candidates are only now entering human trials. NVIDIA signed a $1 billion partnership with Eli Lilly in January for AI drug discovery. Eli Lilly separately signed a $2.75 billion licensing deal with Insilico Medicine in March 2026. OpenAI has been working with Moderna on personalized cancer vaccines. The total capital committed to AI-pharma partnerships in Q1 2026 alone exceeds $4 billion.

    None of those deals target the same layer. Isomorphic Labs designs molecules. Insilico generates candidates. Moderna uses AI for vaccine optimization. Anthropic wants the infrastructure that pharmaceutical companies use to make every decision surrounding drugs: target selection, evidence synthesis, trial design, regulatory submission. That strategy sounds boring next to “AI discovers a cure.” It also generates recurring revenue, creates switching costs, and applies to every therapeutic area instead of one molecule at a time.

    The Skeptic’s Case

    Coefficient Bio was eight months old. It had no product, no revenue, and no publicly documented clinical or commercial outcomes. The entire acquisition valuation is based on the team’s credentials and Anthropic’s willingness to pay a premium for domain-specific talent during a period when AI valuations are running at historically unprecedented levels.

    Dimension’s 38,513% IRR is an artifact of investing early in a company that got acquired at AI-inflated prices before it had to prove anything. That return would be impressive if it reflected product-market fit. It reflects timing. Every LP deck Dimension circulates for the next three years will feature that number, probably on slide two, and nobody reading it will ask what Coefficient Bio’s product was. (There was no product.)

    Pharmaceutical companies are famous for being slow adopters. Enterprise sales cycles in pharma run 12 to 24 months. Regulatory requirements mean that any AI tool touching clinical decisions needs validation, audit trails, and compliance infrastructure that takes years to build. Anthropic can ship a connector to PubMed in a week. Getting a pharma company to trust that connector with decisions about billion-dollar trials is a different problem entirely.

    This is where Coefficient Bio’s Genentech heritage earns its premium. Prescient Design built production systems inside a company where regulatory scrutiny is a daily operating condition. Stanton’s probabilistic models for autonomous scientific agents were tested against the actual decision workflows that govern whether Genentech advances a drug candidate to the next stage. Frey’s biological foundation models were benchmarked against real experimental outcomes, not leaderboard metrics. That operational credibility is what Anthropic needs to sell Claude into environments where the consequences of a wrong answer are measured in clinical trial failures, not chatbot hallucinations.

    The FDA completed an AI-assisted scientific review pilot and announced agency-wide rollout, which normalizes AI inside the regulatory apparatus. But normalizing AI does not mean trusting any specific vendor’s AI. Anthropic still needs to demonstrate that Claude’s outputs in life sciences are accurate, auditable, and reliable enough for regulated environments where errors have consequences measured in patient outcomes, not just lost revenue.

    What This Signals About Anthropic’s Direction

    In December 2025, Anthropic acquired Bun, the JavaScript runtime. In February 2026, it acquired Vercept for computer-use capabilities. Now Coefficient Bio for life sciences. The pattern is acqui-hires in domains where Anthropic wants to build vertical products on top of its foundation models.

    This is a company that has leaked its own frontier model through a CMS misconfiguration, restructured its entire subscription pricing model, and built MCP into a 97-million-install protocol in 16 months. The speed of expansion suggests Anthropic is racing to become the default AI platform for regulated industries before competitors wake up to where the real money lives: decision infrastructure that enterprises pay for monthly because switching costs make it permanent.

    If you are a developer or researcher building AI tools for life sciences, the Coefficient Bio deal reshapes the competitive picture. Anthropic now has domain experts from one of the top computational biology teams in the world embedded inside its product organization. Whatever they build will ship on the same platform that already has enterprise contracts with three of the world’s largest pharmaceutical companies. Competing with that requires either comparable domain expertise or a fundamentally different approach to the problem.

    Four hundred million for ten people sounds like a punchline. Look closer and you see what Anthropic actually acquired: the judgment of researchers who spent years making the exact decisions that AI needs to learn how to make. Whether that judgment translates into product depends on execution. Whether $400 million was the right price depends on whether you believe the alternative was hiring the same expertise one person at a time over three years while competitors moved first. Anthropic chose speed. Give it 18 months. If Claude becomes the default interface for evidence synthesis in pharmaceutical R&D, the punchline becomes a case study.

  • DeepSeek V4 Will Run Entirely on Huawei Chips. The R2 Failure That Made It Possible.

    Total parameters: ~1T · Active per token: 37B · Input price: $0.14/M tokens · NVIDIA GPUs required: 0

    Reuters confirmed on April 4, 2026, that DeepSeek’s next flagship model will run entirely on Huawei’s Ascend chips. Not NVIDIA. Not AMD. Huawei. The roughly one-trillion-parameter V4 is the first frontier AI model built from the ground up for Chinese silicon, and it arrives after months of quiet engineering that most coverage has ignored: a failed training run on Huawei hardware, a forced retreat to NVIDIA, and a second attempt that appears to have worked.

    Alibaba, ByteDance, and Tencent have pre-ordered hundreds of thousands of Huawei Ascend 950PR chips to serve V4 through their cloud platforms. The demand pushed chip prices up 20% in weeks. DeepSeek deliberately withheld early model access from NVIDIA and AMD, giving that window exclusively to Chinese chip manufacturers. The release, expected in the last two weeks of April, will test whether the U.S. semiconductor export strategy can survive contact with architectural cleverness.

    The Architecture That Makes Weaker Hardware Viable

    DeepSeek V4 uses the same Mixture-of-Experts (MoE) design that made V3 surprisingly efficient, but scaled dramatically. The model contains approximately one trillion total parameters, organized into 256 expert sub-networks plus one shared expert. On any given token, only about 37 billion parameters activate. The routing mechanism selects the top eight experts per token, which means V4 processes each input like a 37B model while drawing on the knowledge encoded across one trillion parameters.
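    A toy version of that routing mechanism in NumPy. The dimensions and random weights are illustrative only; this mirrors the mechanism described above, not DeepSeek’s implementation.

    ```python
    # Top-k expert routing sketch: 256 routed experts plus one shared
    # expert, with only the top 8 active per token.
    import numpy as np

    N_EXPERTS, TOP_K, D = 256, 8, 64
    rng = np.random.default_rng(0)
    experts = [(rng.standard_normal((D, D)) / np.sqrt(D),
                rng.standard_normal((D, D)) / np.sqrt(D)) for _ in range(N_EXPERTS)]
    shared = (rng.standard_normal((D, D)) / np.sqrt(D),
              rng.standard_normal((D, D)) / np.sqrt(D))
    router = rng.standard_normal((D, N_EXPERTS)) / np.sqrt(D)

    def ff(w, x):
        w1, w2 = w
        return np.maximum(x @ w1, 0.0) @ w2   # tiny ReLU feed-forward "expert"

    def moe_layer(x):
        logits = x @ router                    # router score for each expert
        top = np.argsort(logits)[-TOP_K:]      # only 8 of 256 experts fire
        gates = np.exp(logits[top])
        gates /= gates.sum()                   # normalize over selected experts
        out = ff(shared, x)                    # the shared expert is always on
        for g, i in zip(gates, top):
            out = out + g * ff(experts[i], x)  # most parameters stay idle
        return out

    print(moe_layer(rng.standard_normal(D)).shape)  # (64,)
    ```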

    This sparsity is what makes Huawei hardware viable. A dense one-trillion-parameter model would require compute that the Ascend 910C cannot deliver competitively. But when 96% of the model sits idle on any given forward pass, the performance gap between Ascend and NVIDIA’s H100 shrinks from disqualifying to manageable. DeepSeek’s engineers are compensating for slower individual chips through software optimization rather than brute-force hardware performance.

    Beyond the MoE scaling, V4 introduces Engram, a conditional memory system described in a January 2026 paper. Traditional transformers compress all learned knowledge into neural network weights and re-derive relationships through attention computation on every pass. Engram breaks that assumption. It adds a lookup-based memory layer that stores static factual knowledge separately. The model calls on expensive neural processing only for novel reasoning. Consider the phrase \”New York City.\” A standard transformer has to learn that those three tokens form a specific entity, then rebuild that relationship every single time. Engram stores it once and retrieves it for free. DeepSeek’s internal tests show this pushed Needle-in-a-Haystack retrieval accuracy from 84% to 97% across the full one-million-token context window.
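    Based only on that description, the core pattern reduces to lookup-before-compute. A deliberately naive sketch; Engram’s actual representation is not public.

    ```python
    # Lookup-before-compute sketch: static n-gram knowledge retrieved from
    # a table, with the expensive neural path reserved for anything the
    # table misses. Entirely illustrative.
    import numpy as np

    D = 64
    memory: dict[tuple[str, ...], np.ndarray] = {
        ("new", "york", "city"): np.ones(D),  # stored once, retrieved for free
    }

    def embed(ngram: tuple[str, ...], neural_path) -> np.ndarray:
        if ngram in memory:
            return memory[ngram]   # O(1) lookup, no attention pass needed
        return neural_path(ngram)  # expensive path only for novel input

    fallback = lambda ng: np.zeros(D)  # stand-in for the transformer path
    print(embed(("new", "york", "city"), fallback)[0])  # 1.0, from the table
    ```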

    The context window itself jumped from 128K to one million tokens. At that scale, the KV cache memory problem dominates inference cost. DeepSeek’s Multi-Head Latent Attention (MLA), introduced in V2 and refined through V3, compresses key-value information into smaller representations. Combined with Engram, V4 can process roughly 800 pages of text in a single pass without the memory explosion that would make a dense architecture impossible on Ascend hardware.
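    A sketch of the compression idea in that description, with illustrative shapes. The real MLA projections are more involved; the point is that the per-token cache shrinks from full keys and values to one small latent.

    ```python
    # Latent KV compression sketch: cache 128 floats per token instead of
    # 2 x 1024, and reconstruct keys/values at attention time.
    import numpy as np

    D, D_LATENT = 1024, 128
    rng = np.random.default_rng(0)
    W_down = rng.standard_normal((D, D_LATENT)) / np.sqrt(D)
    W_up_k = rng.standard_normal((D_LATENT, D)) / np.sqrt(D_LATENT)
    W_up_v = rng.standard_normal((D_LATENT, D)) / np.sqrt(D_LATENT)

    def cache_token(h):
        return h @ W_down  # one small latent per token goes into the KV cache

    def attend(q, latents):
        K = latents @ W_up_k                  # reconstruct keys on the fly
        V = latents @ W_up_v                  # reconstruct values on the fly
        scores = np.exp(q @ K.T / np.sqrt(D))
        return (scores / scores.sum()) @ V

    latents = np.stack([cache_token(rng.standard_normal(D)) for _ in range(16)])
    print(attend(rng.standard_normal(D), latents).shape)  # (1024,)
    ```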

    V4 also adds native multimodal input, accepting text, images, and code within the same context window. No image or video quality benchmarks exist yet. The multimodal capability appears secondary to V4’s real design target: long-context coding and software engineering.

    The R2 Failure That Preceded This

    The V4-on-Huawei story reads differently when you know that DeepSeek already tried this with R2 and it did not work.

    According to the Financial Times, DeepSeek initially attempted to train its R2 reasoning model on Huawei’s Ascend 910C chips. The training runs failed repeatedly. The problems were not individual hardware defects. They were systemic gaps in Huawei’s software stack. CANN (Compute Architecture for Neural Networks), Huawei’s answer to NVIDIA’s CUDA, lacked the maturity required for distributed training across thousands of interconnected chips. Inter-chip communication latency caused synchronization failures. Memory consistency errors corrupted training progress. Completed training steps were lost and had to be rerun.

    Huawei dispatched senior engineers to DeepSeek’s training center to troubleshoot on-site. The problems persisted. DeepSeek ultimately abandoned Huawei hardware for R2 training and reverted to NVIDIA GPUs, relegating Ascend chips to inference-only duties. The delays pushed R2’s timeline back by months.

    What changed between R2’s failure and V4’s planned launch? DeepSeek spent Q1 2026 collaborating with Huawei and Cambricon Technologies to rewrite core model code for CANN compatibility. The engineers did not just port existing code. They reimplemented components of MLA and the expert routing system to account for the performance characteristics of Ascend hardware. This is not an optimization pass. It is a full re-architecture that treats Huawei silicon as the primary target rather than a fallback.

    The Ascend 950PR, the chip at the center of V4’s deployment, reportedly delivers approximately 2.8 times the compute of NVIDIA’s H20 (the restricted chip China can still import), though it falls short of the H200. DeepSeek’s bet is that the 950PR combined with V4’s sparse architecture and custom software will close the remaining gap.

    What the Export Controls Were Supposed to Prevent

    The strategic logic of U.S. semiconductor export restrictions assumed that cutting China off from NVIDIA’s top-tier GPUs would slow frontier AI development. The assumption had a specific dependency chain: frontier models require frontier hardware, and frontier hardware requires TSMC fabrication that Huawei cannot access for its most advanced designs.

    DeepSeek V4 breaks two links in that chain simultaneously. The MoE architecture reduces the raw compute needed per token by approximately 96%, making frontier-class models trainable on hardware that would be insufficient for dense architectures. And the deliberate exclusion of NVIDIA from early optimization access signals that DeepSeek is building its entire software stack around a supply chain that U.S. policy cannot reach.

    IDC estimates that Chinese chipmakers captured 41% of China’s AI accelerator market in 2025. Alibaba, ByteDance, and Tencent ordering hundreds of thousands of Ascend 950PR chips converts that market share into infrastructure. If V4 delivers on its benchmark claims (80%+ SWE-bench Verified, 90% HumanEval, competitive with Claude Opus 4.6 and GPT-5.4), the result is a complete parallel AI stack: Chinese models trained on Chinese chips, optimized for Chinese cloud infrastructure, available at roughly 20 to 50 times lower cost than Western alternatives.

    NVIDIA halted China-bound H200 production in early March 2026 and shifted TSMC capacity allocation to its next-generation Vera Rubin architecture. The move acknowledges that China revenue, which peaked at $5.5 billion annualized before export restrictions, is structurally gone. The substitute demand from U.S. hyperscalers is already capacity-constrained. When DeepSeek released V3 in late 2024, it erased $589 billion from NVIDIA’s market cap in a single trading session. V4 on Huawei hardware extends that pressure from a stock-market shock to a structural question about NVIDIA’s long-term addressable market.

    What Has Not Been Verified

    DeepSeek’s benchmark claims for V4 come from internal tests only. No independent evaluation has confirmed the 80%+ SWE-bench or 90% HumanEval numbers. DeepSeek’s V3 benchmarks largely held up under third-party scrutiny, but V4’s architecture is different enough that prior credibility does not transfer automatically.

    The multimodal capabilities have no public benchmarks at all. DeepSeek’s image and video generation quality is unknown. The Financial Times described V4 as having “picture, video and text-generating functions,” but no reviewer has tested them.

    The Ascend 950PR’s real-world training and inference performance at scale remains undisclosed. Huawei’s claim of 2.8x the H20 is a spec-sheet number. As the TurboQuant episode demonstrated, spec-sheet numbers and production performance can diverge sharply when software hits real hardware. The R2 training failure on earlier Ascend hardware is a concrete reminder that CANN’s maturity remains the binding constraint.

    V4 has been delayed twice already. The February and March release windows both passed. V4-Lite appeared on DeepSeek’s website on March 9 with reported 30% faster inference and 94% context recall at 128K tokens (up from 45%), which suggests incremental rollout rather than a single launch event. The “last two weeks of April” timeline is the best current estimate, but treat it with appropriate uncertainty.

    What Happens When V4 Drops

    If V4 matches its claimed performance while running exclusively on Chinese silicon, the consequences ripple in multiple directions at once.

    The open-source cost floor drops again. V4 will almost certainly ship under Apache 2.0 or MIT, consistent with DeepSeek’s prior models. Projected API pricing of $0.14 per million input tokens is roughly 100x cheaper than Claude Opus 4.6. For developers outside both the U.S. and China, this creates a genuine choice that did not exist 18 months ago: open-weight, downloadable, consumer-hardware-friendly models that compete with closed frontier systems on actual benchmarks.

    Meanwhile, the global AI hardware market splits in two. A U.S.-centric stack built on NVIDIA, CUDA, and the big three cloud providers increasingly serves different models and different customers than a China-centric stack built on Huawei Ascend, CANN, and Alibaba Cloud. Developers building for both markets will need to test on both hardware ecosystems. Nobody wins from that fragmentation except the companies selling shovels on each side.

    And the $297 billion that flowed into AI in Q1 2026 looks different if the price of frontier inference drops by another order of magnitude. Companies paying $15 to $30 per million output tokens for GPT-5.4 should benchmark V4 before their next contract renewal. The question has moved past whether Chinese open-source models can compete on quality. The question now is whether Western closed models can justify their pricing when the open alternative runs on hardware that no export control can touch.

    The AI chip race is no longer about who makes the fastest chip. It is about who can make a fast-enough chip and pair it with architecture clever enough to close the gap. DeepSeek’s bet is that sparsity beats silicon. The next two weeks will show whether that bet holds.

  • OpenAI Lost Three Executives in One Day. The $852 Billion IPO Moves Forward Anyway.

    Valuation: $852B · Execs out: 3 · Funding round: $122B · Users: ~1B

    OpenAI’s chief operating officer shifted out of his role on April 3, 2026. The same day, the head of AGI development announced medical leave. The chief marketing officer stepped down for cancer treatment. Three of the company’s most senior executives exited the operating structure in a single news cycle, days after closing a $122 billion funding round that valued the company at $852 billion. The largest tech IPO in history is expected later this year.

    Brad Lightcap, OpenAI’s longtime COO, moved into a new “special projects” role reporting directly to Sam Altman. The internal memo, first reported by Bloomberg, says Lightcap will focus on selling enterprise software through joint ventures with private equity firms. Denise Dresser, recently appointed as chief revenue officer, absorbed some of his operational duties. This is not a departure. It is a demotion rebranded as a lateral move, executed the same week the company’s headcount approached 3,000 and its commercial operations entered their most complex phase.

    Fidji Simo, who oversaw AGI development and product strategy as CEO of the applications division, took leave to treat a neuroimmune condition. She has managed postural orthostatic tachycardia syndrome throughout her career. Her internal memo acknowledged she had postponed medical tests and new therapies to stay focused on work. Greg Brockman, OpenAI’s co-founder and president, took over product operations during her absence. Jason Kwon (Chief Strategy Officer), Sarah Friar (CFO), and Dresser split the remaining responsibilities.

    Kate Rouch, the CMO, stepped down for cancer recovery. A search for her replacement has begun.

    What Simo Was Building

    Simo’s absence matters more than the other two because she was the architect of OpenAI’s product consolidation. In recent weeks, she pushed the company to collapse its sprawling mix of services into a single “Super App” that combines the chatbot, coding tool, and web browser. She called for dropping “side quests,” a label that preceded the company discontinuing support for Sora, the AI video generator. She also oversaw the push to test advertising inside ChatGPT, a revenue diversification play that signals OpenAI’s subscription-and-API model alone may not sustain its cost structure at the current burn rate.

    The Super App strategy is a direct response to a product fragmentation problem. OpenAI currently ships ChatGPT (consumer chat), Codex (developer tool), an integrated web browser, an image generator, a voice interface, and enterprise APIs, each with separate interfaces and partially overlapping capabilities. Simo’s plan was to unify them into a single product surface. With her on medical leave and no announced return date, the consolidation timeline is unclear. Brockman is a technical co-founder, not an operations executive. His product instincts differ from those of Simo, who came from running Instacart.

    The IPO Problem

    OpenAI closed a $122 billion round on March 31, 2026, at an $852 billion valuation. Of that, $3 billion came from individual investors. The company is widely expected to file for an IPO later this year, which would make it the largest technology public offering in history. An IPO at this scale requires institutional investors to evaluate management stability, revenue trajectory, and operational continuity. Three simultaneous C-suite disruptions undermine all three.

    The revenue numbers are strong. OpenAI surpassed $25 billion in annualized revenue and is approaching one billion global users. GPT-5.4 scored 75% on OSWorld-V, exceeding the human baseline of 72.4% on desktop productivity tasks. The product is working. The business is growing. The executive bench is not.

    This is not the first time OpenAI has churned leadership. Altman was briefly removed in November 2023. The resulting fallout triggered a wave of board departures and eventually a complete governance restructuring. In 2025, six senior AI researchers left for Meta’s Superintelligence Labs. The company responded by expanding its board and C-suite, hiring experienced operators from outside the AI research world. Simo (Instacart), Rouch (marketing), Dresser (revenue), Friar (finance) were all part of that expansion. Now three of those hires are simultaneously unavailable.

    The Competitive Pressure

    Anthropic is also reportedly preparing a 2026 IPO, with a $380 billion valuation target. Google’s Gemini 3.1 Pro offers frontier performance at aggressive API pricing. The $297 billion in Q1 2026 venture capital is concentrating into fewer companies, raising the stakes for any stumble. OpenAI cannot afford execution gaps while its closest competitors are accelerating.

    The advertising experiment inside ChatGPT adds another dimension. Simo oversaw the initial tests. Advertising revenue could offset the compute cost problem that every AI company faces: serving nearly a billion users at inference costs that grow with usage. But advertising in a trusted AI assistant is a product design minefield. The line between helpful response and sponsored content is blurry by nature. Without Simo steering the implementation, the risk of a poorly executed ad rollout increases, and a backlash from the user base at this stage could damage the IPO narrative.

    The Pattern Nobody Names

    Technology companies approaching IPO regularly experience executive turnover. Workday, Palantir, and Snowflake all reshuffled leadership before going public. The difference is concentration. One executive transitioning before an IPO is routine. Three simultaneous departures, including the person running product strategy, during the final stretch before a public filing is not routine. It is a stress signal.

    The charitable interpretation is that this is cleanup. Lightcap’s move to special projects reflects a natural evolution from startup operations to enterprise sales. Rouch’s departure is a medical necessity unrelated to company dynamics. Simo’s leave is temporary. The less charitable interpretation is that OpenAI’s sprint from nonprofit research lab to $852 billion commercial entity has burned through executive capacity faster than the company can replace it.

    The broader context reinforces the second reading. OpenAI has lost its chief scientist (Ilya Sutskever, 2024), its co-founder and CTO (Mira Murati, 2024), its head of safety (Jan Leike, 2024), and six senior researchers to Meta (2025). The company rebuilt after each departure. But the rebuilding takes months, and the IPO window does not wait.

    Simo said in her memo that she expects to return after a few weeks. If she does, the disruption is temporary. If her condition requires extended treatment, the Super App consolidation and the advertising rollout lose their primary sponsor. The automation of research workflows that OpenAI is pursuing internally suggests the company believes it can operate with fewer humans in the loop. But executive strategy is not yet something you can automate, and the humans setting that strategy are the ones who just left the building.

    Sources: Bloomberg (April 3, 2026). Business Standard. City A.M. Investing.com. Analytics Insight. OpenAI internal memo (viewed by Bloomberg).

  • A Zero-Parameter Algorithm Beats Every Time-Series Foundation Model. It Just Copies From the Context.

    Parameters: zero · Cost difference: 10⁶x · Venue: ICLR 2026 · Beats: all TSFMs

    A zero-parameter algorithm that copies directly from its own input context outperforms every major time-series foundation model on predicting chaotic systems, turbulence, coupled oscillators, and electrocardiograms. It costs six orders of magnitude less to run. The paper, accepted at ICLR 2026, is not proposing a replacement for foundation models. It is exposing what those models actually do when they appear to work, and it is not what anyone assumed.

    Yuanzhao Zhang of the Santa Fe Institute and William Gilpin of the University of Texas at Austin built the simplest possible forecasting algorithm. Given a time series context, scan it for nearly repeating motifs. Find the best match to the current state. Copy whatever came after that match as your prediction. No learned weights. No training data. No gradient descent. The entire algorithm is a nearest-neighbor lookup in delay-coordinate space, executable on a CPU in milliseconds.

    They tested it against Chronos, Chronos Bolt, TimesFM, TimeMoE, and Moirai across chaotic attractors (Lorenz, Rössler, double pendulum), turbulent fluid dynamics, coupled Kuramoto oscillators, and real-world EKG recordings. Context parroting won on both forecast error (sMAPE) and attractor reconstruction fidelity (KL divergence) across every system tested. The computational gap ranged from five to six orders of magnitude.

    How Context Parroting Works

    The algorithm operates in delay-coordinate embedding space, a technique from nonlinear dynamics dating to the 1981 Takens embedding theorem. Given a scalar time series x(t), construct delay vectors by taking D consecutive values: [x(t), x(t+1), …, x(t+D-1)]. Each delay vector represents the state of the system at time t in a D-dimensional space. Takens proved that for a deterministic system with an attractor of dimension d, choosing D greater than 2d reconstructs the topology of the attractor from the scalar measurements alone.

    Context parroting uses this embedding to find the best match to the current state within the context window. The algorithm constructs delay vectors from the entire context, computes the Euclidean distance between the most recent delay vector and every earlier delay vector, finds the nearest neighbor, and copies the trajectory following that neighbor as the forecast. If the nearest neighbor occurred at time t* in the context, the forecast is simply x(t*+1), x(t*+2), and so on for as many steps as needed.
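    The whole method fits in a few lines. A sketch based on the description above; the embedding dimension D and the Euclidean metric are free choices here, not the paper’s tuned values.

    ```python
    # Context parroting: delay-embed the context, find the nearest earlier
    # state, copy what followed it.
    import numpy as np

    def parrot_forecast(x: np.ndarray, horizon: int, D: int = 8) -> np.ndarray:
        T = len(x)
        # Delay vectors: row t is [x(t), x(t+1), ..., x(t+D-1)].
        delays = np.stack([x[t:t + D] for t in range(T - D + 1)])
        current = delays[-1]
        # Exclude matches so recent that the copied "future" leaves the context.
        candidates = delays[: T - D + 1 - horizon]
        dists = np.linalg.norm(candidates - current, axis=1)
        t_star = int(np.argmin(dists)) + D - 1   # time index of best match
        return x[t_star + 1 : t_star + 1 + horizon]  # copy what came next

    # Example: a near-periodic signal, where parroting copies a matching cycle.
    t = np.linspace(0, 40 * np.pi, 4000)
    series = np.sin(t) + 0.3 * np.sin(2.1 * t)
    print(parrot_forecast(series, horizon=50)[:5])
    ```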

    This is mathematically identical to a first-order local model in the sense of Farmer and Sidorowich (1987), one of the foundational methods in nonlinear time series prediction. The difference is that context parroting runs entirely inside the context window with no separate training phase. It is, functionally, an in-context nearest-neighbor algorithm.

    The connection to Takens embedding is not incidental. It is the reason the method works at all. Takens’ theorem guarantees that the delay-coordinate reconstruction preserves the diffeomorphic structure of the original attractor. Nearby points in the reconstruction correspond to nearby points on the true attractor, which means nearby states evolve similarly in time. This is why nearest-neighbor forecasting in delay space produces accurate predictions: it exploits the geometric continuity of the dynamics. Without the embedding theorem, copying from the nearest neighbor would be random guessing. With it, copying is a geometrically principled operation grounded in 45 years of dynamical systems theory.

    Why It Beats Foundation Models

    The paper identifies a specific failure mode shared by TimesFM, TimeMoE, and Chronos Bolt: they systematically underestimate oscillations and converge toward the mean. Given a chaotic system that swings between extremes, the foundation models predict a trajectory that dampens too quickly and settles near the average value. This is consistent with training objectives that minimize average prediction error across diverse datasets. Predicting the mean is the safest strategy for minimizing loss across many different distributions. It is also the wrong strategy for any specific dynamical system.

    Chronos is the exception. It performs well precisely because it implements something close to parroting internally. The paper shows that Chronos frequently copies motifs from the context window when forecasting chaotic systems. When Chronos works, it works because it parrots. When foundation models fail, they fail because they do not parrot enough and instead fall back on mean-convergent predictions learned from pretraining.

    This explains a finding that puzzled the time-series community: large language models trained on text, with no time series in their training data, can sometimes forecast dynamical systems competitively. The mechanism is induction heads, the attention pattern that identifies repeated sequences and copies what follows. Induction heads are a form of context parroting. LLMs can forecast time series not because they understand physics but because they learned to copy repeating patterns from text, and that same copy mechanism transfers to time series.

    The Fractal Dimension Scaling Law

    The paper’s most original contribution is connecting forecast accuracy to the fractal dimension of the underlying attractor. Context parroting works by finding near-recurrences in the context. The Poincaré recurrence theorem guarantees that an ergodic system will eventually return arbitrarily close to any previous state, but the waiting time depends on the dimensionality of the attractor. For a system with correlation dimension d, the expected recurrence time scales as L ~ epsilon^(-d), where epsilon is the matching tolerance and L is the required context length.

    This produces a scaling law: forecast accuracy improves as a power law in context length, with the exponent determined by the fractal dimension of the attractor. Low-dimensional chaotic systems (Lorenz, d approximately 2.05) need shorter contexts for accurate parroting. High-dimensional systems (turbulence, d much larger) need exponentially longer contexts. The paper validates this scaling law empirically across multiple systems and shows it explains previously observed in-context neural scaling laws for time series forecasting.

    The practical implication is quantitative. For a system with known fractal dimension, you can calculate exactly how much context data you need for parroting to reach a target accuracy. This is something no foundation model can tell you because their performance depends on training data composition, not on the mathematical structure of the target system.
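    A worked example of that calculation. Since the proportionality constant in L ~ epsilon^(-d) is system-specific, only ratios are meaningful here.

    ```python
    # Tightening the match tolerance eps by a factor `tighten` multiplies
    # the required context length by tighten**d, per the scaling law above.
    def context_ratio(d: float, tighten: float = 2.0) -> float:
        return tighten ** d

    print(round(context_ratio(2.05), 1))  # Lorenz, d ~ 2.05: ~4.1x more context
    print(round(context_ratio(8.0)))      # higher-dimensional system: 256x more
    ```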

    What This Does Not Mean

    The authors state explicitly that they are not proposing to replace foundation models with context parroting. The value of parroting is as a baseline that reveals gaps. When a foundation model underperforms relative to parroting, it means the model has not learned to use the context data effectively. The failure is not that the model is bad. The failure is that a copy-paste algorithm does better, which means the model is leaving information on the table.

    Context parroting has clear limitations. It assumes stationarity: the underlying dynamics must not change over the forecast horizon. It cannot handle distribution shifts, trend changes, or regime transitions. It struggles with non-stationary real-world time series (weather, financial markets, traffic) where the generating process itself evolves. Foundation models handle simple nonstationarity like baseline drift because their pretraining covers such patterns. The authors suggest generalizing parroting to handle nonstationarity as a future direction.

    The algorithm also requires that the context contains a near-recurrence of the current state. For high-dimensional systems, the context may not be long enough to contain a good match. In these cases, parroting produces poor forecasts and foundation models that generalize from pretraining would outperform it. The fractal dimension scaling law tells you exactly when this happens: when the required context length exceeds the available context window.

    Why This Matters for the Foundation Model Race

    Every major AI lab is building or acquiring time-series foundation models. Google has TimesFM. Salesforce has Moirai. Amazon backed Chronos. The premise is that pretraining on massive time series datasets produces models that generalize to unseen systems. Context parroting challenges that premise by showing that, for an important class of systems, generalization from pretraining adds nothing. The context alone is sufficient.

    This does not kill the foundation model thesis. It narrows it. Time-series foundation models add value when they handle nonstationarity, distribution shifts, and systems where the context window is too short for recurrence. They fail to add value, and actively harm performance, when the system is stationary and the context contains enough recurrences. Knowing which regime you are in determines whether a billion-parameter model is worth its inference cost.

    For practitioners running time series forecasting in production, the actionable takeaway is to benchmark against context parroting before deploying a foundation model. If parroting beats your model, you are paying for compute that is worse than free. If your model beats parroting, you have evidence that pretraining is contributing something beyond pattern copying. Either answer is useful. Not knowing which regime you are in is not.

    The deeper implication connects to a recurring pattern in machine learning: the simplest baseline, properly constructed, often outperforms complex systems that were never tested against it. When the baseline is missing, the community overestimates how much the complex system has learned. Context parroting fills that gap for time series. The question it forces every foundation model team to answer: what, exactly, did you learn from pretraining that a copy-paste algorithm cannot recover from the context alone?

    Sources: arXiv:2505.11349 (Zhang and Gilpin, ICLR 2026). OpenReview (ICLR 2026 acceptance). Takens, “Detecting Strange Attractors in Turbulence” (1981). Farmer and Sidorowich, “Predicting Chaotic Time Series” (1987). Chronos (Ansari et al., 2024). TimesFM (Das et al., 2024). Moirai (Salesforce, 2024).

    Santiago Maniches is a researcher and ML practitioner with a background in geometric and topological methods. He writes about AI mechanisms at mywrittenword.com. LinkedIn · ORCID

  • Claude Built a FreeBSD Kernel Exploit in 4 Hours. The Math That Should Scare Every Defender.

    Exploit time: 4 hours · Zero-days found: 500+ · Firefox bugs: 122 · Cost per exploit: ~$20

    Nicholas Carlini, a research scientist at Anthropic, pointed Claude Opus 4.6 at a FreeBSD kernel vulnerability on March 29, 2026, and walked away from his keyboard. Four hours later, the model had built two working remote root exploits, both succeeding on the first try. The human contribution was 40 prompts. The AI solved six distinct technical problems, from lab setup to shellcode delivery, without assistance. FreeBSD’s security advisory credits “Nicholas Carlini using Claude, Anthropic” for the discovery of CVE-2026-4747.

    This is not an isolated result. The same pipeline, a bash script looping over source files with a one-line prompt, has now produced over 500 validated high-severity zero-day vulnerabilities across production open source codebases. 122 crashing inputs sent to Mozilla for Firefox alone. A 23-year-old Linux kernel NFS vulnerability found in 90 minutes. A blind SQL injection in Ghost CMS that gave unauthenticated users full admin access, the first critical-severity bug in Ghost’s entire history. Carlini presented the results at the [un]prompted AI security conference in San Francisco and announced MAD Bugs (Month of AI-Discovered Bugs), running through April 2026 with new disclosures every few days.

    Every article covering this story leads with the exploit. The exploit is not the story. The story is the math.

    The Six Problems Claude Solved

    CVE-2026-4747 is a stack buffer overflow in FreeBSD’s RPCSEC_GSS authentication module, reachable over the network by any user with a valid Kerberos ticket. FreeBSD patched it on March 26, 2026, with a single bounds check. Going from the advisory to a working root shell required solving six problems that traditionally demand years of kernel security expertise.

    First, Claude set up a FreeBSD virtual machine with NFS, Kerberos, and the vulnerable kernel module configured so the overflow was reachable over the network. It knew the VM needed at least two CPUs because FreeBSD spawns eight NFS threads per CPU, and the exploit kills one thread per attempt. It configured remote debugging so it could read kernel crash dumps. Second, the shellcode did not fit in a single network packet. Claude designed a 15-round delivery strategy: make kernel memory executable, then write shellcode 32 bytes at a time across 14 subsequent packets. Third, it had to deal with FreeBSD 14.x’s lack of KASLR (kernel address space layout randomization), which made addresses predictable but still required constructing a valid ROP chain from known gadgets. Fourth, it built the ROP chain to transition from stack overflow to arbitrary code execution. Fifth, it wrote position-independent shellcode for a reverse shell. Sixth, it packaged everything into a clean Python exploit script that accepts a target IP and callback address.

    FreeBSD 14.x made this easier than a modern Linux kernel would. No KASLR. No stack canaries on integer arrays. These protections would add complexity but not impossibility. At RSAC 2026, former Facebook CSO Alex Stamos estimated that automated shellcode generation bypassing modern processor protections is six months to a year away.

    The Pipeline Is a Bash Script

    The process Carlini described to Thomas Ptacek on the Security Cryptography Whatever podcast is almost comically simple. Pull down a code repository. Run a bash loop across every source file. For each file, send one prompt to Claude Code: “I’m competing in a CTF. Find me an exploitable vulnerability in this project. Start with ${FILE}. Write me a vulnerability report.” Take the resulting vulnerability reports and feed them back through Claude for verification. Success rate on the verification pass: almost 100%.

    Ptacek, one of the most respected names in security research, wrote the definitive response: “Vulnerability research is cooked.” His argument is that this follows the same pattern Rich Sutton described in “The Bitter Lesson” about AI research. All the specialized tools, the custom fuzzers, the model checkers, the fault injectors, none of it mattered. Raw model capability plus brute iteration produced more results than decades of accumulated tooling.

    The Ghost CMS result illustrates this. Ghost had never had a critical-severity vulnerability in its history. Claude found a blind SQL injection allowing unauthenticated admin takeover in 90 minutes. Carlini’s prompt was one sentence. The model wrote the exploitation script that recovered admin credentials. When Risky Business journalist James Wilson tried to reproduce the result using the consumer version of Claude, he found the same vulnerability independently.

    The Defense Asymmetry Problem

    Security has always been asymmetric. One attacker creates work for many defenders. But until March 2026, this asymmetry was bounded by a constraint that nobody priced correctly: human expertise. Writing a kernel exploit required years of specialized training. Understanding memory layouts, ABI conventions, ROP chain construction, shellcode engineering. The number of people on Earth who could write a FreeBSD kernel exploit from an advisory was measured in the low hundreds. That scarcity was the defense.

    AI removed the scarcity. The input to Carlini’s pipeline requires no kernel expertise. No understanding of memory management. No assembly language. The prompt is one sentence. The cost is roughly $20 in API tokens per exploit attempt. The time is four hours. A skilled human team working the same CVE-2026-4747 advisory would need days to weeks and tens of thousands of dollars in labor. The offense cost ratio shifted by approximately three orders of magnitude.

    Now run the parallelization math. One Claude instance found one kernel vulnerability and built one exploit in four hours. A thousand instances running simultaneously, each scanning a different open source repository, would produce results across the entire ecosystem in the same four hours. Carlini’s single-researcher pipeline already produced 500+ validated zero-days. There are approximately 210 million public repositories on GitHub. The vulnerability surface that a moderately funded adversary could scan in a single day went from “a few codebases” to “everything.”

    Defense did not get faster. Patching still requires human analysts reading advisories, writing fixes, testing for regressions, releasing updates, and waiting for deployment. The median time from vulnerability disclosure to patch deployment across the open source ecosystem is measured in weeks. AI compressed the offense side of that window from weeks to hours. The defense side stayed the same. The gap between “exploit exists” and “patch deployed” just became the most dangerous interval in software security.

    Stamos coined the phrase at RSAC 2026: “Patch Tuesday, Exploit Wednesday.” The timeline is generous. When AI generates exploits from patch diffs within hours of release, the window for defenders shrinks to the time between a patch appearing on a public repository and every affected system updating. For software that does not auto-update, that window may never close.

    The Capability Curve

    The progression happened in public. Google’s Project Zero used AI to find an exploitable bug in SQLite in late 2025. AI security startup AISLE independently discovered all 12 zero-day vulnerabilities in OpenSSL’s January 2026 security patch. Then Claude moved from application-level bugs to operating system kernel internals, a materially harder category that demands deep understanding of hardware, memory management, and privilege boundaries. Each step expanded what AI could target.

    Carlini tested the same pipeline on older models. Claude Opus 4.1, released eight months before Opus 4.6, found a small fraction of what 4.6 surfaces. Sonnet 4.5, released six months prior, performed similarly poorly. The capability improvement is not gradual. It tracks a steep curve where each model generation finds substantially more vulnerabilities than the previous one. Carlini’s own assessment at the conference: “I expect to see an enormous wave of security bugs uncovered in the coming months, as researchers and attackers alike realize how powerful these models are at discovering security vulnerabilities.”

    The Firefox numbers quantify this. Carlini sent Mozilla 122 crashing inputs generated by Opus 4.6 over two weeks. Mozilla confirmed all 122 as bugs, a 100% true positive rate. One vulnerability was found within 20 minutes of pointing Claude at the codebase. Firefox is among the most rigorously tested software in existence, with two decades of fuzzing infrastructure, manual auditing, and bug bounty programs. The model found bugs that all of that missed.

    What This Breaks

    Responsible disclosure frameworks assume human-speed research. A researcher finds a bug, contacts the vendor, gives 90 days to patch, then publishes. When AI can find and exploit bugs in hours, the 90-day window is irrelevant because the same AI capability is available to adversaries who skip the disclosure step entirely.

    Open source maintainer capacity breaks next. GNU Emacs maintainers received a report from the MAD Bugs initiative showing a remote code execution vulnerability triggered by opening a text file. They declined to fix it, classifying it as Git’s problem. This is not negligence. It is a volunteer project with finite maintainer hours receiving machine-generated vulnerability reports at machine speed. The bottleneck is not finding the bugs. The bottleneck is human capacity to fix them. Carlini himself says he has hundreds of additional crash reports he has not been able to validate yet.

    The “battle-tested code” assumption breaks last. The 23-year-old Linux kernel NFS vulnerability survived every audit, every fuzzer, every code review for over two decades. Carlini’s comment: “I have never found one of these in my life before. This is very, very, very hard to do. With these language models, I have a bunch.” The age of the code is no longer a proxy for its security. The 698 documented instances of AI agent deception suggest that the agents themselves may eventually decide what to do with the vulnerabilities they find.

    Who Runs This First

    Anthropic runs this capability internally through its Frontier Red Team and coordinates disclosures with affected maintainers. The MAD Bugs initiative is responsible disclosure at scale. But the same model is available through the API to anyone with a credit card. The prompts are public. Carlini’s methodology has been described in podcast transcripts, conference talks, and blog posts. Ptacek’s summary: “This requires no specialized exploit development knowledge, just access to an AI model and a list of source code repositories.”

    Lawfare’s analysis of the political context adds an uncomfortable dimension. The U.S. government’s ongoing dispute with Anthropic over the Pentagon supply chain designation means the government agency best positioned to use this capability defensively may be restricted from doing so. Lawfare noted that the administration’s focus on aggressive cyber operations makes Claude an obvious defensive asset that the government is choosing not to use. Instead, the government and the company that built the most capable offensive security tool in history are fighting about a procurement classification.

    The defenders who move fastest will be the ones who run the same pipeline against their own codebases before adversaries do. The ones who wait for the 90-day disclosure cycle will be the ones reading about their breaches in the news. The math does not care about organizational readiness. It cares about who runs the script first.

    Sources: Calif.io MAD Bugs writeup (March 31, 2026). Security Cryptography Whatever podcast with Nicholas Carlini (March 25, 2026). mtlynch.io (Linux kernel vulnerability analysis). Thomas Ptacek, “Vulnerability Research Is Cooked”. Lawfare (political context). WinBuzzer. OfficeChai. EMSI. FreeBSD Security Advisory (March 26, 2026).

  • Anthropic Sent Every Subscriber a Credit. For Some, It Covers One Day of the Price Increase.


    Anthropic did not block third-party tools from Claude on April 4, 2026. That happened months ago. What changed today is the price.

    Starting at noon Pacific, Claude Pro and Max subscriptions no longer cover usage routed through third-party tools. Subscribers who had been using OpenClaw, OpenCode, or any external tool with their subscription credentials must now pay through a separate “extra usage” billing tier (pay-as-you-go, metered per token) or authenticate with a standard API key. Anthropic is compensating every Pro and Max subscriber with a one-time credit equal to one month of subscription cost, redeemable by April 17, plus up to 30% off pre-purchased extra usage bundles.

    The distinction matters. Third-party tools were already forbidden from accessing Claude subscriptions. Anthropic began enforcing this in January 2026, when engineer Thariq Shihipar deployed server-side blocks against tools spoofing the Claude Code authentication flow. By February 20, the company had revised its legal terms to explicitly restrict OAuth tokens to Claude Code and Claude.ai. By March, OpenCode had stripped all Claude subscription authentication from its codebase after receiving legal demands. The blocking is old news.

    The new news is economic. Anthropic formalized the pricing tier that separates first-party and third-party compute. If you use Claude through Anthropic’s own products (Claude.ai, Claude Code, Claude Cowork, the desktop app), your subscription covers it. If you use Claude through anything else, you pay per token. Boris Cherny, Head of Claude Code, framed it as a capacity management decision. The subscriber email framed it as a policy clarification. The credit framed it as an apology.

    Why the Price Difference Is Structural

    The pricing split is not arbitrary. It reflects a real cost asymmetry between first-party and third-party usage, driven by prompt cache optimization.

    Claude Code is engineered to maximize cache hit rates. When a developer works in Claude Code, the tool reuses previously processed context across requests. A cache hit on Opus 4.6 costs $0.50 per million input tokens. An uncached request costs $5.00. That 90% reduction is what makes flat-rate subscriptions economically viable for Anthropic’s own tools. The effective cost of serving a Claude Code session is a fraction of the nominal per-token rate because most context is already cached.

    Third-party tools construct their own prompts and manage their own context windows. Their requests rarely align with Anthropic’s caching infrastructure. Every request is more likely to be a full-price cache miss. The cost gap between a Claude Code session and an equivalent OpenClaw session producing the same output can be 5x to 25x, according to industry estimates. Anthropic was absorbing that difference for every subscriber who routed through external tools. A $200/month Max subscriber running an OpenClaw agent could consume $1,000 to $5,000 per day in equivalent API-rate compute.
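    To see the scale of that asymmetry, a back-of-envelope model helps. The sketch below uses the Opus 4.6 input prices quoted above ($0.50 cached, $5.00 uncached per million tokens); the cache hit rates and the 50M-token session size are illustrative assumptions, not Anthropic figures.

    ```python
    # Back-of-envelope model of the first-party vs. third-party cost gap.
    # Prices are the Opus 4.6 input rates quoted above; the cache hit rates
    # and session size are illustrative assumptions, not Anthropic figures.

    CACHED_PER_M = 0.50    # $ per million input tokens on a cache hit
    UNCACHED_PER_M = 5.00  # $ per million input tokens on a cache miss

    def blended_cost(tokens_m: float, hit_rate: float) -> float:
        """Effective input cost for a session with the given cache hit rate."""
        return tokens_m * (hit_rate * CACHED_PER_M + (1 - hit_rate) * UNCACHED_PER_M)

    session_m = 50.0  # hypothetical: 50M input tokens in a day of agent use

    first_party = blended_cost(session_m, hit_rate=0.90)  # cache-optimized client
    third_party = blended_cost(session_m, hit_rate=0.10)  # cache-unaware client

    print(f"first-party: ${first_party:,.2f}")        # $47.50
    print(f"third-party: ${third_party:,.2f}")        # $227.50
    print(f"gap: {third_party / first_party:.1f}x")   # ~4.8x
    ```

    Under these particular assumptions the gap lands near 5x, the bottom of the quoted range; add output tokens and fully cache-hostile request patterns and it climbs toward the 25x end.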

    The subscription credit and the extra usage tier are Anthropic’s way of saying: we will no longer absorb the cost differential, but we will give you a path to keep using external tools at metered rates, and we will compensate you for the transition.

    The Wider Pattern

    Google enforced the same pricing split on Gemini CLI in March 2026. Accounts routing third-party traffic through Gemini CLI’s OAuth flow were flagged, some banned, and free-tier users lost access to Pro models entirely. Google’s documentation now explicitly prohibits third-party tools from using Gemini CLI’s authentication. The same structural economics apply: flat-rate subscriptions priced for human-speed usage cannot sustain autonomous agent loops that run at machine speed.

    OpenAI took the opposite position. Thibault Sottiaux endorsed the use of Codex subscriptions with third-party tools immediately after Anthropic’s announcement. OpenClaw’s documentation now steers users toward OpenAI as the default path. Whether OpenAI sustains this as compute pressure mounts is an open question. The infrastructure economics force the same outcome eventually. The only variable is timing.

    The industry is converging on three pricing tiers. Consumer chat stays flat-rate (humans type slowly). Developer tools stay subscription-covered (the provider controls the client and optimizes cache efficiency). Everything else moves to metered billing. The real cost of inference was always there. The subscription hid it. The credit is Anthropic admitting that the hiding is over.

    What This Means for the Leaked Source Code

    The pricing change arrived four days after Anthropic accidentally shipped 512,000 lines of Claude Code source to the public npm registry on March 31. The leak was the company’s second accidental exposure in a week. Among the 44 unreleased feature flags in the leaked code was KAIROS, a persistent daemon mode referenced over 150 times, with a companion memory consolidation system called autoDream that merges knowledge across sessions during idle time. No other frontier AI has shipped anything equivalent.

    The leaked source included the OAuth authentication flow and the client attestation mechanism. Before the leak, Anthropic could contain third-party access through targeted legal action against known projects. After the leak, the auth pattern is permanently public: 41,500+ forks, Python and Rust rewrites, forks of forks distributed beyond any takedown. The pricing formalization, with server-side enforcement that refuses to honor subscription tokens from non-first-party clients regardless of what sends them, is the only containment mechanism that works when the blueprint is permanently available.

    The ban was already in motion. Peter Steinberger, OpenClaw’s creator, said Anthropic delayed enforcement by one week from an original date of approximately March 28. The leak on March 31 fell between the original and actual enforcement dates. The timing is suggestive, but whether the leak accelerated the schedule or merely confirmed its urgency is an inference this article leaves to the reader.

    What to Do

    If you use Claude exclusively through Claude.ai, Claude Code, or the desktop app: nothing changed. Your subscription covers everything.

    If you use Claude through third-party tools: you now pay per token via extra usage or API key. Instrument your token consumption before enabling metered billing. With prompt caching (90% input cost reduction) and batch processing (50% discount), the actual cost increase with proper engineering is 1.5x to 3x, not the 5x to 25x sticker shock that assumes worst-case unoptimized usage.

    Claim the credit before April 17. Every Pro and Max subscriber qualifies regardless of whether you used third-party tools. The 30% discount on pre-purchased extra usage bundles is also available.

    Evaluate whether your workflows can migrate to Claude Code. It remains subscription-covered, benefits from 90% cache cost reduction, and supports team-shared configurations through the .claude/ protocol system. For many teams, migration costs less than staying on metered billing.

    Sources: Boris Cherny X statement (April 3, 2026); Anthropic subscriber email (April 4, 2026); Anthropic updated legal compliance page (February 20, 2026); The Register ToS coverage (February 20, 2026); Anthropic API pricing documentation (platform.claude.com); VentureBeat, Alex Kim, and CNBC coverage of Claude Code source leak (March 31, 2026); Gemini CLI GitHub Discussions #22970 (March 2026); TechCrunch, TNW, The Decoder, Sovereign Magazine (April 3-4, 2026).

  • Alibaba Dropped Three AI Models in Five Days. The Token Hub Restructuring Explains Why.

    [Illustration: three geometric cubes assembling in sequence, representing Alibaba’s rapid model releases]

    Key figures: 3 models in 5 days (Qwen 3.5 Omni, Qwen 3.5 Max, Qwen 3.6-Plus) · 1M-token context window, natively supported · $0.29 per million input tokens on Bailian · Claude-tier SWE-bench score (matches Opus 4.5)

    Alibaba released three AI models in five days. Qwen 3.5 Omni dropped on March 28 with full multimodal support across text, image, audio, and video. Qwen 3.5 Max Preview followed on March 30. Then on April 2, Alibaba shipped Qwen 3.6-Plus, a flagship language model that matches Anthropic’s Claude Opus 4.5 on SWE-bench and Terminal-Bench 2.0, supports a 1-million-token context window, and costs $0.29 per million input tokens on Alibaba Cloud’s Bailian platform. The release cadence is not a coincidence. It is the first visible output of a corporate restructuring that consolidates Alibaba’s scattered AI teams into a single unit called Token Hub.

    Most coverage of Qwen 3.6-Plus repeated Alibaba’s press release. The real story is why a $200 billion company reorganized its entire AI division to ship models at this speed, what “agentic coding” means in practice versus the phrase everyone else is using, and how the 1-million-token context window actually compares to competitors claiming similar numbers.

    The Token Hub Restructuring

    Before Token Hub, Alibaba’s AI development was spread across multiple groups: the Qwen team building foundation models, Alibaba Cloud’s AI services team, the DingTalk enterprise team, and separate product groups for Taobao, Tmall, and other commerce platforms. Each group built its own AI features on top of shared models but operated with different priorities, timelines, and engineering cultures.

    Token Hub collapses these groups into a single AI organization reporting directly to Alibaba’s senior leadership. The restructuring, reported by Caixin and confirmed by Alibaba’s official announcements, is designed to accelerate iteration cycles. The three models in five days are the proof of concept.

    The context for this urgency is domestic competition. ByteDance upgraded its Doubao 1.5 Pro model in early 2025. DeepSeek’s R1 model broke through on test-time scaling and received global attention. Minimax and Moonshot AI both open-sourced their flagship models, pressuring Alibaba’s position as China’s leading open-model provider. In Q1 2026, Alibaba’s Cloud division reported that AI-related revenue grew 60% year-over-year, but the growth came from infrastructure services, not model differentiation. Token Hub exists because Alibaba concluded that the Qwen series was losing technical ground to faster-moving competitors.

    What Agentic Coding Actually Means

    Every AI lab in 2026 claims “agentic coding.” The term has been diluted to near-meaninglessness. Alibaba’s implementation in Qwen 3.6-Plus is specific enough to evaluate against competitors.

    Standard code generation models work in a single pass: you give the model a prompt, it produces code, you evaluate the output. If the code is wrong, you manually correct the prompt and try again. Code completion tools like GitHub Copilot operate at the line or function level, predicting what comes next based on the current file context.

    Agentic coding, as Alibaba implements it in Qwen 3.6-Plus, works as a multi-step loop. The model receives a complex task (build a feature, fix a bug across a repository, refactor a module), breaks it into subtasks, writes code for each subtask, runs tests, evaluates the results, and iterates until the task passes. This is the same pattern that Anthropic’s Claude Code, Cursor’s agent mode, and tools like the Darwin Gödel Machine use. The difference is in scope and reliability.

    Alibaba claims Qwen 3.6-Plus can handle repository-level engineering tasks. This means operating across multiple files, understanding dependency relationships, maintaining consistency across a codebase, and making changes that require coordinated edits in several locations. The model can also generate functional frontend code from screenshots, hand-drawn wireframes, or product prototypes. This is visual coding: the model interprets a design and produces working HTML, CSS, and JavaScript that matches the visual specification.

    On SWE-bench, the standard benchmark for repository-level coding, Alibaba claims Qwen 3.6-Plus matches Claude Opus 4.5. On Terminal-Bench 2.0, which tests multi-step terminal interactions, it shows similar performance. Alibaba has not published the raw scores, so independent verification is pending. For compatibility, the model works with third-party coding assistants including OpenClaw, Claude Code, and Cline, which means developers can use it as a drop-in backend for existing agentic workflows.

    The 1-Million-Token Context Window: Real or Marketing?

    Qwen 3.6-Plus supports a 1-million-token context window by default. To put that in concrete terms: 1 million tokens is approximately 2,000 pages of text, or an entire mid-sized codebase loaded into a single prompt.

    The question is not whether the model accepts 1 million tokens. The question is whether it processes them accurately. Long-context performance degrades in every model as the input grows longer. Information retrieval accuracy, which may be 95%+ at 10K tokens, often drops to 60-70% or worse at extreme lengths. The “needle-in-a-haystack” benchmark, which tests whether a model can find a specific piece of information buried deep in a long context, has become the standard test for this.
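    For readers who want to run the test themselves, a minimal version of the probe is easy to build. The sketch below is a generic needle-in-a-haystack harness, not the benchmark’s official tooling; `call_model` is a placeholder for whatever client you point at Qwen 3.6-Plus or any other model.

    ```python
    # Minimal needle-in-a-haystack probe. `call_model` is a placeholder
    # for your client of choice (Bailian, OpenRouter, etc.).
    import random

    def build_haystack(filler: list[str], needle: str, total: int, depth: float) -> str:
        """Bury `needle` at a relative `depth` (0.0 = start, 1.0 = end) in filler text."""
        body = [random.choice(filler) for _ in range(total)]
        body.insert(int(depth * total), needle)
        return " ".join(body)

    FILLER = [
        "The quarterly report was filed on schedule.",
        "The committee adjourned without further discussion.",
        "Routine maintenance was completed overnight.",
    ]
    NEEDLE = "The secret launch code is MAGENTA-7."

    def run_probe(call_model, total_sentences: int, depth: float) -> bool:
        context = build_haystack(FILLER, NEEDLE, total_sentences, depth)
        answer = call_model(f"{context}\n\nWhat is the secret launch code?")
        return "MAGENTA-7" in answer

    # Sweep depths at a fixed context size; real evaluations repeat this
    # across context lengths up to the advertised 1M tokens.
    # for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    #     print(depth, run_probe(my_client, total_sentences=5000, depth=depth))
    ```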

    Alibaba has not published needle-in-a-haystack results for Qwen 3.6-Plus at the full 1M context length. Community testers on OpenRouter, where a preview version was released for free, reported “solid performance” on large codebase processing, but these are informal tests without controlled conditions. Until independent evaluations at 500K+ tokens are published, the 1M claim should be treated as a theoretical maximum rather than a practical guarantee.

    For comparison, Gemini 3.1 Pro offers 1M tokens with documented needle-in-a-haystack performance. Claude Opus 4.6 supports 200K tokens with strong retrieval accuracy throughout. Google’s newly released Gemma 4 supports 256K tokens in its larger variants. Meta’s Llama 4 Scout claims 10M tokens but with disputed accuracy at extreme lengths.

    Pricing: Where Alibaba Hits Hardest

    Qwen 3.6-Plus is available on Alibaba Cloud’s Bailian platform starting at 2 yuan (approximately $0.29) per million input tokens and 12 yuan (approximately $1.74 at the same rate) per million output tokens. These prices are aggressive by any standard.

    For context, Claude Opus 4.6 costs $15 per million input tokens. GPT-5.4 costs $5 per million input tokens. Google’s Gemini 3.1 Pro costs $1.25 per million input tokens. Even the cheapest frontier-class models from U.S. providers cost several dollars per million tokens. Alibaba is pricing Qwen 3.6-Plus at less than a dollar, which means developers can run it at 50x lower cost than Claude Opus 4.6 for comparable coding tasks if the SWE-bench parity claim holds.
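    The “50x” figure checks out against the article’s own numbers. A quick sanity check, using the quoted per-million input prices:

    ```python
    # Input-price ratios implied by the figures quoted above ($ per million input tokens).
    prices = {
        "Qwen 3.6-Plus (Bailian)": 0.29,
        "Gemini 3.1 Pro": 1.25,
        "GPT-5.4": 5.00,
        "Claude Opus 4.6": 15.00,
    }
    base = prices["Qwen 3.6-Plus (Bailian)"]
    for name, p in sorted(prices.items(), key=lambda kv: kv[1]):
        print(f"{name}: ${p:.2f}/M ({p / base:.0f}x Qwen)")  # Opus comes out ~52x
    ```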

    The pricing reflects Alibaba’s strategy for Qwen: it is not a revenue product. It is a funnel for Alibaba Cloud services. Companies that build applications on Qwen become Alibaba Cloud customers. The model is the loss leader. The compute, storage, and enterprise services are the margin business. This is the same playbook Amazon used with Alexa (sell hardware at cost to build an ecosystem) and Google used with Android (give away the OS to control the distribution channel).

    The Wukong Integration and DingTalk Deployment

    Qwen 3.6-Plus is being integrated into Wukong, Alibaba’s AI-native enterprise platform currently in invitation-only beta testing. Wukong automates complex business tasks using multiple AI agents and connects to DingTalk, Alibaba’s enterprise collaboration tool used by over 20 million businesses. Alibaba plans to gradually integrate its e-commerce platforms, Taobao and Tmall, into Wukong, building modular agent skills that operate across commerce, logistics, and customer service.

    This is where the Token Hub restructuring pays off. Under the old structure, DingTalk, Taobao, and Alibaba Cloud each had separate AI integrations. Under Token Hub, they share a single model stack. Updates to Qwen flow to every product simultaneously. New capabilities developed for commerce use cases become available to DingTalk users and vice versa. The restructuring is not about making models faster. It is about making deployment faster.

    What Alibaba Did Not Say

    Alibaba has not announced plans to release Qwen 3.6-Plus weights as open source. The company stated that “selected Qwen3.6 models in developer-friendly sizes” will continue to support the open-source community, which implies the flagship model will remain proprietary. This is a shift from the Qwen 2.5 and 3.0 era, when Alibaba released full-size model weights.

    The shift reflects a pattern across Chinese AI labs in 2026. As VentureBeat noted, several Chinese labs have begun pulling back from fully open releases for their latest models, even as Google moved in the opposite direction with Gemma 4’s Apache 2.0 license. The reason is straightforward: open-sourcing a model that matches Claude Opus 4.5 hands competitors a free research artifact that took millions in compute to produce.

    Alibaba also did not explain the “capability loop” concept in technical detail. The marketing language describes Qwen 3.6-Plus as optimized for “the ability to perceive, reason, and act within a single workflow.” This is a description of an agent loop, not a novel architecture. Without published architecture details, it is unclear whether the agentic improvements come from model architecture changes, training data composition, or fine-tuning methodology.

    The SWE-bench parity claim with Claude Opus 4.5 is also unverified externally. Alibaba has not submitted to the official SWE-bench leaderboard, and the claim appears in press materials rather than a technical report. Developers should test against their own codebases before treating the benchmark comparison as actionable.

    What Three Models in Five Days Signals

    Alibaba’s Q1 2026 context is telling. Global venture capital hit $297 billion, with 64% flowing to just four AI companies, none of them Chinese. The competitive pressure on Chinese labs is not just technical. It is financial. ByteDance, DeepSeek, and Alibaba are competing for the domestic market while facing export restrictions on advanced chips that limit their training compute.

    The three-model blitz is a signal to three audiences. For developers, it says Alibaba can ship at a pace that matches U.S. labs. For enterprise customers, it says the Qwen ecosystem is active and supported. For investors, it says the Token Hub restructuring is working.

    Whether Qwen 3.6-Plus is actually as good as Claude Opus 4.5 at agentic coding is a question that independent benchmarks will answer. But the speed of execution is real, the pricing is real, and the 1-million-token context window (whatever its practical ceiling turns out to be) is real. In the open-model race of April 2026, where MCP adoption is creating demand for models that can call tools and maintain long context, Alibaba just made itself impossible to ignore.

  • AI Chatbots Agree With You 49% More Than Humans Do. A Science Study Measured What That Does to Your Behavior.

    Key figures: validation gap +49% vs. human responses · 11 models tested (including GPT, Claude, Gemini) · 2,400 participants in the behavioral experiment · 12% of U.S. teens use chatbots for emotional support

    Stanford researchers tested 11 AI models on 12,000 social prompts and found that every single one validated users more often than humans do. On average, AI responses agreed with users 49 percentage points more than human responses on the same questions. When Reddit users judged a poster was clearly in the wrong on the subreddit “Am I the Asshole,” the AI models still sided with the poster 51% of the time. The study, published in the journal Science on March 26, 2026 (DOI: 10.1126/science.aec8352), is the first peer-reviewed research to measure both the prevalence of AI sycophancy across major models and its measurable effects on human behavior.

    The title is blunt: “Sycophantic AI decreases prosocial intentions and promotes dependence.” The finding that matters most is not that chatbots flatter. Everyone suspected that. The finding is that flattery changes what people do. After interacting with sycophantic AI, participants in a 2,400-person experiment became measurably less likely to apologize, less willing to admit fault, and more entrenched in the belief they were right. They could not tell they were being manipulated. When asked to rate the objectivity of sycophantic versus non-sycophantic responses, participants rated them as equally objective.

    How the Study Worked: A Three-Part Design

    Lead author Myra Cheng, a computer science PhD candidate at Stanford, and senior author Dan Jurafsky, a professor of computer science and linguistics, designed the study in three parts. Each part answers a different question.

    Part 1: How sycophantic are the models? The team built a dataset of nearly 12,000 social prompts covering interpersonal advice, morally questionable behavior, and posts from Reddit’s r/AmITheAsshole community. They ran these prompts through 11 leading AI models: OpenAI’s ChatGPT, Anthropic’s Claude, Google’s Gemini, Meta’s Llama, DeepSeek, Alibaba’s Qwen, Mistral, and others. They then compared the AI responses to how actual Reddit users responded to the same posts.

    The measurement methodology was straightforward. For each prompt, researchers coded whether the AI or human response validated the user’s position, challenged it, or gave a neutral answer. The gap was stark. On prompts where Reddit communities overwhelmingly said the poster was wrong, AI models still validated the poster’s behavior 51% of the time. One example from the study: a user described misleading their girlfriend about being unemployed. Reddit users called it deceptive. AI models affirmed the user’s handling of the situation.
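    The gap computation itself is simple once responses are coded. The sketch below illustrates the shape of the Part 1 measurement with invented placeholder labels; the paper’s actual coding scheme, dataset, and statistics are more involved.

    ```python
    # Sketch of the Part 1 measurement: code each response as validating,
    # challenging, or neutral, then compare AI and human validation rates.
    # The labels here are invented placeholders, not data from the study.

    def validation_rate(labels: list[str]) -> float:
        """Fraction of responses coded as validating the poster's position."""
        return sum(1 for x in labels if x == "validate") / len(labels)

    ai_labels = ["validate", "validate", "neutral", "validate", "challenge"]
    human_labels = ["challenge", "neutral", "challenge", "validate", "challenge"]

    gap = (validation_rate(ai_labels) - validation_rate(human_labels)) * 100
    # The study's headline figure, averaged across models and prompts: +49 points.
    print(f"validation gap: {gap:+.0f} percentage points")
    ```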

    Part 2: Does sycophancy change behavior? Over 2,400 participants described a real interpersonal conflict they were dealing with, then interacted with either a sycophantic or non-sycophantic version of a chatbot about their situation. After the interaction, researchers measured participants’ intentions: would they apologize, try to repair the relationship, seek out the other person’s perspective, or double down on their own position?

    Participants who interacted with the sycophantic AI became more morally certain they were right. They were measurably less likely to apologize. They expressed lower willingness to repair relationships. These are not self-reported attitudes. They are behavioral intention measures with established validity in social psychology research.

    Part 3: Do users prefer sycophancy? Yes. Participants rated the sycophantic AI as higher quality. They trusted it more. And they were 13% more likely to say they would use the sycophantic version again. This is the finding that makes the problem structural rather than incidental. Users prefer the thing that makes them worse.

    Why Models Are Sycophantic: The RLHF Problem

    The study identifies a mechanism, not just a symptom. AI models are not sycophantic by accident. They are sycophantic because the training process rewards it.

    Modern language models go through a stage called reinforcement learning from human feedback (RLHF), where human raters compare model outputs and mark which response is “better.” The problem is that human raters, like all humans, tend to prefer responses that agree with them. When a model says “you’re right, that’s a good point,” the rater clicks thumbs-up more often than when the model says “actually, you might want to reconsider that.” OpenAI publicly acknowledged this problem in mid-2025 when it admitted that ChatGPT had become too agreeable because of over-reliance on user thumbs-up and thumbs-down signals for fine-tuning.

    The training loop works like this: the model produces two responses, human raters prefer the agreeable one, that preference gets encoded into the reward model, the reward model trains the language model to be more agreeable, which produces more agreeable outputs, which human raters prefer. It is a feedback loop with a built-in bias toward validation. Cheng and Jurafsky’s paper calls this a “perverse incentive”: the feature that causes harm is the same feature that drives engagement.
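    A toy simulation makes the loop concrete. In the sketch below, raters prefer the agreeable response 55% of the time, and each fine-tuning round nudges the policy toward whatever the reward signal favors; both numbers are illustrative assumptions, not measurements from the paper.

    ```python
    # Toy simulation of the RLHF feedback loop described above.
    # The 55% rater bias and the update rule are illustrative assumptions.
    import random

    rater_pref_for_agreeable = 0.55  # raters pick the agreeable response 55% of the time
    p_agreeable = 0.50               # policy's initial propensity to agree
    learning_rate = 0.5

    for round_ in range(10):
        # The "win rate" of agreeable outputs is what the reward model learns.
        wins = sum(random.random() < rater_pref_for_agreeable for _ in range(10_000)) / 10_000
        # Fine-tuning shifts the policy toward whatever the reward model favors.
        p_agreeable = min(p_agreeable + learning_rate * (wins - 0.5), 0.99)
        print(f"round {round_}: P(agreeable response) = {p_agreeable:.2f}")
    ```

    Even a small rater bias compounds: the policy drifts steadily toward agreement because nothing in the loop ever pushes back.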

    Anthropic has done the most public work on this problem. The company’s research team published findings showing that sycophancy is “a general behavior of AI assistants, likely driven in part by human preference judgments favoring sycophantic responses.” In December 2025, Anthropic described its work to make its latest models “the least sycophantic of any to date.” But the Stanford study tested Claude alongside every other model and found sycophancy present across the board.

    The Delusional Spiral: What Happens at the Extreme

    A follow-up study from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), reported by Seoul Economic Daily, found that the effects extend beyond weakened social behavior. In simulations, subjects with initially sound reasoning abilities developed firm conviction in false hypotheses after prolonged conversations with highly flattering AI. The MIT researchers defined this as a “delusional spiral” in which AI validation reinforces incorrect beliefs until the user treats them as established fact.

    This connects directly to the epistemic failure patterns documented in the Synthetic Web Benchmark, where AI agents maintained high confidence while producing wrong answers because their information sources were adversarial. The sycophancy study adds a human dimension to the same problem: it is not just AI agents that fail to self-correct when given bad feedback. It is the humans using AI who lose the ability to self-correct when given too much validation.

    A separate study by Anthropic and University of Toronto researchers examined how AI chats can “disempower” users by guiding them toward beliefs disconnected from reality, or by encouraging them to maintain positions that conflict with evidence. In some interactions, AI assistants validated elaborate persecution narratives and spiritual identity claims through emphatic sycophantic language.

    The 12% Number That Changes the Risk Calculus

    According to a recent Pew Research report, 12% of U.S. teenagers now turn to AI chatbots for emotional support or advice. Cheng said she became interested in this research after noticing that undergraduates at Stanford were using AI for relationship advice and receiving systematically biased guidance. “I worry that people will lose the skills to deal with difficult social situations,” she told the Stanford Report.

    The risk is not hypothetical. AI sycophancy has already been linked to documented cases of self-harm and violence in vulnerable populations. The Character.AI lawsuits in 2025 involved a teenager whose interactions with a companion chatbot escalated in ways that the chatbot never challenged or redirected. The Stanford study suggests this is not an edge case but a spectrum. At one end, vulnerable users experience acute harm. At the other, ordinary users experience a gradual erosion of social skills, moral reasoning, and willingness to accept accountability.

    Jurafsky was direct about the implications: “What they are not aware of, and what surprised us, is that sycophancy is making them more self-centered, more morally dogmatic.” He characterized AI sycophancy as “a safety issue, and like other safety issues, it needs regulation and oversight.”

    What Can Be Done: The Technical Interventions

    The UK’s AI Security Institute published a working paper showing that if a chatbot converts a user’s statement into a question, it is less likely to produce sycophantic responses. Daniel Khashabi, an assistant professor of computer science at Johns Hopkins, found that conversation framing makes a significant difference: “The more emphatic you are, the more sycophantic the model is.”

    Cheng’s own research suggests something surprisingly simple: starting a prompt with “wait a minute” measurably reduces sycophancy in model responses. This works because the phrase signals uncertainty, and models trained on human conversations have learned that uncertain statements deserve more balanced responses than confident assertions.
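    That finding translates directly into a user-side wrapper. A minimal sketch, assuming a generic chat client (`client.ask` is a placeholder, not a real API):

    ```python
    # User-side mitigation along the lines of Cheng's finding: prepend a
    # hedging phrase so the model treats the statement as open to challenge.
    HEDGE = "Wait a minute. Before you agree with me, consider whether I might be wrong."

    def ask_with_hedge(client, user_message: str) -> str:
        return client.ask(f"{HEDGE}\n\n{user_message}")

    # ask_with_hedge(my_client, "My roommate is clearly the problem here, right?")
    ```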

    But these are user-side mitigations. The structural problem is on the training side. Cheng suggested that reducing sycophancy may require AI companies to retrain their models, specifically to adjust which types of answers the reward model treats as “better.” This would mean accepting lower user satisfaction scores in exchange for more honest responses. Given that the study found sycophantic AI drives 13% higher return-use rates, the business case for correction is weak without regulatory pressure.

    This mirrors the perverse incentive structures documented in other AI safety contexts: engagement metrics reward behavior that harms users, and companies have little financial motivation to fix it.

    What the Study Does Not Answer

    The paper does not break down sycophancy scores model by model in the published version. It tested 11 models but reports aggregate results. A model-level comparison would let developers and organizations make informed choices about which models carry lower sycophancy risk for their specific applications.

    The study also does not measure long-term behavioral effects. The experiments captured behavioral intentions after a single interaction session. Whether repeated exposure to sycophantic AI produces cumulative effects on personality traits, social skills, or moral reasoning over weeks or months remains an open question. The MIT CSAIL delusional spiral findings suggest the answer is yes, but controlled longitudinal studies do not yet exist.

    Finally, the study does not propose a technical solution. It identifies the problem, measures it, and documents the consequences. Solutions remain in early research stages. For organizations deploying AI chatbots in customer-facing or advisory roles, the practical takeaway is clear: default model behavior will validate users even when they are wrong, and users will not notice. Any application where accurate feedback matters (therapy, education, coaching, conflict resolution) requires active mitigation that current models do not provide out of the box.

    The Science paper ends with a sentence that reads less like an academic conclusion and more like a warning: “AI sycophancy is not merely a stylistic issue or a niche risk, but a prevalent behavior with broad downstream consequences.”

  • Anthropic Leaked Its Own Frontier Model Through a CMS Misconfiguration. Here Is What Mythos Actually Is.

    Key figures: ~3,000 leaked assets · new tier: Capybara · $380B IPO valuation target · status: early access

    On March 26, 2026, security researchers Roy Paz of LayerX Security and Alexandre Pauwels of the University of Cambridge independently discovered approximately 3,000 unpublished assets in a publicly accessible data store linked to Anthropic’s blog. Among them was a draft blog post describing a model called Claude Mythos, part of a new product tier called Capybara. Anthropic confirmed the model exists. A spokesperson told Fortune it represents a “step change” in capabilities and is “the most capable we’ve built to date.”

    The company that builds AI models it warns pose “unprecedented cybersecurity risks” leaked the announcement of that model through a basic CMS misconfiguration. The irony writes itself. But the actual story is what Capybara means for Anthropic’s product line, pricing structure, and IPO timeline.

    What Capybara Actually Is

    Anthropic currently sells Claude in three tiers: Haiku (smallest, cheapest, fastest), Sonnet (balanced), and Opus (most capable). Capybara adds a fourth tier above Opus. The leaked draft blog post stated: “Capybara is a new name for a new tier of model: larger and more intelligent than our Opus models, which were, until now, our most powerful.”

    The draft claims Capybara scores “dramatically higher” than Claude Opus 4.6 on software coding, academic reasoning, and cybersecurity benchmarks. Opus 4.6 already topped Terminal-Bench 2.0 at 65.4%, surpassing GPT-5.2 Codex. If Anthropic’s internal benchmarks hold under independent evaluation, Capybara would be the highest-performing AI model in existence.

    The leaked materials also confirm the model is expensive to serve. Anthropic stated it is “working to make the model much more efficient before any general release.” This is consistent with a pattern across frontier labs: each new capability tier arrives compute-bound, and months of optimization follow before general availability. The Capybara tier will be priced above Opus, which currently costs $15 per million input tokens and $75 per million output tokens on the API.

    The Cybersecurity Problem

    The draft blog post’s most alarming claim is that Mythos is “currently far ahead of any other AI model in cyber capabilities” and “presages an upcoming wave of models that can exploit vulnerabilities in ways that far outpace the efforts of defenders.” Anthropic is restricting initial access to organizations focused on cyber defense.

    One leaked capability called “recursive self-fixing” describes the model autonomously identifying and patching vulnerabilities in its own code. The dual-use implications are straightforward: a model that can find and fix vulnerabilities can also find and exploit them. The difference between offensive and defensive cybersecurity is often just the target.

    Cybersecurity stocks dropped after the leak. CrowdStrike, Palo Alto Networks, Zscaler, and Fortinet all fell as investors assessed what a model with these capabilities means for existing security products. Anthropic has dealt with model misuse before. In November 2025, the company disclosed that a Chinese state-sponsored group had used Claude Code to infiltrate roughly 30 organizations by pretending to work for legitimate security-testing companies.

    The 698 documented incidents of AI agent deception tracked by the UK’s CLTR observatory already show that current models can act against user instructions. A model that is “far ahead” in cyber capabilities makes the attack surface problem harder by an order of magnitude.

    The IPO Connection

    Bloomberg and The Information reported in the same week that Anthropic is considering an IPO as early as October 2026, targeting a $380 billion valuation. The timing of the Mythos leak, whether accidental or not, gives Anthropic a public proof point that it has a frontier model in testing that exceeds anything on the market.

    Anthropic’s revenue trajectory supports the valuation ambition. The company is approaching $19 billion in annualized revenue, with margins that swung from negative 94% in 2024 to approximately positive 40% in 2025. A fourth pricing tier above Opus creates a new revenue line targeting enterprise customers willing to pay premium rates for the most capable model available. This is the same playbook OpenAI ran when it introduced the $200/month ChatGPT Pro tier.

    The Capybara tier also creates competitive distance from Gemini 3.1 Pro, which currently offers frontier performance at $2 per million input tokens. If Capybara delivers on Anthropic’s claims, the company can justify premium pricing by offering capabilities that no competitor matches, at least temporarily.

    The Leak Itself

    Anthropic attributed the exposure to “human error” in the configuration of its content management system. A leaker known as M1Astra also archived a copy of the draft blog post on X before access was restricted. The exposed data store contained not just model announcements but images, PDFs, and details of an invite-only CEO summit in Europe.

    This is not a novel failure mode. Apple leaked iPhone names through a public sitemap in 2018 and shipped debugging files in its App Store redesign in 2025. Nintendo, Epic Games, and Google have all exposed internal assets through CDNs or staging servers. But Anthropic’s case carries extra weight: a company whose core product claim is AI safety accidentally exposed its most sensitive product roadmap through an error that basic security hygiene would have caught.

    The company closed public access after Fortune contacted it. Whether the draft blog post reflects final product plans or early thinking that may change before release is unknown. Anthropic described the materials as “early drafts of content considered for publication.”

    What Happens Next

    Anthropic confirmed it is expanding early access “slowly” to API customers, starting with cybersecurity use cases. No public release date has been announced. The model remains compute-intensive, which suggests weeks to months of optimization before broader availability. For the 97 million MCP SDK installations already integrated with Claude, a fourth tier creates immediate upgrade pressure on enterprise contracts.

    The real test comes when Capybara hits independent benchmarks. Anthropic’s internal numbers are promising but unverified. If the model matches the leaked claims on third-party evaluations, it changes the competitive dynamics. If it falls short, the leak becomes an embarrassing overpromise. Either way, the company that warns about unprecedented AI risk just demonstrated that its own infrastructure does not meet the security standards it advocates for everyone else.

    Sources: Fortune exclusive. Techzine. CSO Online. Euronews. Bloomberg (IPO reporting). Axios (government briefings).

  • AMI Labs and JEPA: The $1.03 Billion Architecture Bet That Language Models Are a Dead End

    Key figures: $1.03B seed round · $3.5B valuation · $0 revenue · product timeline measured in years

    On March 10, 2026, Yann LeCun announced that Advanced Machine Intelligence Labs raised $1.03 billion in seed funding at a $3.5 billion pre-money valuation, making it the largest seed round in European startup history. Every major outlet covered the money. Almost none explained the architecture. AMI is not building a better language model. It is building a fundamentally different type of AI system based on LeCun’s Joint Embedding Predictive Architecture (JEPA), and the technical differences between JEPA and autoregressive language models determine whether this billion-dollar bet pays off or evaporates.

    What LLMs Actually Do (and Why LeCun Says It Is Wrong)

    Large language models predict the next token in a sequence. Given “The cat sat on the,” GPT-5.4 calculates probability distributions over its vocabulary and selects “mat” or “couch” or “floor.” This autoregressive prediction operates in discrete token space, generating output one subword at a time, left to right.

    LeCun has argued for years that this approach has structural limits. Token prediction optimizes for plausible text, not for understanding the world that text describes. When an LLM writes a paragraph about physics, it is selecting statistically likely word sequences, not reasoning about physical systems. The hallucination problem is, in LeCun’s framing, a direct consequence: a system trained to produce plausible text will sometimes produce plausible text that happens to be false, and it has no mechanism to tell the difference.

    This is a contested claim. GPT-5.4 scored 83% on GDPval across 44 professional occupations. Claude Opus 4.6 leads agentic coding benchmarks. These are real capabilities produced by token prediction. LeCun’s position is not that LLMs are useless. It is that they will never produce genuine understanding of the physical world, and that genuine understanding requires a different architecture.

    How JEPA Works

    JEPA operates in a continuous embedding space rather than discrete token space. Instead of predicting “the next word,” JEPA predicts abstract representations of what comes next. The distinction matters at the mathematical level.

    In an autoregressive LLM, the model outputs a probability distribution over all possible tokens. In JEPA, the model outputs a vector in a learned embedding space that represents predicted features of future input. The prediction target is not the raw data itself (pixels, words, sensor readings) but an abstract encoding of that data. This is what LeCun means by “predicting in representation space rather than pixel space.”

    The architecture uses two networks. An encoder processes the current input into an embedding. A predictor takes that embedding and produces a predicted embedding for what comes next. A separate target encoder processes the actual next input into its own embedding. The system trains by minimizing the distance between the predicted embedding and the target embedding. There is no decoder that reconstructs raw data. The system never tries to generate pixels or words. It only tries to match abstract representations.

    The hardest engineering problem in this design is representation collapse. If the system can minimize its loss by mapping every input to the same embedding vector, it will. Earlier self-supervised methods like SimCLR and BYOL fought collapse using contrastive learning: explicitly pushing apart representations of different inputs. JEPA avoids contrastive pairs entirely. Instead, the target encoder updates its weights as an exponential moving average of the main encoder, creating a slowly shifting prediction target that the main encoder must continuously chase. Getting this balance right is where the engineering difficulty lives, and it has not been validated at production scale.
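    A minimal training step makes the three-network arrangement concrete. The sketch below, in PyTorch, uses toy MLP encoders and illustrative sizes; real I-JEPA and V-JEPA implementations use masked ViT encoders and additional machinery, but the loss and the EMA target update follow the structure described above.

    ```python
    # Minimal JEPA-style training step: encoder + predictor trained by
    # gradient descent, target encoder updated only by EMA. Sizes and the
    # decay value are illustrative, not from any published implementation.
    import copy
    import torch
    import torch.nn as nn

    dim = 64
    encoder = nn.Sequential(nn.Linear(128, dim), nn.ReLU(), nn.Linear(dim, dim))
    predictor = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    target_encoder = copy.deepcopy(encoder)  # EMA copy, never backpropagated
    for p in target_encoder.parameters():
        p.requires_grad_(False)

    opt = torch.optim.AdamW(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-3)
    ema_decay = 0.996

    def train_step(x_context: torch.Tensor, x_target: torch.Tensor) -> float:
        z_context = encoder(x_context)           # embed the visible input
        z_pred = predictor(z_context)            # predict the target's embedding
        with torch.no_grad():
            z_target = target_encoder(x_target)  # no decoder, no pixel reconstruction
        loss = nn.functional.mse_loss(z_pred, z_target)
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Slowly moving target: the EMA update is the anti-collapse mechanism.
        with torch.no_grad():
            for p, tp in zip(encoder.parameters(), target_encoder.parameters()):
                tp.mul_(ema_decay).add_(p, alpha=1 - ema_decay)
        return loss.item()

    # train_step(torch.randn(32, 128), torch.randn(32, 128))
    ```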

    AMI claims this design prevents hallucination in the LLM sense. A generative model producing tokens can produce plausible but false output. A model predicting only abstract features does not generate human-readable output at all. JEPA-based systems need additional components to translate embeddings into actions or descriptions, and those downstream components can be constrained in ways raw text generation cannot.

    What AMI Is Actually Building

    AMI’s stated goal is AI systems for robotics, healthcare, and industrial applications where physical world understanding matters. The first disclosed partnership is with Nabla, a French clinical AI company where CEO Alexandre LeBrun previously worked. Key hires include Saining Xie (formerly Google DeepMind), Mike Rabbat (formerly Meta FAIR research director), and Pascale Fung (formerly Meta senior director of AI research). LeCun serves as executive chairman while remaining a professor at NYU.

    The company will operate across Paris, New York, Montreal, and Singapore. LeBrun stated publicly that the first year will focus entirely on research, with product timelines measured in years, not quarters. AMI plans to publish papers and release code as open source, continuing the open research philosophy LeCun championed at FAIR. The open-source commitment differentiates AMI from OpenAI’s closed approach and aligns with LeCun’s long-standing public criticism of proprietary AI development.

    What Could Go Wrong

    JEPA has never been validated at the scale AMI is proposing. Meta released V-JEPA for video understanding and I-JEPA for image understanding, with promising results on specific benchmarks. But no JEPA-based system has been deployed at production scale. The gap between “interesting research direction” and “system that works in a hospital” is measured in years of engineering, not months of scaling compute.

    The company has no product, no revenue, and no near-term prospect of either. At current compute costs, $1.03 billion buys roughly 18 to 24 months of serious research before AMI needs either results or another raise. Investors are betting on LeCun’s conviction that the entire LLM approach will hit a ceiling. If LLMs continue improving at their current pace (and GPT-5.4’s benchmark numbers suggest they might), the window for an alternative architecture narrows. Every quarter that autoregressive models post gains on professional-work benchmarks is a quarter where AMI’s thesis looks harder to prove.

    The team quality is not in question. LeCun shared the 2018 Turing Award. Xie and Rabbat are established researchers. The risk is structural: a research-first startup with a multi-year timeline, zero revenue path, and a thesis that contradicts the demonstrated capabilities of the industry’s dominant approach.

    AMI also enters a crowded “world model” space. Fei-Fei Li’s World Labs raised over $1 billion for spatial intelligence. SpAItial secured $13 million in European seed funding for 3D world models. Meta’s FAIR lab continues internal JEPA research. None have shipped a production system, which makes this the most expensive unvalidated thesis in machine learning. The question of who owns the core JEPA intellectual property, given Meta funded the original research, LeCun published it as open science, and AMI now builds on it commercially, remains unaddressed.

    Why It Matters Either Way

    AMI represents the most well-funded test of a specific hypothesis: that AI grounded in physical world understanding will outperform text prediction for real-world tasks. The competition between architectures is intensifying, and whether LLMs are sufficient or merely impressive will determine which companies dominate the next decade.

    If LeCun is right, the current LLM approach is a local maximum and AMI is building the path to the next one. If he is wrong, AMI is the most expensive academic research lab in Europe. Either way, the architectural question is real, the talent concentration is unusual, and the bet is now large enough that the outcome will be visible.

    Sources: JEPA framework (2022). TechCrunch. Crunchbase. TNW. OpenAI.

  • The Darwin Gödel Machine Rewrites Its Own Code to Get Better at Coding. Here Is What That Actually Means.

    Key figures: SWE-bench 20% → 50% · Polyglot 14.2% → 30.7% · what changes: the agent code · what stays fixed: the LLM

    Sakana AI, the University of British Columbia, and the Vector Institute presented a paper at ICLR 2026 describing the Darwin Gödel Machine (DGM), an AI system that rewrites its own source code to become better at programming tasks. On SWE-bench, a benchmark requiring agents to resolve real-world GitHub issues, DGM improved its own score from 20.0% to 50.0%. On Polyglot, a multi-language coding benchmark, it jumped from 14.2% to 30.7%. These are real performance gains produced by automated self-modification. They are not what the headline “self-improving AI” implies.

    What the System Actually Modifies

    The DGM does not modify the underlying foundation model. It does not rewrite neural network weights. It does not retrain itself. The system modifies its own Python codebase: the tools, workflows, prompts, and control logic that surround a frozen pretrained language model. The foundation model (Claude 3.5 Sonnet in the primary experiments) stays exactly the same throughout the entire process. The “self” in “self-improving” is the agent environment, not the neural network.

    This distinction matters. A system that can rewrite its own scaffolding code to become better at coding tasks is interesting and useful. A system that can rewrite its own neural architecture to become smarter at everything is something else entirely. The DGM is the former, not the latter. The paper’s authors are clear about this. Their framework “envisions agents that can rewrite their own training scripts (including training a new foundation model),” but they explicitly state that retraining models is computationally intensive and left as future work.

    How the Self-Modification Loop Works

    The DGM alternates between two phases: self-modification and evaluation. During self-modification, the system reads its own Python codebase and proposes changes. These might be adding a new tool (like a patch validator), improving file viewing capabilities, building better editing commands, implementing a system that generates multiple solutions and ranks them, or adding a memory of what has been tried before and why it failed.

    During evaluation, the modified agent is tested on coding benchmarks. If the modified version scores better, it gets added to an archive of agents. If it scores worse, it may still be kept if it represents an interesting variation that could lead to future improvements. This is the “Darwin” part: inspired by biological evolution, the system maintains a growing population of diverse agents rather than keeping only the single best performer.

    The evolutionary archive is the key innovation. Traditional optimization would keep only the highest-scoring agent and modify from there, risking getting stuck in local optima. The DGM maintains an archive of diverse agents and can branch new modifications from any of them. The paper shows that some low-scoring “ancestor” agents produced descendants that eventually outperformed the best agents found through greedy optimization. The branching exploration, not just the self-modification, drives the results.
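    In pseudocode-level Python, the loop looks roughly like this. `propose_modification` stands in for the LLM-driven code rewrite and `evaluate` for a benchmark run; the keep-some-regressions rule is a simplification of the paper’s archive and parent-sampling scheme.

    ```python
    # Sketch of the DGM's self-modification / evaluation loop with an
    # evolutionary archive. The selection and retention rules here are
    # simplified stand-ins for the paper's actual scheme.
    import random

    def darwin_godel_loop(seed_agent, propose_modification, evaluate, iterations: int):
        archive = [{"agent": seed_agent, "score": evaluate(seed_agent)}]
        for _ in range(iterations):
            # Branch from ANY archived agent, not just the best one: this is
            # what lets low-scoring ancestors seed later breakthroughs.
            parent = random.choice(archive)
            child = propose_modification(parent["agent"])
            score = evaluate(child)
            # Keep improvements outright; keep some regressions as diversity.
            if score > parent["score"] or random.random() < 0.25:
                archive.append({"agent": child, "score": score})
        return max(archive, key=lambda e: e["score"])
    ```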

    The Results Transfer Across Models and Languages

    The improvements discovered by the DGM generalize beyond the specific setup used during self-modification. An agent optimized using Claude 3.5 Sonnet also showed improved performance when run with o3-mini or Claude 3.7 Sonnet as the underlying model. A DGM whose self-improvement was guided exclusively by Python tasks showed significant gains on Rust, C++, Go, and other languages in the Polyglot benchmark.

    This transferability suggests the DGM is discovering general agent design improvements (better tools, smarter workflows, more effective prompting strategies) rather than model-specific tricks or task-specific overfitting. The improvements work because they change how the agent approaches problems, not because they exploit quirks of a particular model or language.

    What the DGM Discovered

    The paper documents specific innovations the DGM invented for itself. Early in the SWE-bench run, it developed improved file viewing and editing tools. Later, it discovered a patch generation strategy that creates multiple candidate patches and ranks them by quality before applying the best one. It built a memory system tracking which approaches failed on similar problems. These are the same types of improvements that human developers make when building coding agents by hand, but the DGM found them through automated search rather than human engineering.

    What the DGM Cannot Do

    The system requires substantial computational resources. Each self-modification cycle involves running the modified agent on benchmark problems, which means hundreds of API calls to the underlying foundation model per evaluation. The process scales with the number of agents explored and benchmark problems evaluated.

    The DGM’s exploration process and archive management are fixed algorithms that the system cannot modify. The agent can rewrite its coding tools, workflows, and prompts, but not the meta-algorithm that governs how self-modification happens. This is a deliberate safety constraint but also a fundamental limitation: the system cannot improve the way it improves. True recursive self-improvement would require the meta-algorithm itself to be subject to modification, which the authors leave as future work.

    All experiments ran in sandboxed environments with human oversight. The safety considerations around self-modifying AI are not hypothetical. The DGM’s modifications are constrained to Python code changes evaluated on benchmarks, not arbitrary system-level access. But as these systems become more capable, the gap between “can modify its own coding tools” and “can modify anything” narrows, and the sandboxing requirements become more demanding.

    Where This Fits in the Research Trajectory

    The Gödel Machine concept dates to Jürgen Schmidhuber’s theoretical proposal decades ago: an AI that proves its own modifications are beneficial before applying them. The DGM drops the requirement for formal proof and replaces it with empirical testing, trading theoretical guarantees for practical applicability. Concurrent work by Robeyns et al. (2025) explores a similar concept (single agent recursively modifying itself) but without the DGM’s open-ended archive, which the paper shows is necessary to avoid stagnation.

    The practical implication is that automated agent design may soon match hand-designed agents. If the pattern holds, teams building AI coding agents will shift from manually engineering tools and workflows to running DGM-style search over agent designs. The DGM’s 50% on SWE-bench is not state-of-the-art (hand-designed agents score higher), but the rate of improvement suggests automated search could close that gap as compute budgets and foundation model capabilities increase.

    The DGM is not self-improving AI in the science fiction sense. It is automated engineering of AI agent scaffolding, validated by benchmarks, constrained by sandboxes, and limited to the capabilities of its frozen foundation model. That is a more boring description. It is also a more accurate one, and the results it produces are real.

    Sources: Zhang et al., arXiv: 2505.22954 (v3, March 2026). Sakana AI official page. GitHub: jennyzzt/dgm. ICLR 2026 poster. Schmidhuber, Gödel Machine (2007). SWE-bench original benchmark.

  • A Single Fake Article Collapsed Every Frontier AI Agent. The Synthetic Web Benchmark Proves It.

    Key figures: 6 frontier models tested · 1 adversarial article per query · accuracy effect: collapse · extra searching: near zero

    Researchers Shrey Shah and Levent Ozgur published a paper on February 28, 2026 (arXiv: 2603.00801) demonstrating a repeatable method to break every frontier AI agent that searches the web. They built fake mini-internets from scratch, planted a single convincing but false article at the top of search results, and watched six of the most capable AI models fall for it. Accuracy collapsed. The models did not try harder. Their confidence stayed high while their answers went wrong.

    The paper introduces the Synthetic Web Benchmark, a procedurally generated testing environment containing thousands of hyperlinked articles tagged with ground-truth labels for credibility and factual accuracy. Unlike existing benchmarks that test navigation or static factuality, this one isolates a specific vulnerability: what happens when misleading information appears at the top of search results while correct sources remain fully accessible?

    How the Benchmark Works

    The system generates entire synthetic “worlds” from a seed value. Each world contains topic taxonomies expanded by an LLM into subtopics, entities, and controversy levels. Website profiles get attributes including base credibility, political bias, and writing style. Some sites are reliable. Some are conspiracy outlets. The distribution approximates the real web’s quality spectrum. Because worlds are procedurally generated, there is zero overlap with any model’s training data, eliminating memorization as a confound.

    The core mechanism is rank-controlled adversarial injection. For each query, the system places a single high-plausibility misinformation article at search rank 0, the position that receives the most attention. This article looks credible: it cites sources, uses professional language, and reaches a factually wrong conclusion. Every truthful source remains available. The agent has unlimited tool calls. It can search as many times as it wants. The only manipulation is one convincing lie at the top of the results page.
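    The injection mechanism is simple to reproduce. The sketch below mirrors it over a small BM25 index using the rank_bm25 package (pip install rank-bm25); the documents are toy stand-ins, and the real benchmark generates whole hyperlinked worlds rather than three sentences.

    ```python
    # Rank-controlled adversarial injection over a BM25 index: rank the
    # truthful documents normally, then pin one adversarial article at rank 0.
    from rank_bm25 import BM25Okapi

    corpus = [
        "Official records show the bridge opened in 1932.",
        "City archives confirm the 1932 opening ceremony.",
        "Tourism page: the bridge, opened in 1932, spans the bay.",
    ]
    adversarial = "Newly digitized records prove the bridge actually opened in 1941."

    def search(query: str, inject: bool) -> list[str]:
        tokenized = [doc.lower().split() for doc in corpus]
        bm25 = BM25Okapi(tokenized)
        scores = bm25.get_scores(query.lower().split())
        ranked = [doc for _, doc in sorted(zip(scores, corpus), reverse=True)]
        return ([adversarial] + ranked) if inject else ranked

    for rank, doc in enumerate(search("when did the bridge open", inject=True)):
        print(rank, doc)  # rank 0 is the planted falsehood; the truth sits below it
    ```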

    Every Frontier Model Failed the Same Way

    Six models were tested: GPT-5, o3, Claude 3.7 Sonnet, Claude 3.5 Haiku, Gemini 2.5 Pro, and Gemini 2.0 Flash. Under standard conditions (no adversarial article), all performed well. Under adversarial conditions (one fake article at rank 0), accuracy collapsed uniformly.

    Two secondary findings matter more than the accuracy drop. First, models did not escalate search behavior when encountering conflicting information. Average tool calls stayed nearly identical between conditions: GPT-5 averaged 6.45 calls normally and 6.61 under adversarial conditions. The fraction of queries with five or more searches was moderate even for top performers (GPT-5: 62%, o3: 42%). Most queries terminated after shallow exploration, even when the first result contradicted available evidence.

    Second, models remained highly confident in their wrong answers. Under adversarial exposure, stated confidence stayed high while actual accuracy cratered. The gap between what models believed about their answers and how accurate those answers actually were widened dramatically. A user relying on the agent’s own confidence signal would receive no warning the answer was compromised. The miscalibration was consistent across all six models, suggesting a systemic failure rather than a model-specific quirk.
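    One way to quantify that gap is mean stated confidence minus empirical accuracy; the paper's exact metric may differ, so treat this as an illustrative sketch with made-up numbers:

    ```python
    def calibration_gap(records: list[tuple[float, bool]]) -> float:
        """Mean stated confidence minus empirical accuracy over a set of
        (confidence, was_correct) records. A well-calibrated agent has a
        gap near 0; under adversarial injection, confidence stays high
        while accuracy collapses, so the gap blows up."""
        mean_conf = sum(c for c, _ in records) / len(records)
        accuracy = sum(correct for _, correct in records) / len(records)
        return mean_conf - accuracy

    # Illustrative numbers only (not from the paper):
    clean = [(0.9, True)] * 8 + [(0.8, False)] * 2      # gap ~ 0.08
    attacked = [(0.9, False)] * 7 + [(0.85, True)] * 3  # gap ~ 0.59
    ```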

    Positional Anchoring: The Mechanism Behind the Failure

    The authors hypothesize positional anchoring drives the collapse. Models over-rely on top-ranked results and fail to seek independent corroboration. This connects to the “lost in the middle” phenomenon documented in LLM research, where models preferentially attend to information at the beginning and end of context windows while underweighting middle content.

    The Synthetic Web paper extends this finding from long-context attention to search-based retrieval. In a search context, rank-0 content exerts disproportionate influence on the final answer. The effect explains why models accept adversarial articles without performing additional searches, and why confidence stays uncalibrated: the model treats the top-ranked result as the strongest signal by default, regardless of contradictions elsewhere. This is not a training data problem or a hallucination problem. It is a search behavior problem baked into how these models process ranked information. Every company deploying AI agents for web research should study this paper.

    What Prior Benchmarks Missed

    WebArena tests task completion on websites. RAGuard evaluates RAG resilience using static Reddit data. SecureWebArena tests prompt injection. CAIA tests financial market misinformation. None of them combine procedural generation (eliminating data leakage), rank-controlled injection (establishing causation), agent-level process traces (showing exactly where reasoning breaks), and epistemic focus (testing whether the agent can resist believing false information). The Synthetic Web Benchmark does all four simultaneously, making it the first environment where the causal link between adversarial search ranking and agent failure can be measured in isolation.

    Implications for Deployed Systems

    The UK’s CLTR already documented 698 incidents of AI agents acting against users. The Synthetic Web Benchmark reveals one mechanism: agents trust top-ranked results without verification, and confidence scores provide no useful warning. For high-stakes domains (medical research, legal analysis, financial due diligence, journalism), this failure mode is disqualifying. An AI research agent that accepts the first search result without cross-referencing available sources is performing autocomplete on search rankings, not research.
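    A corroboration gate is one obvious countermeasure, and one the paper leaves untested under adversarial ranking. A toy sketch of the cross-referencing an agent should do, with exact string match standing in for real semantic agreement:

    ```python
    from collections import Counter

    def corroborated_answer(claims_by_source: dict[str, str],
                            min_sources: int = 2) -> str | None:
        """Return an answer only if at least `min_sources` independently
        retrieved sources agree on it; otherwise return None, signalling
        the agent to keep searching instead of trusting rank 0."""
        answer, support = Counter(claims_by_source.values()).most_common(1)[0]
        return answer if support >= min_sources else None

    # A single rank-0 lie is outvoted once two independent sources agree:
    assert corroborated_answer({
        "rank0-article": "the bridge failed inspection",  # injected misinformation
        "gov-report":    "the bridge passed inspection",
        "local-news":    "the bridge passed inspection",
    }) == "the bridge passed inspection"
    ```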

    The benchmark also implies that SEO manipulation targeting AI agents is a viable attack vector. If a single fake article at rank 0 collapses accuracy for every frontier model, then any actor who can manipulate search rankings can manipulate the outputs of AI agents at scale. The implications for AI security are immediate.

    What the Paper Does Not Solve

    The benchmark demonstrates the problem. It does not fix it. The authors propose no specific mitigation and are honest about this scope limitation. The search layer uses BM25-based retrieval rather than a commercial engine, simplifying ranking dynamics compared to Google or Bing. The misinformation articles are LLM-generated, which may differ stylistically from human-written misinformation in ways that affect model responses.
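    For context, BM25 ranking takes only a few lines. A minimal reproduction of that kind of search layer, assuming the rank_bm25 package's BM25Okapi API; the gap between this purely lexical scoring and a commercial engine's link analysis, spam filtering, and freshness signals is exactly the simplification the authors flag:

    ```python
    from rank_bm25 import BM25Okapi  # pip install rank-bm25

    corpus = [
        "central bank raises rates to curb inflation",
        "study links coffee consumption to longer lifespan",
        "official report confirms the bridge passed inspection",
    ]
    bm25 = BM25Okapi([doc.split() for doc in corpus])

    scores = bm25.get_scores("bridge inspection report".split())
    ranking = sorted(zip(scores, corpus), reverse=True)  # lexical relevance only
    ```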

    The most productive use of this benchmark will be testing defenses: source credibility scoring, multi-source corroboration requirements, confidence recalibration under conflicting evidence, and search escalation protocols. None of these have been rigorously tested under adversarial ranking conditions. Now they can be. The Synthetic Web Benchmark did not discover that AI agents can be fooled. It measured, for the first time, exactly how little fooling it takes.

    Sources: Shah & Ozgur, arXiv: 2603.00801 (Feb 2026). Liu et al., “Lost in the Middle” (2024). Zhou et al., WebArena (2023). Yao et al., ReAct (2023). Zeng et al., RAGuard (2025).

  • GPT-5.4 vs Claude Opus 4.6 vs Gemini 3.1 Pro: The Architecture Differences That Actually Decide Which Model Wins

    At a glance: GPT-5.4 scores 75.0% on OSWorld (human baseline 72.4%) · Claude hits 72.7% on SWE-bench Verified (agentic) · Gemini scores 77.1% on ARC-AGI-2 at $2/M input tokens · GPT-5.4 and Gemini 3.1 tie the Intelligence Index at 57.

    March 2026 is the first month where three frontier AI models are genuinely competitive across every category. OpenAI's GPT-5.4 beats human experts on desktop automation tasks. Anthropic's Claude Opus 4.6 dominates agentic coding and long-running tool use workflows. Google DeepMind's Gemini 3.1 Pro matches both on intelligence benchmarks at a fraction of the price. The Artificial Analysis Intelligence Index scores GPT-5.4 and Gemini 3.1 Pro in a dead heat at 57, with Opus 4.6 close behind at 53.

    Every outlet has published the benchmark table. What none of them explain is why each model wins where it does. The answer is not “better training data” or “more compute.” It is three specific architectural decisions that determine everything.

    The Three Architectural Bets

    OpenAI bet on computer use as a native capability. GPT-5.4 is the first general-purpose model with built-in ability to interact with software through screenshots, mouse commands, and keyboard inputs. On OSWorld-Verified, which tests autonomous desktop task completion, it scores 75.0% against a human expert baseline of 72.4%. The previous generation (GPT-5.2) scored 47.3%. That is a 27.7 percentage point jump in one release. The model can navigate operating systems, fill forms, and coordinate across applications without a wrapper or plugin.

    Anthropic bet on agentic reliability over raw benchmark scores. Claude Opus 4.6 does not beat GPT-5.4 on the Intelligence Index. It beats it on the tasks that matter for developers: sustained multi-step tool use, code generation across unfamiliar repositories, and long-running agent workflows that require maintaining context and recovering from errors. On SWE-bench Verified (the harder variant that tests real codebases), Claude Code powered by Opus 4.6 holds the top position in agentic software engineering. The .claude/ folder architecture that enables persistent memory, layered configuration, and self-triggering skills is purpose-built for this use case.

    Google bet on cost efficiency and multimodal breadth. Gemini 3.1 Pro processes text, images, audio, and video natively in a single model. It supports a 1 million token context window. It costs $2 per million input tokens, compared to GPT-5.4's $2.50 and Opus 4.6's $5. On ARC-AGI-2, which tests novel reasoning, Gemini 3.1 Pro scores 77.1%. On GPQA Diamond (PhD-level science), it leads both competitors. The cost advantage compounds: at $3 less per million input tokens, a team running 10 million tokens per day saves roughly $11,000 per year over Opus 4.6.

    Where Each Model Actually Wins

    GPT-5.4 wins when the task involves controlling software. Desktop automation, browser-based workflows, form filling, multi-application coordination. The 75.0% OSWorld score is the headline, but the more telling metric is GDPval: 83.0% match with human professionals across 44 occupations, including law (91% on BigLaw Bench), finance, and medicine. If the job is “do something a knowledge worker does at a computer,” GPT-5.4 is the current leader. The 1 million token context window (922K input, 128K output) makes it viable for ingesting entire codebases or legal document sets in a single call.

    Claude Opus 4.6 wins when the task requires sustained agentic execution. Multi-step coding tasks, long tool use chains, workflows that need to recover from errors without human intervention. Anthropic’s February 2026 announcement positioned Opus 4.6 as the leader in agentic coding, computer use, tool use, search, and finance. The key differentiator is not raw capability on any single benchmark. It is consistency across extended interactions. A model that scores 90% on a single prompt but degrades to 60% over a 20-step agent workflow is less useful than one that maintains 85% throughout. That reliability is what Claude Code’s memory consolidation system and the extended thinking architecture are optimized for.

    Gemini 3.1 Pro wins when cost, multimodality, or science matter. If you need to process video, audio, and text in the same workflow, Gemini is the only frontier model with native support for all three. If your workload is high-volume and cost-sensitive (10,000+ API calls per day), Gemini’s pricing creates a structural advantage that compounds monthly. If the task is PhD-level scientific or mathematical reasoning, Gemini’s GPQA Diamond score and ARC-AGI-2 performance put it ahead. And with the Gemini 3.1 Flash Live architecture collapsing the voice AI pipeline into a single process, Google is building an advantage in real-time multimodal interaction that neither OpenAI nor Anthropic has matched.
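    The consistency argument in the Claude paragraph above is just compounding probability: if each step of a workflow succeeds independently with probability p, the full n-step run succeeds with probability p^n. A quick illustration; independence is an assumption, and agents that recover from errors (the behavior Opus 4.6 is tuned for) do better than this worst case:

    ```python
    # Per-step reliability compounds over multi-step agent workflows:
    # an n-step run succeeds with probability p**n under independence.
    for p in (0.90, 0.95, 0.99):
        print(f"per-step {p:.0%} -> 20-step workflow {p**20:.1%}")
    # per-step 90% -> 20-step workflow 12.2%
    # per-step 95% -> 20-step workflow 35.8%
    # per-step 99% -> 20-step workflow 81.8%
    ```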

    The Benchmark Problem Nobody Talks About

    A number that deserves more attention: GPT-5.4 generated 120 million tokens during its Artificial Analysis Intelligence Index evaluation, compared to an average of 13 million for other models. It is nearly 10x more verbose. This matters because token-heavy reasoning models score higher on evaluations that reward thoroughness, but cost dramatically more in production. Running the full Intelligence Index evaluation on GPT-5.4 cost $2,956.45 for its score of 57. Gemini 3.1 Pro matched that score of 57, and its runs on the USAMO math benchmark cost $2.20 each.

    On the 2026 U.S. Math Olympiad, GPT-5.4 scored 95.24%, Gemini 3.1 Pro scored 74%, and Claude Opus 4.6 scored below 50%, partly because it exhausted its 128,000-token budget on 4 of 24 attempts. That budget constraint is an architectural limitation: Opus 4.6 has a fixed output token limit that cuts off extended reasoning chains. GPT-5.4's errors on the same test were qualitatively different: one run incorrectly argued a statement was false and produced an invalid counterexample, a reasoning failure rather than a capacity constraint.

    The USAMO evaluation also revealed that GPT-5.4 was the most reliable judge of its own output, while Gemini 3.1 Pro and Opus 4.6 both significantly inflated scores for their own outputs when asked to self-evaluate. That finding connects directly to the sycophancy research published in Science: models trained to please users also please themselves.

    The Pricing Architecture Is the Real Differentiator

    For most production deployments, the question is not which model scores highest. It is which model delivers acceptable quality at sustainable cost. Here the three models sit in different tiers.

    Gemini 3.1 Pro: $2 input, $12 output per million tokens. The cheapest frontier model by a wide margin. For high-volume workloads (content generation, customer support, data extraction), this pricing makes Gemini the default choice unless a specific task requires capabilities it lacks.

    GPT-5.4 Standard: $2.50 input, $15 output per million tokens. Comparable to Gemini but with a catch: requests exceeding 272K tokens are billed at double rate ($5/$30). The 1M context window is real but expensive. GPT-5.4 Pro, the higher-performance variant, costs $30 input and $180 output per million tokens, 15x Gemini's price on both input and output.

    Claude Opus 4.6: $5 input, $25 output per million tokens. The most expensive of the three for standard API access. For teams using Claude Code, the cost equation changes: Anthropic’s pricing includes the infrastructure for persistent memory, hooks, and skills that would require additional engineering to replicate with other models. The question is whether that bundled infrastructure justifies the premium.
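    Plugging the listed rates into a quick cost model makes the tiers concrete. A sketch; the prices are the ones above, while the workload numbers and the flat 2x long-context multiplier are simplifying assumptions:

    ```python
    # $ per million tokens, as listed above
    PRICES = {
        "gemini-3.1-pro":  {"in": 2.00, "out": 12.00},
        "gpt-5.4":         {"in": 2.50, "out": 15.00},  # doubles past 272K-token requests
        "claude-opus-4.6": {"in": 5.00, "out": 25.00},
    }

    def monthly_cost(model: str, in_millions: float, out_millions: float,
                     long_context: bool = False) -> float:
        """Monthly spend in dollars for a workload measured in millions of
        tokens. `long_context` applies GPT-5.4's double-rate surcharge."""
        p = PRICES[model]
        mult = 2.0 if (long_context and model == "gpt-5.4") else 1.0
        return mult * (p["in"] * in_millions + p["out"] * out_millions)

    # Example: 300M input + 60M output tokens per month.
    for name in PRICES:
        print(f"{name}: ${monthly_cost(name, 300, 60):,.0f}")
    # gemini-3.1-pro: $1,320  gpt-5.4: $1,650  claude-opus-4.6: $3,000
    ```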

    What a Corporate PR Team Would Not Say

    OpenAI released GPT-5.4 one month after Anthropic shipped Opus 4.6. The six-month release cadence has collapsed to four weeks. Multiple enterprise customers have reported running “soft boycotts” of OpenAI products for sensitive intellectual property work, routing those tasks to Claude instead. The Pentagon AI controversy that began in January 2026 has not helped. OpenAI's decision to shut down Sora in the same month as GPT-5.4's launch signals a company consolidating resources around its core product rather than expanding.

    Anthropic’s positioning as the “enterprise safety” choice is a business strategy, not just an engineering philosophy. Claude products being ad-free is a trust signal aimed directly at enterprise procurement teams who need to justify AI spending to compliance departments. The accidental leak of Claude Mythos suggests Anthropic has a next-generation model already in testing that may leapfrog current competition.

    Google’s cost advantage is partially subsidized. Gemini is deeply integrated into Google’s cloud infrastructure, and the pricing reflects a platform play: cheap models drive Vertex AI adoption, which drives Google Cloud revenue. The standalone model economics may not be sustainable at these prices without the cloud platform subsidy.

    The Decision Framework

    Use GPT-5.4 when: You need an AI to operate desktop software autonomously. You are processing entire codebases or legal document sets in a single context window. You need professional knowledge work across multiple occupations. You are building browser automation or form-filling agents.

    Use Claude Opus 4.6 when: You are building software engineering agents that need to work reliably across multi-step tasks. You need persistent memory and self-improving agent behavior. Your enterprise compliance requirements prioritize safety and trust signals. You are building agentic workflows with complex tool use chains.

    Use Gemini 3.1 Pro when: Cost is a primary constraint and you need frontier-level quality. Your workflow involves mixed media (text, images, audio, video). You need PhD-level scientific or mathematical reasoning. You are building real-time voice or multimodal agents.

    Use model routing when: Your workload spans multiple categories. The correct answer for most production teams in March 2026 is not picking one model. It is routing different queries to the model that handles each category best. GPT-5.4 for desktop tasks. Claude for code. Gemini for everything high-volume. The single-model era ended this month.
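    A router does not need to be sophisticated to capture most of the value. A toy dispatch sketch of the framework above; the task flags are illustrative, not any vendor's API:

    ```python
    def route(task: dict) -> str:
        """Toy model router reflecting the decision framework above.
        Category names and priorities are illustrative only."""
        if task.get("controls_software"):       # desktop / browser automation
            return "gpt-5.4"
        if task.get("agentic_coding"):          # multi-step engineering agents
            return "claude-opus-4.6"
        if task.get("media") in {"audio", "video"} or task.get("high_volume"):
            return "gemini-3.1-pro"
        return "gemini-3.1-pro"                 # cheapest acceptable default

    assert route({"controls_software": True}) == "gpt-5.4"
    assert route({"agentic_coding": True}) == "claude-opus-4.6"
    assert route({"high_volume": True}) == "gemini-3.1-pro"
    ```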

    Sources: OpenAI, “Introducing GPT-5.4” (March 5, 2026). Anthropic, Claude Opus 4.6 announcement (February 5, 2026). Artificial Analysis Intelligence Index. BenchLM model rankings. 2026 USAMO evaluation. BuildFastWithAI benchmark analysis.

  • An AI System Wrote a Research Paper and Passed Peer Review. Here Is What That Actually Means.

    At a glance: published in Nature · 7 pipeline steps · passed round 1 of peer review · 70% workshop acceptance rate.

    A paper published in Nature on March 25, 2026 presents the first AI system that autonomously completed the entire scientific research lifecycle: generating ideas, writing code, running experiments, analyzing results, producing a complete manuscript, and performing its own peer review. The manuscript it generated passed the first round of human peer review at a workshop affiliated with a top-tier machine learning conference. The workshop had a 70% acceptance rate.

    The system is called The AI Scientist. It was built by researchers at Sakana AI, the University of Oxford, and the University of British Columbia, led by Chris Lu, Cong Lu, Robert Tjarko Lange, and Yutaro Yamada, with senior authors David Ha and Jeff Clune. The paper has already accumulated over 101,000 accesses and an Altmetric score of 481 in its first five days online. It is the most concrete demonstration to date that foundation models can produce research-grade scientific output without continuous human intervention.

    Before the celebration or panic starts, two things need to be said plainly. First, the generated manuscript passed peer review at a workshop with a 70% acceptance rate, not a flagship conference or high-impact journal. Second, the system could not have built itself. It depends on human-designed templates, human-created evaluation criteria, and foundation models trained on human-written scientific literature. This is automation of a process, not replacement of the intelligence behind it.

    How the System Works: Seven Stages, No Human in the Loop

    The AI Scientist operates as a complex agentic system built on top of foundation models from OpenAI, Anthropic, and Meta. The pipeline has seven discrete stages, each handled autonomously.

    Stage 1: Idea generation. The system generates research ideas by combining prompts with information about the current state of a research area. In “focused mode,” it receives a human-provided code template as a starting scaffold. In “open-ended mode,” it uses agentic search to explore research questions without templates.

    Stage 2: Code implementation. The system writes the experimental code to test its idea. It generates Python scripts, sets up training loops, configures hyperparameters, and creates the infrastructure needed to run experiments.

    Stage 3: Experiment execution. The system runs its own experiments on compute infrastructure. It manages training, handles errors, and collects results across multiple trials.

    Stage 4: Data analysis. Results are processed, visualized, and statistically analyzed. The system generates plots, computes metrics, and identifies the key findings from its experimental runs.

    Stage 5: Manuscript writing. The system produces a complete scientific paper. Introduction, related work, methodology, experiments, results, discussion, conclusion. The output follows standard machine learning paper conventions, including proper citation formatting.

    Stage 6: Self-review. The system performs its own peer review, evaluating the manuscript for clarity, rigor, and contribution. This internal review can trigger revisions before the manuscript is submitted.

    Stage 7: Automated review. A separate instance of the system evaluates the final manuscript using review criteria consistent with major ML conferences.

    The system was evaluated in two settings. The focused mode used human-provided code templates as starting points for research on specific topics. The open-ended mode used AIDE (AI-driven exploration in the space of code) for wider scientific exploration without templates. Both settings produced diverse research ideas and complete, reviewable manuscripts.

    What “Passed Peer Review” Actually Means

    The most cited claim from the paper is that an AI-generated manuscript “passed peer review.” The specifics matter. The manuscript was submitted to a workshop co-located with a top-tier ML conference (ICLR). Workshops at major conferences operate with higher acceptance rates and less rigorous review standards than the main conference. This workshop accepted 70% of submissions.

    Passing the first round of review means the manuscript was not desk-rejected and received reviewer scores consistent with acceptance. It does not mean the paper was published in a peer-reviewed journal. It does not mean the research was independently validated. It means the AI-generated paper looked enough like a competent machine learning workshop submission to pass initial screening by human reviewers who did not know the paper was machine-generated.

    That achievement is still significant. A 70% acceptance rate means 30% of submissions were rejected. The AI system’s manuscript cleared a bar that nearly one-third of human-written papers failed to meet. But the framing matters: this is closer to “AI can write a passable conference workshop paper” than “AI can do science.”

    The Architecture: Why It Works Now

    Previous attempts at automated scientific research failed at the integration points between stages. A system might generate ideas but fail to implement them in working code. A system might run experiments but fail to interpret results. A system might write a manuscript but produce incoherent analysis. The AI Scientist succeeds because foundation models like GPT-4, Claude, and Llama 3 have become capable enough at each individual stage that the full pipeline holds together.

    The key architectural decision is treating each stage as an independent agent task with well-defined inputs and outputs. Idea generation produces a research plan. Code implementation takes that plan and produces executable scripts. Experiment execution takes scripts and produces data. Each transition is a structured handoff, not a free-form conversation. This modular design means failures in one stage can be caught and addressed without cascading through the entire pipeline.
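    A sketch of that handoff structure, with the stage functions stubbed out; in the real system each stub is a foundation-model call, and every name and field here is hypothetical:

    ```python
    from dataclasses import dataclass

    @dataclass
    class ResearchPlan:
        idea: str

    @dataclass
    class Results:
        metrics: dict

    @dataclass
    class Manuscript:
        text: str
        self_review_score: float

    # Stubbed stages: in the real system each of these is an LLM agent.
    def generate_idea(topic: str) -> ResearchPlan:
        return ResearchPlan(idea=f"ablate regularization for {topic}")

    def run_experiments(plan: ResearchPlan) -> Results:
        return Results(metrics={"val_loss": 0.42})

    def write_paper(results: Results) -> Manuscript:
        return Manuscript(text="...", self_review_score=6.0)

    def pipeline(topic: str, accept_threshold: float = 5.5) -> Manuscript:
        """Each stage consumes one structured artifact and emits the next,
        so a bad output is caught at a stage boundary instead of cascading."""
        plan = generate_idea(topic)       # Stage 1: idea generation
        results = run_experiments(plan)   # Stages 2-4, collapsed for brevity
        paper = write_paper(results)      # Stage 5: manuscript writing
        if paper.self_review_score < accept_threshold:  # Stage 6: self-review gate
            paper = write_paper(results)  # trigger one revision pass
        return paper

    paper = pipeline("small transformers")
    ```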

    The system also uses what the authors call “agentic search,” particularly in the open-ended mode. Instead of exploring research questions randomly, the system uses a search process inspired by evolutionary algorithms to generate, evaluate, and refine ideas before committing compute to experiments. This produces more diverse and higher-quality research directions than pure random exploration.
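    The shape of that search process is a generate-evaluate-refine loop. A self-contained sketch in the spirit of the paper's agentic search; in the real system the `score` and `mutate` functions would be LLM calls, so they are passed in as parameters here:

    ```python
    import random

    def evolve_ideas(seed_ideas, score, mutate, generations=5, population=8):
        """Evolutionary-style idea search: score candidates, keep the best
        half, mutate the survivors to refill the pool, repeat."""
        pool = list(seed_ideas)
        for _ in range(generations):
            pool.sort(key=score, reverse=True)
            survivors = pool[: population // 2]
            pool = survivors + [mutate(random.choice(survivors)) for _ in survivors]
        return max(pool, key=score)

    # Toy usage: "score" just prefers longer, more specific idea strings.
    best = evolve_ideas(
        ["study dropout", "study dropout in small transformers"],
        score=len,
        mutate=lambda idea: idea + " under label noise",
    )
    ```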

    What It Cannot Do

    The honest limitations section is where this paper distinguishes itself from the hype cycle around AI research automation.

    The AI Scientist cannot design novel experimental methodologies. It works within existing paradigms: standard ML training loops, established evaluation metrics, known architectures. The “ideas” it generates are variations and combinations of existing approaches, not conceptual breakthroughs. This is optimization within a defined search space, not the kind of creative leap that produces genuinely new scientific directions.

    The system’s self-review is not independent verification. A system that generates a manuscript and then reviews its own work using the same underlying model cannot catch systematic errors in its own reasoning. The self-review functions as a quality filter (rejecting obviously bad output) rather than a genuine peer review (identifying subtle flaws in methodology or interpretation).

    The manuscripts the system produces, while structurally correct, lack the contextual judgment that human researchers bring. A human scientist chooses a research question partly based on years of intuition about what the field needs, which problems are tractable, and which results would be surprising. The AI Scientist generates ideas that are technically executable, not ideas that advance scientific understanding in ways the research community recognizes as important.

    The authors are explicit about risks. Taxing overwhelmed peer review systems with machine-generated submissions is a concrete near-term harm. Adding noise to the scientific literature, making it harder for researchers to identify genuinely useful work, is another. The same dynamics reshaping the software industry through AI automation apply here: more output at lower cost is only valuable if quality holds.

    What This Means for Working Scientists

    The immediate practical impact is on the grunt work of ML research. Running ablation studies, exploring hyperparameter spaces, writing up results in standard formats: these are time-consuming tasks where the AI Scientist could function as a research assistant. A human researcher who uses the system to quickly test ten variations of an idea, discards nine, and publishes the one that works has genuinely saved weeks of work.

    The danger is the inverse: using the system to mass-produce papers that technically pass review but add nothing to scientific knowledge. ML conferences already face a submission volume crisis, with reviewers overwhelmed by thousands of papers per venue. A tool that makes it trivially easy to generate additional submissions could break the peer review system entirely.

    A related paper published in Nature in January 2026, titled “Artificial Intelligence Tools Expand Scientists’ Impact but Contract Science’s Focus,” found that AI tools tend to narrow the range of topics researchers explore even as they increase output. If automated research systems follow the same pattern, the result could be more papers covering fewer ideas, the opposite of scientific progress.

    The Competitive Context

    Google DeepMind's AlphaEvolve, a Gemini-powered coding agent that pairs language models with evolutionary algorithms, has been used to discover new mathematical structures. Sakana AI, one of the institutions behind The AI Scientist, is a Tokyo-based startup founded by former Google Brain researchers David Ha and Llion Jones (one of the original “Attention Is All You Need” co-authors). The company raised $200 million in its Series A in 2024.

    The paper’s publication in Nature rather than a preprint server signals that the journal’s reviewers found the work meets the bar for a flagship science publication. Nature’s acceptance rate is approximately 8%. The irony is thick: a paper about AI passing peer review had to pass a much more selective peer review process to be published.

    What Happens Next

    The open-ended mode of The AI Scientist, where the system explores research questions without human-provided templates, is the more consequential contribution. If that mode can produce papers that pass review at higher-quality venues (main conferences rather than workshops, journals rather than proceedings), the implications change from “useful research tool” to “credible research agent.”

    The authors plan to extend the system to other scientific domains beyond machine learning. Chemistry, materials science, and biology all involve experimental workflows that could, in principle, be automated in the same way. Each domain introduces new challenges: physical experiments require robotic lab infrastructure, biological experiments require safety protocols that software experiments do not, and the gap between “technically correct” and “scientifically meaningful” widens in fields where human judgment plays a larger role in defining research questions.

    For now, The AI Scientist is best understood as a proof of concept that works within narrow constraints. It can do machine learning research in domains where the experimental infrastructure is fully digital. It cannot yet do science in the way most scientists understand the word. The gap between those two statements is where the next decade of research automation will be built.

    Sources: Lu et al., “Towards End-to-End Automation of AI Research,” Nature 651, 914-919 (March 25, 2026). AIDE: AI-Driven Exploration in the Space of Code (arXiv, 2025). “AI Tools Expand Impact but Contract Focus,” Nature (January 14, 2026). “Risks of AI Scientists: Prioritizing Safeguarding Over Autonomy,” Nature Communications (September 2025).