Tag: Automation

  • ToolHijacker Prompt Injection Hijacks LLM Agent Tool Selection 96.7% of the Time. Every Published Defense Failed.

    Researchers presented ToolHijacker at the Network and Distributed System Security Symposium on February 23, 2026 in San Diego. The paper (DOI 10.14722/ndss.2026.230675) describes the first prompt injection attack specifically designed to hijack the tool selection layer of LLM agents. The attacker inserts a single malicious tool document into a tool library. When any legitimate user query arrives, the agent’s two-step retrieval-then-selection pipeline picks the attacker’s tool instead of the correct one 96.7 percent of the time when the target model is GPT-4o and the shadow model used for optimization is Llama-3.3-70B.

    The attacker does not need access to the target LLM, the retriever, the tool library layout, or the top-k setting. This is a no-box attack. The retrieval hit rate on MetaTool is 100 percent, which means the malicious document reaches the candidate set on every query. The authors then tested six published defenses: StruQ, SecAlign, known-answer detection, DataSentinel, perplexity detection, and perplexity windowed detection. Every one failed to stop the attack at a practical rate.

    For an ecosystem where Model Context Protocol passed 97 million monthly SDK installs and tool marketplaces have become the dominant distribution layer for agent capabilities, this is the first empirical proof that tool-selection hijacking is an unsolved problem. Here is how the attack works, why the defenses fail, and what production MCP deployments can actually do about it today.

    How ToolHijacker works

    Authors Jiawen Shi, Zenghui Yuan, and colleagues formulate the attack as an optimization problem with two objectives. The malicious tool document must be retrieved into the candidate set during the retrieval phase, and then it must be selected by the LLM during the selection phase. The document is structured as two concatenated subsequences: a Retrieval-optimized sequence R, and a Selection-optimized sequence S.
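    The two-phase pipeline the attack targets can be sketched as follows. This is a generic retrieval-then-selection skeleton, not any specific agent framework's API; `embed` and `llm_choose` are pluggable stand-ins. The comments mark where R and S each do their work.

```python
from dataclasses import dataclass

@dataclass
class ToolDoc:
    name: str
    description: str  # for a malicious doc, this is R + S concatenated

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def select_tool(query, library, embed, llm_choose, k=3):
    """Two-phase pipeline: embedding retrieval, then LLM selection.

    The R subsequence of a malicious description targets the
    similarity ranking below; the S subsequence targets the
    llm_choose step over the retrieved candidates.
    """
    q = embed(query)
    ranked = sorted(library, key=lambda t: cosine(q, embed(t.description)),
                    reverse=True)
    candidates = ranked[:k]               # R must land the doc in this set
    return llm_choose(query, candidates)  # S must win this comparison
```

    The point of the sketch is the split attack surface: R only has to beat benign descriptions on similarity, and S only has to beat the surviving candidates in the LLM's judgment, which is why the paper optimizes the two subsequences separately.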

    R is optimized to maximize semantic similarity with target task descriptions. The attacker does not have the real task descriptions, so the paper reconstructs them through a shadow framework. The attacker builds a shadow tool library, a shadow retriever, a shadow LLM, and a set of shadow task descriptions drawn from the target domain’s vocabulary. An LLM is then prompted to synthesize R by extracting and combining the core functional elements of the shadow task descriptions. The generated text is not gradient-optimized, which means it looks linguistically natural and evades perplexity-based detection.

    S is optimized to force the shadow LLM to select the malicious tool over benign alternatives, given that R has already caused the document to be retrieved. The paper evaluates two optimization methods. A gradient-based method uses HotFlip to mutate tokens toward maximum selection probability on open-weight shadow LLMs. A gradient-free method uses a Tree-of-Attack search strategy with an attacker LLM proposing candidate modifications iteratively. The gradient-free method works better against closed-source targets like GPT-4o. The gradient-based method works better against open-source targets like Llama-3-8B-Instruct.

    Transferability is the critical property. The authors tested whether a document optimized against one shadow LLM attacks a different target LLM. It does. With Llama-3.3-70B as shadow and GPT-4o as target, the gradient-free variant achieves 96.7 percent attack success rate on MetaTool. With Claude-3.5-Sonnet as target, the success rate is similarly high. Semantic patterns learned by different retrieval models overlap enough that a single crafted R generalizes across architectures.

    The test matrix covered 8 LLMs (Llama-2-7B-chat, Llama-3-8B-Instruct, Llama-3-70B-Instruct, Llama-3.3-70B-Instruct, Claude-3-Haiku, Claude-3.5-Sonnet, GPT-3.5, GPT-4o) and 4 retrievers across MetaTool and ToolBench benchmarks. The attack held across all combinations in the no-box setting.

    Why every tested defense failed

    Prevention-based defenses, StruQ and SecAlign, separate system prompts from user input structurally. They assume the attack surface is the user prompt. ToolHijacker’s malicious content lives inside a tool document that the retriever pulls into context. The document is not user input. The attack routes around both defenses rather than through them.

    Four detection-based defenses were tested. Known-answer detection fails completely, with a 100 percent false negative rate against ToolHijacker. The detection method looks for signatures characteristic of canonical attacks. ToolHijacker’s shadow-framework approach produces documents that do not match any known-answer pattern. DataSentinel catches some malicious documents but misses the majority. Perplexity detection and perplexity windowed detection work better against gradient-based optimization because gradient descent on discrete tokens produces lower-fluency text. Both fail against the gradient-free variant, which uses an LLM to synthesize fluent natural-language attacks.
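    Windowed perplexity detection of the gradient-based variant can be sketched as a sliding-window screen over token log-probabilities. The scoring function here is a pluggable stand-in, not any specific detector from the paper, and the threshold is illustrative.

```python
import math

def windowed_perplexity(tokens, logprob, window=8):
    """Worst-window perplexity over a token sequence.

    logprob(token, prefix) -> log p(token | prefix) under some LM;
    here it is a pluggable callable, not a specific model.
    Gradient-optimized gibberish tends to spike the worst window,
    while fluent LLM-written text (the gradient-free variant) does not.
    """
    worst = 0.0
    for i in range(max(1, len(tokens) - window + 1)):
        chunk = tokens[i:i + window]
        nll = -sum(logprob(t, tokens[:i + j])
                   for j, t in enumerate(chunk)) / len(chunk)
        worst = max(worst, math.exp(nll))
    return worst

def flag(tokens, logprob, threshold=100.0):
    # Illustrative cutoff; real deployments calibrate on benign corpora.
    return windowed_perplexity(tokens, logprob) > threshold
```

    The sketch also shows why the defense misses the gradient-free variant: a document synthesized by an LLM scores like ordinary prose in every window, so no threshold separates it from benign tool descriptions.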

    The pattern across all six defenses is a shared structural assumption: the attack surface is the prompt. Every defense was designed before tool-selection attacks were a studied class. ToolHijacker’s attack surface is the tool library itself, a location none of the defenses were built to monitor. The paper’s authors explicitly note that new defense strategies are needed and that the existing ecosystem is insufficient.

    Why this matters for the MCP ecosystem

    Model Context Protocol crossed 97 million monthly SDK downloads in March 2026, sixteen months after Anthropic introduced it. MCP tool servers are distributed through community marketplaces, vendor catalogs, and third-party plugin hubs. A compromised tool document in any reachable MCP server’s manifest can hijack every agent that retrieves it.

    The precedent exists. OpenClaw’s skill marketplace has accumulated 1,184 confirmed malicious packages and 104 CVEs, and the structural problems driving that number are not patchable. North Korea’s Contagious Interview campaign has published 1,700+ malicious packages across five ecosystems, demonstrating that supply-chain injection into developer tooling is an active, ongoing operation. LiteLLM’s March 24 compromise by TeamPCP showed that credential-stealing payloads can ride unpinned dependencies into AI infrastructure.

    ToolHijacker adds a new primitive to this threat model. The prior supply-chain attacks needed credential theft or code execution to monetize. ToolHijacker does not. The agent continues running its workflow. The user continues receiving what looks like legitimate output. Every decision simply routes through attacker-controlled tools, which means an attacker can extract information, poison outputs, or redirect actions without ever triggering a code-execution signal.

    For developers building MCP-native products today, the implication is direct. Tool libraries need provenance verification. Tool documents need content auditing beyond signature checks. The retrieval-then-selection pipeline needs a middleware layer between retrieval and tool execution that cross-checks the selected tool against the expected task category. None of this exists in standard MCP client implementations as of April 2026.

    Practical mitigations available today

    The paper’s authors recommend four measures. First, restrict tool libraries to vetted and cryptographically signed sources, which turns an open marketplace into a closed-gate distribution. Second, monitor tool descriptions for anomalies using ensemble detection that combines multiple signals rather than any single filter. Third, log and audit tool invocation patterns in production and alert on abnormal selection distributions, which catches attacks that succeed in the lab but produce tell-tale behavioral signatures in deployed systems. Fourth, treat any tool library that accepts third-party submissions as untrusted input, regardless of the maintainer’s reputation.
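    The third measure, alerting on abnormal selection distributions, can be sketched as a frequency monitor against a historical baseline. The class name and tolerance value are illustrative, not from the paper.

```python
from collections import Counter

class SelectionAuditor:
    """Flags tools whose selection share spikes above a historical baseline.

    A hijacking document that wins most queries shows up as one tool
    absorbing an abnormal fraction of selections. The tolerance is an
    illustrative operating point, not a recommended value.
    """
    def __init__(self, baseline: dict, tolerance: float = 0.15):
        total = sum(baseline.values())
        self.baseline = {t: c / total for t, c in baseline.items()}
        self.tolerance = tolerance
        self.recent = Counter()

    def record(self, tool: str):
        self.recent[tool] += 1

    def anomalies(self):
        total = sum(self.recent.values())
        out = []
        for tool, count in self.recent.items():
            share = count / total
            if share - self.baseline.get(tool, 0.0) > self.tolerance:
                out.append(tool)
        return out
```

    A tool that has never appeared in the baseline but suddenly absorbs most selections is exactly the signature a successful ToolHijacker document would leave in production logs.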

    Meta’s Agents Rule of Two, published on October 31, 2025, offers the most conservative operational mitigation. No single agent session should combine all three properties simultaneously: access to private data, exposure to untrusted content, and the ability to take externally-observable state-changing actions. ToolHijacker attacks the second property, so the defense is to constrain the first and third. An agent that reads untrusted tool documents should not also have access to user credentials or the ability to send emails. This is coarse but implementable today, and it does not require waiting for a ToolHijacker-specific defense.
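    Expressed as a gate, the Rule of Two reduces to a three-flag check run before a session is allowed to start. The session field names below are assumptions for illustration, not Meta's schema.

```python
def rule_of_two_ok(session: dict) -> bool:
    """Meta's Agents Rule of Two as a gate: a session may hold at most
    two of the three risky properties. Field names are illustrative.
    """
    risky = [
        session.get("private_data_access", False),
        session.get("untrusted_content", False),
        session.get("state_changing_actions", False),
    ]
    return sum(risky) <= 2
```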

    For production systems that cannot avoid combining all three properties, a second-pass verification layer is feasible. After the LLM selects a tool, a separate check compares the selected tool’s category and parameters against the expected task category. If the user asked to summarize an email and the selected tool is a file-write operation, block the call and log the anomaly. This does not solve the problem but it catches the most obvious attacks.
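    A minimal sketch of that second-pass check, with a hypothetical task-to-category allowlist; the task names, categories, and helper names are invented for illustration.

```python
# Hypothetical task -> allowed-tool-category map; real deployments would
# derive this from their own tool taxonomy.
ALLOWED = {
    "summarize_email": {"read", "summarize"},
    "schedule_meeting": {"calendar", "read"},
}

class ToolSelectionBlocked(Exception):
    pass

def verify_selection(task_category, tool, audit_log):
    """Second-pass check between LLM tool selection and execution."""
    if tool["category"] not in ALLOWED.get(task_category, set()):
        audit_log.append((task_category, tool["name"]))
        raise ToolSelectionBlocked(
            f"{tool['name']} ({tool['category']}) not allowed "
            f"for {task_category}"
        )
    return tool
```

    The email-summary example from the text maps directly: a file-write tool selected for a summarize task falls outside the allowlist, so the call is blocked and logged instead of executed.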

    What this means for agent marketplace governance

    The structural assumption underlying MCP, OpenClaw’s skill registry, and every tool-hub distribution model is that tool authors are identifiable and that malicious tools can be removed when discovered. ToolHijacker breaks both halves of that assumption. A malicious tool document can be crafted by an attacker who never publishes a tool through normal channels. It can be slipped into a legitimate repository by compromising any contributor account. And because the attack signal is semantic (the document reads like a useful tool description), static scanning of package contents does not flag it.

    Marketplace operators have three options. First, require cryptographic signing by identity-verified tool authors, which raises the attacker’s cost but does not stop insider attacks. Second, implement runtime selection auditing that compares tool selection patterns across users and flags outliers, which catches attacks in production but does not prevent first-use impact. Third, move from open marketplaces to curated catalogs with human review on every submission, which trades ecosystem velocity for security. None of these are trivial to implement. All of them are likely to be mandated by enterprise customers within twelve months.

    Limitations the paper acknowledges

    Evaluation ran on MetaTool and ToolBench benchmarks, not on production MCP deployments. Real-world tool curation, rate limiting, and output validation may reduce attack success in ways the paper does not measure. The shadow-framework reconstruction requires some knowledge of the target domain’s task description distribution, so attacks on narrow, proprietary, or highly-specialized agent workflows may be harder to craft than attacks on general-purpose agents.

    Adaptive targets that retrain regularly or rotate tool libraries may exhibit different vulnerability profiles. The paper does not test ToolHijacker against models equipped with activation-level defenses. Concurrent research, including architecture-level isolation approaches similar to Apple’s Private Cloud Compute, may offer mitigation paths the paper does not address.

    What happens next

    The NDSS 2026 publication will push tool-selection security onto the OWASP LLM Top 10 in the 2026 or 2027 revision. Concurrent work signals a research pivot from prompt-level attacks to tool-level attacks. Faghih et al. 2025 showed that suffix appending to tool descriptions is enough to bias selection. Beurer-Kellner and Fischer 2025 demonstrated that MCP tool descriptions can influence other tools’ behavior through cross-tool prompt injection. The Log-To-Leak paper published on OpenReview in October 2025 demonstrated covert data exfiltration through tool invocation decisions, even when the agent’s output looks normal. The Synthetic Web Benchmark showed that a single adversarial document can collapse frontier AI agent accuracy to zero, and tool hijacking is the logical next step from document hijacking.

    The defensive gap will close. Activation-level detection, verified tool registries, and tool-behavior attestation are all plausible research directions. But closing the gap will take months, and the research-to-production lag for security tooling in AI infrastructure is historically 12 to 24 months. In the meantime, every MCP-native agent product shipping today operates with a class of vulnerability that no major vendor has a deployed countermeasure against. The question is not whether ToolHijacker-style attacks will appear in the wild. The question is how quickly the first documented production incident surfaces, and which MCP marketplace is the vector.

  • GLM-5.1 Ran Autonomously for 8 Hours Across 6,000 Tool Calls. How It Beat Claude Opus 4.6 on SWE-Bench Pro and Lost on Verified.

    Z.ai released GLM-5.1 open-source under the MIT license on April 7, 2026. The 744-billion parameter Mixture-of-Experts model scored 58.4 on SWE-Bench Pro, beating Anthropic’s Claude Opus 4.6 at 57.3 and OpenAI’s GPT-5.4 at 57.7. On a separate test, it ran 655 iterations of autonomous optimization against VectorDBBench, executed more than 6,000 tool calls without human intervention, and finished at 21,500 queries per second. That number is six times the best single-session result from any other model, Claude Opus 4.6 included.

    On SWE-Bench Verified, the older and more widely cited coding benchmark, GLM-5.1 scored 77.8. Claude Sonnet 4.6 scored 79.6. Claude Opus 4.6 scored 80.8. Same model, opposite ranking. The contradiction is not a bug in either benchmark. It is a feature of how Z.ai optimized the post-training pipeline and a warning that leaderboard numbers in April 2026 depend almost entirely on which test rig you pick.

    Here is what actually happened during the 8-hour autonomous run, why the two benchmarks disagree, and what developers should do about it.

    The 8-hour autonomous run

    VectorDBBench is one of the stress tests Z.ai built into its GLM-5.1 evaluation suite. The methodology is specific. The model receives a Rust skeleton for a vector database and empty implementation stubs. It then uses tool-call-based agents to edit code, compile, run benchmarks, and profile the results. Each iteration represents one autonomous cycle of decision, action, and observation.

    GLM-5, the base model released on February 11, 2026, plateaued at 3,547 queries per second. GLM-5.1 kept going.

    At iteration 90, the model autonomously shifted strategy. It moved from full-corpus scanning to IVF cluster probing with f16 vector compression. That single decision reduced per-vector bandwidth from 512 bytes to 256 bytes and jumped performance to 6,400 QPS. At iteration 240, the model introduced a two-stage pipeline of u8 prescoring and f16 reranking, reaching 13,400 QPS. By iteration 655, the system had settled at 21,500 QPS. Every optimization was independently audited to confirm it worked on arbitrary new inputs and did not exploit benchmark-specific quirks.
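    The iteration-240 design can be illustrated with a small two-stage search: coarse integer prescoring over quantized codes, then an exact rerank on the survivors. This is a pure-Python sketch of the general technique, not Z.ai's Rust implementation.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def two_stage_search(query, corpus_f16, corpus_u8, scale, top_m=4, top_k=2):
    """u8 prescore + higher-precision rerank, mirroring the pattern.

    corpus_u8 holds coarse integer codes (1 byte/dim on real hardware,
    halving scan bandwidth versus f16); only the top_m prescore
    survivors are rescored against the full-precision vectors.
    Returns indices of the top_k results.
    """
    q_u8 = [round(x / scale) for x in query]
    # Stage 1: cheap integer dot products over the whole corpus.
    coarse = sorted(range(len(corpus_u8)),
                    key=lambda i: -dot(corpus_u8[i], q_u8))
    survivors = coarse[:top_m]
    # Stage 2: precise rerank on the survivors only.
    fine = sorted(survivors, key=lambda i: -dot(corpus_f16[i], query))
    return fine[:top_k]
```

    The bandwidth win comes from stage 1 touching every vector at low precision while stage 2 touches only a fixed small set, which is why the recorded jump to 13,400 QPS came from this restructuring rather than from faster arithmetic alone.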

    KernelBench tells the same story in a different domain. GLM-5.1 delivered a 3.6x geometric mean speedup across 50 GPU kernel problems, continuing to make progress past 1,000 tool-use turns. Claude Opus 4.6 leads this benchmark at 4.2x, but its improvement plateaus earlier. The gap between the two narrows as session length increases. For an 8-hour run, the productive horizon is what matters, and GLM-5.1 extended it further than any previously measured open model. Z.ai’s technical report, “GLM-5.1: Towards Long-Horizon Tasks,” describes the pattern as an autonomous experiment, analyze, and optimize loop in which the model proactively runs benchmarks, identifies bottlenecks, adjusts strategies, and improves iteratively.

    MWW covered the related finding that a single edit-tool change improved 15 LLMs at coding by up to 60 percentage points. The GLM-5.1 result suggests the test environment and the post-training are interacting: the model was optimized for long-horizon stability, and the evaluation measured long-horizon stability.

    Why long-horizon stability is the harder problem

    A 30-second code completion lives or dies on a single forward pass. An 8-hour autonomous run lives or dies on the cumulative probability of not losing the plot across thousands of decisions. The failure modes are different. Short sessions fail on knowledge gaps, hallucination, or tool-call syntax errors. Long sessions fail in three distinct ways. The first is goal drift, where the model forgets the original objective. The second is strategy oscillation, where the model switches between incompatible approaches. The third is error accumulation, where small mistakes compound until the state is unrecoverable.

    Z.ai’s technical report attributes GLM-5.1’s extended horizon to post-training decisions aimed at three targets. First, goal alignment is reinforced explicitly during post-training rather than being inherited from pretraining. Second, scratchpad state is managed across tool calls rather than regenerated each time, which reduces the cost of remembering prior decisions. Third, the model is trained to evaluate its own intermediate progress against the original objective, which creates a built-in checkpoint mechanism. None of these are architectural changes from GLM-5. They are post-training behavior shifts layered on the same 744B-parameter MoE base.
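    The three behaviors combine into a loop shape like the following sketch. The function signatures and checkpoint policy are assumptions for illustration, not Z.ai's API or training recipe.

```python
def long_horizon_loop(objective, act, evaluate, max_iters=1000,
                      check_every=50):
    """Illustrative agent loop with the three described behaviors:
    a persistent scratchpad carried across tool calls, periodic
    self-evaluation against the original objective, and rollback to
    the best known state when progress regresses.
    """
    scratchpad = {"objective": objective, "history": []}
    best = (float("-inf"), None)
    state = None
    for i in range(max_iters):
        state, note = act(state, scratchpad)   # one decide/act/observe cycle
        scratchpad["history"].append(note)     # remembered, not regenerated
        if (i + 1) % check_every == 0:
            score = evaluate(state, objective)  # checkpoint vs original goal
            if score > best[0]:
                best = (score, state)
            else:
                state = best[1]                 # resist drift: roll back
    return best
```

    The rollback branch is what counters error accumulation: a run that wanders for fifty iterations loses at most one checkpoint interval of work rather than the whole session.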

    The practical consequence: for an agent operator building workflows that run unattended, model selection is now a function of how long the task is expected to run. Under 30 minutes, Claude Opus 4.6’s raw reasoning quality wins. Over 4 hours, GLM-5.1’s drift resistance starts to matter more than raw capability.

    Why SWE-Bench Pro and SWE-Bench Verified disagree

    The two benchmarks measure different things. SWE-Bench Verified is a curated set of GitHub issues where the problem statement, test cases, and acceptance criteria were validated by human reviewers to be unambiguous. The evaluation uses a fixed instruction prompt. Models get one shot at each issue, with no iteration. The benchmark rewards tight, correct, single-pass problem solving.

    SWE-Bench Pro is the newer benchmark Z.ai cites for its top-line score. It uses a 200,000-token context window, allows tailored instruction prompts, and tests real-world industrial code repair on larger repositories. It rewards extended context use, prompt engineering, and iterative repair within a session. GLM-5.1 optimized its post-training for this profile. Claude Opus 4.6 optimized for the Verified profile.

    The evaluation framework matters as much as the model. On Terminal-Bench 2.0, GLM-5.1 scores 63.5 when measured with the Terminus-2 framework. It scores 66.5 with the Claude Code framework. Three-point swing, same task, same model, different test environment. Claude Code is tuned to Claude’s tool-call patterns, and GLM-5.1 inherits the lift because its tool-call format is compatible. Developers reading benchmark numbers in April 2026 need to ask three questions: which framework, which prompt, which context length. Any of those three variables alone can produce a multi-point swing.

    Z.ai reports an internal coding score of 45.3 against Claude Opus 4.6 at 47.9 on its own proprietary benchmark. The methodology uses Claude Code as the framework, which favors Claude’s tool-call conventions. That GLM-5.1 reached 94.6 percent of the Opus score on an away-game setup is either a sign the model is genuinely close or a sign the benchmark needs an independent replication. Both readings are open.

    The hardware story nobody is calling out

    GLM-5 and GLM-5.1 were trained on 100,000 Huawei Ascend 910B chips using the MindSpore training framework. Zero NVIDIA GPUs. Z.ai was placed on the US Entity List in January 2025, which restricted the company’s access to American silicon.

    A model trained entirely on non-NVIDIA hardware scoring within 1.1 points of Claude Opus 4.6 on SWE-Bench Pro contradicts a load-bearing assumption in Western AI discourse: that frontier model training requires NVIDIA. The assumption was reasonable twelve months ago. It is no longer reasonable. Chinese labs have now demonstrated a validated post-training pipeline on domestic silicon, and the result is a model that open-weight US competitors cannot match on the Pro benchmark. The geopolitical implication extends beyond Z.ai. Any future US export control aimed at restricting Chinese AI capabilities must account for the fact that the restricted path has already produced a competitive model.

    What developers should actually do with this

    GLM-5.1 costs $1.40 per million input tokens and $4.40 per million output tokens via the Z.ai API. A cache discount brings repeated input to $0.26 per million. Off-peak promotional pricing through April 2026 lets developers use standard rates during Beijing off-peak hours. The GLM Coding Plan subscription starts at $3 per month at promotional pricing and $10 standard. Compare to Claude Opus 4.6 at $15 per million input tokens and $75 per million output. The input cost ratio is roughly 10x cheaper on GLM-5.1. The output cost ratio is 17x cheaper.
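    The quoted prices make per-session costs easy to compare directly. A small sketch using the article's numbers (USD per million tokens; cache and off-peak discounts omitted):

```python
PRICES = {  # USD per million tokens, as quoted in the article
    "glm-5.1": {"in": 1.40, "out": 4.40},
    "claude-opus-4.6": {"in": 15.00, "out": 75.00},
}

def session_cost(model, input_tokens, output_tokens):
    """Cost of one session at list price for the given token counts."""
    p = PRICES[model]
    return (input_tokens * p["in"] + output_tokens * p["out"]) / 1_000_000
```

    For a long-horizon session on the order of 2 million input and 500,000 output tokens, the list-price gap is roughly $5 versus $67.50, which is the arithmetic behind running drift-tolerant workloads on the cheaper model.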

    Compatibility is already broad. GLM-5.1 plugs into Claude Code, OpenCode, Kilo Code, Roo Code, Cline, and Droid as a drop-in model via the GLM Coding Plan. The API is OpenAI-compatible, which means existing routing layers work without modification. Perplexity’s Computer product already routes across 19 models, and a GLM-5.1 addition is trivial. Grok 4.20’s multi-agent architecture offers another orchestration pattern for teams combining open and closed models.

    Self-hosting requires 8 H100 GPUs or equivalent at minimum. The FP8 quantized version roughly halves memory requirements. Local inference frameworks vLLM and SGLang both support GLM-5.1 natively.

    The practical use case is long-horizon iterative work. Database tuning, kernel optimization, large refactors, and any task where drift over 1,000+ tool calls destroys the session. For reasoning-heavy single-shot tasks, Claude Opus 4.6 still leads. GPQA-Diamond gap is 8 points in Claude’s favor. BrowseComp gap is 16 points. For fast single-shot code completion, GLM-5.1 is the slowest model in the comparison at 44.3 tokens per second.

    Limitations

    The base model’s SWE-Bench Pro number is externally validated. The internal 45.3-versus-47.9 comparison is self-reported and not independently replicated as of April 13, 2026. Z.ai has a track record of internal numbers holding up under scrutiny, since GLM-5’s SWE-Bench Verified score of 77.8 was externally confirmed to be the highest open-source score on that benchmark, but treat the 94.6 percent figure as a preliminary claim until third-party labs publish.

    Context window is 200,000 to 256,000 tokens depending on configuration, compared to 1 million on Claude Opus 4.6. Multimodal input support is absent. Peak-hour quota on the Coding Plan consumes at three times the standard rate during Beijing afternoon hours, which turns a $3-per-month plan into a much steeper effective cost for developers in incompatible time zones.

    The MIT license is real and enforceable, but Chinese regulatory overlay on foundation-model deployment creates a separate risk axis for production users outside China. US enterprise legal teams will treat a Chinese-trained, Chinese-hosted model differently from a US-trained alternative, regardless of license terms. Self-hosting bypasses the regulatory question but does not address provenance concerns about training data.

    What happens next

    Anthropic’s unreleased Claude Mythos Preview reportedly scores 77.8 on SWE-Bench Pro. That is 19.4 points ahead of GLM-5.1. If the cadence of recent releases holds, that gap closes in months, not years. Z.ai shipped GLM-5 on February 11, Turbo on March 15, the GLM-5.1 API on March 27, and the open weights on April 7. Four releases in two months. GPT-5.4 and Gemini 3.1 Pro both have coding-specific responses planned for the second quarter of 2026.

    The benchmark contradiction at the heart of this story foreshadows the rest of 2026. Leaderboard rankings will fragment by framework, prompt, and context length. Vendors will ship self-scored benchmarks on their own test rigs. Developers will need their own evaluation pipelines on their own code to decide which model to deploy. A single authoritative benchmark number is becoming less useful by the month. Both of GLM-5.1’s headline numbers, 58.4 on Pro and 77.8 on Verified, are correct. They just answer different questions.

  • Perplexity Computer Is a Productized Router on Top of Research That Has Been in the Open for Two Years. Here Is What It Actually Does.

    Perplexity launched Computer on February 25, 2026 as a multi-model orchestration platform that coordinates 19 frontier AI models from OpenAI, Anthropic, Google, xAI, and several Chinese open-source labs. The product is priced at $200 per month for Max subscribers, targeted at long-running agentic workflows, and built around the thesis that frontier models are specializing rather than commoditizing. That thesis, and the marketing framing around 19 models in one box, has generated most of the launch coverage.

    For an ML engineer evaluating Computer as a production artifact, the marketing framing is the least useful part. The question that matters is whether the underlying routing harness is a qualitatively new piece of infrastructure or a productized version of research that has been in the open for two years. The answer is the second one, with one genuinely novel addition that almost nobody has discussed. Computer is also one of three different architectural bets on frontier multi-agent orchestration shipping within a six-week window, and the three are architecturally distinct in ways the coverage has not separated.

    This article walks through the routing function, the leader-worker assignment, the production constraints that come with a server-side sandbox, and the open-sourced post-training pipeline Perplexity built to strip Chinese models of state content before deploying them. It compares each piece to the research precursors it resembles: DSPy, RouteLLM, FrugalGPT, Mixture of Agents, LangGraph, and LiteLLM. And it places Computer alongside the other two architectural choices shipping right now from Meta and xAI. It ends with where Computer differs from the Personal Computer companion product Perplexity announced at Ask 2026, which solves a different problem on different hardware.

    The model stack, as published

    Perplexity’s own launch blog is explicit about which model handles what. As of publication, Computer runs Claude Opus 4.6 as the core reasoning engine. The sub-agent assignments are: Gemini for deep research and creating new sub-agents, Google’s Nano Banana image model for image generation, Veo 3.1 for video, Grok for fast lightweight tasks, and ChatGPT 5.2 for long-context recall and wide search. Perplexity’s own search API and ranking infrastructure sits underneath all of them. For the remaining roles, Perplexity says, the harness assigns the best model for each specific task and is allowed to swap models as new ones ship.

    This is a role-based assignment, not a cost-optimized routing decision at query time. The harness does not evaluate every query against every model and pick the cheapest path that meets quality. It assigns fixed roles to fixed models and lets the leader decompose a task into sub-tasks that match those roles. The user can override model selection per sub-agent if they want finer control.

    The role-assignment pattern has a clear research precursor. Wang et al. published Mixture of Agents in June 2024, describing a multi-layer architecture where proposer agents generate candidate responses and aggregator agents synthesize them into a final output. The MoA paper showed that a stack of open-source models could beat GPT-4 on AlpacaEval 2.0 by coordinating multiple models across rounds. Perplexity Computer is a productized version of this pattern with a single aggregator at the top, specialized sub-agents underneath, and longer multi-turn continuity.

    The leader-worker split also resembles the AutoGen multi-agent pattern that Microsoft Research published in October 2023, where a user proxy and assistant agents interact in a conversation-driven workflow. Both of these are research papers with working implementations. Neither was productized at the frontier-model tier until Computer shipped. That is the novelty: not the pattern, but the productization.

    What the routing function actually does

    The routing function inside Computer, as described in Perplexity’s own statements and in the VentureBeat launch coverage, is closer to decomposition plus dispatch than to classical routing. The leader model receives the user’s high-level objective, decomposes it into sub-tasks, and assigns each sub-task to the model tagged for that capability. Task types map to model roles. Image generation goes to Nano Banana. Long-context retrieval goes to GPT-5.2. Search goes to Perplexity’s own search stack. Reasoning and coordination stay on Claude Opus 4.6.
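    Static role assignment of this kind reduces to a lookup table plus a dispatch step. The role names below paraphrase the published assignments, and the strings stand in for real API clients; the override mechanism mirrors the per-sub-agent user override the launch materials describe.

```python
# Role table paraphrasing the published assignments; a real harness
# would carry API clients here instead of model-name strings.
ROLES = {
    "reasoning": "claude-opus-4.6",
    "deep_research": "gemini",
    "image": "nano-banana",
    "video": "veo-3.1",
    "fast": "grok",
    "long_context": "gpt-5.2",
    "search": "perplexity-search",
}

def dispatch(subtasks, overrides=None):
    """Static role assignment: the leader tags each subtask with a
    capability, then maps it to the model registered for that role.
    No per-query cost/quality scoring happens at this layer.
    """
    table = {**ROLES, **(overrides or {})}
    return [(task, table[capability]) for task, capability in subtasks]
```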

    The research comparison that matters here is not Mixture of Agents. It is the frugal-routing literature. FrugalGPT, published by Chen, Zaharia, and Zou in 2023, proposed a cascade where queries are first sent to the cheapest model, then escalated to progressively larger models only if the cheap model’s output fails a verifier check. RouteLLM, published by Ong et al. in 2024, trained a learned router to predict which model would be sufficient for a given query based on cost-quality trade-offs.

    Computer does not use cascade-to-verifier, and it does not appear to use a learned query-to-model classifier. It uses static role assignment at the leader level. That is simpler than FrugalGPT, simpler than RouteLLM, and easier to explain to users. It is also more expensive per query in the average case, because every non-trivial request touches the most expensive model in the stack. A FrugalGPT-style cascade could in principle handle 60 to 70 percent of Computer’s query volume at much lower cost, but Perplexity has not published data showing Computer does this.
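    For contrast, a FrugalGPT-style cascade looks like the following sketch: cheapest model first, escalation only when a verifier rejects the answer. The model list and verifier here are placeholders, not the paper's exact scoring functions.

```python
def cascade(query, models, verify):
    """FrugalGPT-style cascade.

    models: list of (name, generate_fn) ordered cheapest-first.
    verify(query, answer) -> bool; escalate while it rejects.
    """
    for name, generate in models:
        answer = generate(query)
        if verify(query, answer):
            return name, answer
    # Every tier rejected: fall back to the most capable model's answer.
    return name, answer
```

    The economics follow directly: any query the cheap tier answers acceptably never touches the expensive leader, which is the cost lever a static-role harness like Computer gives up.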

    This matters for the $200 per month price tag. The unit economics of a static-role harness with Claude Opus 4.6 as the leader are fundamentally bounded by Anthropic’s output pricing. Opus 4.6 at $75 per million output tokens is the reason FrugalGPT-style cascades exist in the research literature. Computer either eats those costs, passes them through its opaque credit system, or eventually moves to a cost-optimized router variant. All three are possible. None of them are publicly committed to.

    Three architectural choices in the frontier multi-agent space

    Computer is one of three different architectural bets on multi-agent orchestration shipping at the frontier right now. All three ship within six weeks of each other and solve the same basic problem through different mechanisms.

    The first is in-model parallelism. Meta’s Muse Spark, released April 8, 2026 from Meta Superintelligence Labs, introduced a mode called Contemplating that spawns multiple subagents inside a single model. Alexandr Wang described the design principle directly: to spend more test-time reasoning without drastically increasing latency, scale the number of parallel agents that collaborate to solve hard problems. Muse Spark’s subagents are not separate model instances. They are parallel reasoning paths inside one model, synthesized into a final answer through a mechanism Meta has not yet published. The parallelism happens under a single weight matrix. The full architectural story, including the unified multimodal representation and the scaling-law claim, is in the Muse Spark breakdown.

    The second is replica parallelism. xAI’s Grok 4.20 multi-agent runs 4 or 16 instances of the same base model in parallel, with a leader agent synthesizing a final response from the ensemble. Sub-agent state is encrypted and not returned to the caller by default. The agents are all the same model. What differs is the prompt each instance receives and the internal deliberation the leader performs before committing to a response. The full mechanism is covered separately, including the production constraints that make this hard to drop into existing stacks.

    The third is cross-model orchestration, which is what Perplexity Computer actually ships. The subagents are different models entirely: Opus 4.6 as leader, Gemini for research, GPT-5.2 for long context, Nano Banana for images, Veo 3.1 for video, Grok for speed, plus a rotating cast of Chinese open-source models. The leader does not choose a parallel path through one model’s weights. It dispatches each subtask to the model tagged for that capability. The parallelism is across entirely separate weight matrices from competing labs.

    These three choices have different failure modes and different cost structures. In-model parallelism is bounded by the single model’s ceiling. A Muse Spark that cannot solve a specific coding problem cannot solve it by adding more Contemplating subagents. Replica parallelism has the same limit: 16 Grok instances cannot exceed what one Grok instance knows. Cross-model orchestration is the only one of the three where the ensemble can legitimately exceed any individual component, because the components are different models with different training data and different strengths. It is also the only one where the cost of a single query scales with the external pricing of every model in the stack, not just the one running the harness.

    The sandboxed server-side harness

    Computer runs every sub-agent inside an isolated compute environment with a real file system, a browser, and a set of tool integrations. Tasks can run for hours, days, or months. The user can spawn multiple Computer instances in parallel. The architecture resembles a managed version of what LangChain’s LangGraph and Microsoft’s AutoGen do in self-hosted code, except the compute and the state live on Perplexity’s servers instead of the user’s.

    The server-side choice has two concrete implications for ML engineers.

    First, you cannot inspect sub-agent state the way you can in a self-hosted LangGraph deployment. LangGraph exposes the full execution graph, the state at each node, and the transition history as first-class data the developer can query. Computer does not, at least not at launch. The harness is a product, not a framework, and the internal state is opaque to the caller beyond the final output and a credit bill. This is similar in structure to the encrypted sub-agent state trust model that xAI shipped with Grok 4.20 multi-agent, where only the leader agent’s output is exposed by default and the intermediate reasoning is encrypted.

    Second, the long-running task model changes the cost prediction problem. A traditional API call has a bounded cost you can estimate from input length. A Computer task can run for a week, spawn dozens of sub-agents, invoke search APIs against paid endpoints, and call image and video generation models. The credit system Perplexity uses to bill for this is not published as a line-item table. Early users have reported that task complexity drives credit burn in hard-to-predict ways. For an ML engineer building on top of Computer, this is closer to spot-pricing a compute cluster than calling an LLM API.

    The unpredictability of long-running task billing is a distinct research problem of its own. Some of the open questions about what happens when agent tasks fail or misfire are directly addressed by the Agentic Risk Standard work on escrow and underwriting for AI agent financial transactions. Perplexity Computer is one of the first commercial deployments where that research is going to get tested against production failure modes at scale.

    The post-training pipeline nobody is writing about

    This is where Computer has a piece of infrastructure that is genuinely new and that Perplexity open-sourced. Perplexity’s orchestration stack uses Chinese open-source models for some sub-agent roles. The launch material confirms this and names the broad category without publishing the full model list. What Perplexity did before deploying those models is unusual: it built a post-training pipeline that runs the open weights through a correction procedure designed to remove what Perplexity’s engineers called state-infused propaganda, then published the methodology.

    The pipeline has three technical moves, each of which is worth a paper by itself.

    First, Perplexity runs all inference for these models from its own U.S. data centers. The weights leave China. The training data that produced them does not get re-introduced into the deployment. This is a compliance and trust argument as much as a technical one, but the engineering trade is real: Perplexity is taking on the inference cost of models Alibaba, DeepSeek, and others subsidize on their own infrastructure.

    Second, Perplexity applies a post-training correction step to the weights. The details in the public material are limited, but the pattern is consistent with targeted preference tuning against a small curated dataset of politically sensitive topics where the open weights produce responses aligned with Chinese state positions. Supervised fine-tuning on counter-examples followed by RLHF or DPO-style preference optimization is the obvious mechanism. Perplexity did not disclose the exact loss function or the dataset size.

    Third, Perplexity built custom inference kernels for the corrected models. This is the piece that an ML infrastructure engineer should pay attention to. Custom CUDA kernels for Chinese open-source models are usually built inside the original labs, tuned for the labs’ own hardware, and released alongside the weights. Perplexity rebuilt them externally. The engineering cost is non-trivial and the motive is presumably cost optimization at scale.

    Perplexity open-sourced the depropagandization methodology for other teams to use. The act of open-sourcing this piece is the genuinely novel contribution. No other commercial AI lab has published a repeatable recipe for taking frontier open-source weights from a geopolitical competitor and retraining them against state-aligned content before deployment. The research literature on model poisoning detection and politically sensitive fine-tuning is substantial, but Perplexity is the first commercial deployment to turn it into a published pipeline. The closest precedent in the research literature is the work on detoxification fine-tuning for earlier LLMs, and that work does not target political content specifically.

    For an ML engineer evaluating Computer, this piece is worth more than the 19-model headline. If you build on Chinese open-source weights in a regulated environment, Perplexity just handed you a published methodology you can fork.

    Where Computer fits in the harness landscape

    The comparison matrix ML engineers should care about:

    LiteLLM is a unified API wrapper over dozens of model providers. It does not orchestrate, route intelligently, or coordinate multi-agent workflows. It normalizes calling conventions. Computer is not a LiteLLM competitor.

    LangGraph is a state-machine framework for multi-step agent workflows that you run on your own infrastructure. It exposes full state, supports custom routing, and integrates with any model through any provider. Computer is a managed version of the same idea with closed state and a fixed model stack.

    DSPy, from the Stanford NLP group, is a programmatic framework for building and optimizing LLM pipelines where the prompt, the model, and the routing are all compiled against a training set to maximize a target metric. DSPy is the research framework most similar in spirit to what Computer appears to do under the hood, but nothing Perplexity has published suggests Computer uses anything like DSPy’s compilation approach.

    AutoGen, from Microsoft Research, is an open-source multi-agent conversation framework. It is the closest research precursor to Computer’s leader-worker pattern.

    RouteLLM and FrugalGPT are cost-optimized routing systems. Computer does not appear to implement either at launch.

    Mixture of Agents is the specific architecture pattern Computer’s leader-sub-agent design most resembles.

    The honest read is that Computer is a productized harness combining AutoGen-style multi-agent coordination with MoA-style role assignment, delivered as a managed service with a credit-based billing system. It is not a new piece of research. It is a new piece of commercial infrastructure, and its cost structure is bounded by Anthropic’s Opus pricing unless Perplexity eventually ships a cost-optimized router.

    What this sets up for the rest of 2026

    The interesting thing about Computer is not whether it wins as a product. It is whether the multi-agent harness becomes the default abstraction above frontier models, the way Kubernetes became the default abstraction above containers. The research literature has been converging on this shape for two years. Perplexity is the first commercial lab to productize it at the frontier-model tier. Anthropic’s Claude Code sub-agents and the .claude folder protocol are a related but distinct bet on exposing the harness as inspectable files on the developer’s own machine. xAI shipped encrypted server-side multi-agent for Grok 4.20. Google’s Gemini has Deep Research mode. OpenAI has Codex and parallel function calling.

    Computer is not the only bet on the harness layer. Meta’s Muse Spark closed the open-source gates to protect the Contemplating architecture while the scaling law gets validated. xAI exposed replica parallelism as a closed commercial endpoint. Anthropic built an inspectable file-based harness in .claude/. Perplexity productized cross-model orchestration with an opaque credit system. All four labs agree that the harness matters. None of them agree on where the harness should live, who should be able to inspect it, or how it should be priced.

    Whichever abstraction wins at the harness layer is going to matter more for the next round of ML engineering than the base model benchmarks will. Computer is one bet, with a static role assignment, an opaque credit system, and a genuinely new post-training pipeline for Chinese open-source weights. The research it is built on is free to read. The methodology for the post-training piece is now open source. The rest of the harness is $200 a month.

  • Every Grok 4.20 Explainer Named the Four Agents. xAI’s Documentation Names Zero of Them.


    The single most repeated fact about xAI’s Grok 4.20 multi-agent release is false. Every major outlet covering the February 17, 2026 launch, and most of the follow-up coverage through March, describes four specialized AI agents named Grok, Harper, Benjamin, and Lucas that think in parallel and debate each other before synthesizing a response. The names come from an early speculation post on X. They are nowhere in xAI’s official documentation, nowhere in the xAI SDK, nowhere in the API schema, and nowhere in the model card.

    What xAI actually shipped is architecturally different from the parliament-of-four story. The model ID is grok-4.20-multi-agent. It is configurable to either 4 or 16 agents via a single parameter. One leader agent orchestrates the rest. Sub-agent intermediate state is encrypted and not returned to the caller by default. The model does not support the OpenAI Chat Completions API. It does not accept client-side function calling or custom tools. It ignores max_tokens. These are real production constraints that determine whether you can drop this into an existing agent stack, and almost nobody covering the launch has mentioned them.

    This article reads the documentation the way a developer would. It corrects the agent-name error, explains the leader-orchestration mechanism, walks through the 4-versus-16 configuration, covers the pricing math, and ends with the benchmarks and limits that actually matter.

    The architecture xAI published

    The grok-4.20-multi-agent model is available through xAI’s Responses API and via the xAI SDK. The documentation describes it as Realtime Multi-agent Research and frames it as an orchestration pattern rather than a new base model. In the docs’ own words: when you send a request to the multi-agent model, multiple agents are launched to discuss and collaborate on your query. Each agent contributes its own perspective, reasoning, and findings. A designated leader agent is responsible for synthesizing the discussion and presenting the final answer back to you.

    That is the entire described mechanism. There is no list of named personas. There are no fixed specializations. There is a leader and some sub-agents, and the number of sub-agents is a configuration parameter.

    xAI exposes the agent count through two equivalent interfaces. Callers using the xAI SDK set agent_count directly to 4 or 16. Callers using the OpenAI-compatible Responses API or the Vercel AI SDK set reasoning.effort to "low" or "medium" for 4 agents, or "high" or "xhigh" for 16. Every other value is rejected.

    The 4-agent setup is positioned for focused queries. The 16-agent setup is positioned for multi-faceted research. xAI’s own documentation flags the trade directly: more agents means deeper analysis at the cost of higher token usage and latency.
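The documented mapping can be captured in a small helper. The function name is our own; the value pairs are the ones xAI publishes:

```python
# The documented mapping between OpenAI-compatible reasoning.effort
# values and the agent count on grok-4.20-multi-agent. The value
# pairs come from xAI's configuration; the helper name is ours.

EFFORT_TO_AGENTS = {"low": 4, "medium": 4, "high": 16, "xhigh": 16}

def agent_count_for(effort: str) -> int:
    """Translate a reasoning.effort value into an agent count,
    rejecting anything outside the four accepted values."""
    if effort not in EFFORT_TO_AGENTS:
        raise ValueError(f"unsupported reasoning.effort: {effort!r}")
    return EFFORT_TO_AGENTS[effort]

print(agent_count_for("medium"))  # 4
print(agent_count_for("xhigh"))   # 16
```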

    Encrypted scratchpad state

    The output behavior matters because it determines what you can audit and what you pay for.

    By default, only two things come back from a multi-agent request. The leader agent’s final text. And any server-side tool calls the leader made. Everything the sub-agents thought, searched, cited, or debated is encrypted and withheld from the visible response. The docs are explicit: all sub-agent state, including their intermediate reasoning, tool calls, and outputs, is encrypted and included in the response only when use_encrypted_content is set to True in the xAI SDK.

    Setting use_encrypted_content=True returns an opaque blob that you cannot read but that you can pass back into the next turn of a multi-turn conversation. The blob preserves the full deliberation context so the agents can continue their work on a follow-up query. If you do not pass it back, the next turn starts cold.

    This is an unusual trust model. A developer watching a sub-agent debate over a production task cannot see what the sub-agents actually said. They get the leader’s synthesis and a bill for all the reasoning tokens spent underneath. If the leader hallucinates something that one sub-agent correctly flagged, there is no straightforward way to catch it from the outside. The encrypted blob gives xAI plausible forward compatibility but gives the caller zero inspection.

    Server-side tool loop

    The multi-agent variant runs its tools on xAI’s servers. When you enable a tool like web_search, x_search, code_execution, or collections_search, the server performs the full agent loop without returning control to the client until the final answer is generated. This is the opposite of the client-side function calling pattern that most OpenAI-compatible integrations assume.

    The consequences for developers are concrete. Client-side function calling is not supported on grok-4.20-multi-agent. Custom tools defined by the caller are not supported. The only tools the agents can use are the ones xAI hosts. Remote MCP tools are supported because they live on a server the model can reach over HTTP. Local Python functions exposed through OpenAI-style tool schemas are not.

    Two additional constraints make production integration trickier than the Grok API docs for single-agent Grok 4.1 would suggest. The Chat Completions API is not supported. You must use the Responses API or the xAI SDK. And max_tokens is silently ignored. There is no way to cap output length from the client side. If you need a short answer, you ask for one in the prompt and hope the leader complies.

    The pricing math the debate narrative hides

    xAI’s base Grok 4.20 pricing is competitive at $2 per million input tokens and $6 per million output tokens. The multi-agent variant is listed at $10 per million input and $50 per million output on OpenRouter and third-party resellers. That is roughly 5 times the base input price and more than 8 times the base output price.

    The reason is that every token consumed by both the leader agent and the sub-agents is billed. Server-side tool calls made by any agent are billed at the same tool-use rates as a standard request. A single 16-agent query that does deep web search and code execution can legitimately consume tens of thousands of tokens across 17 model instances, plus tool-use surcharges. xAI’s documentation says so directly: because multiple agents may run in parallel and each can independently invoke tools, a single multi-agent request may use significantly more tokens and tool calls than a standard single-agent request.

    The debate narrative, where four named agents peer-review each other for free, obscures the cost reality. This is closer to paying for 17 instances of a frontier model on every hard query.
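A back-of-envelope comparison at the listed rates makes the gap concrete. The token counts for the hard query below are assumptions for illustration, not measured values, and tool-use surcharges are excluded:

```python
# Cost comparison at the listed OpenRouter rates: base Grok 4.20 at
# $2/M input and $6/M output, the multi-agent variant at $10/M input
# and $50/M output. Token counts are hypothetical; tool surcharges
# and the per-agent breakdown are not modeled.

def query_cost(input_tokens, output_tokens, in_rate, out_rate):
    """Dollar cost of one request at per-million-token rates."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Assume a research query where the 16-agent run burns 40k input and
# 60k output tokens across the leader and sub-agents combined, versus
# a modest single-agent request.
base = query_cost(4_000, 2_000, 2, 6)       # single-agent
multi = query_cost(40_000, 60_000, 10, 50)  # multi-agent
print(f"base: ${base:.4f}  multi: ${multi:.4f}  ratio: {multi/base:.0f}x")
```

Under these assumptions a single hard query costs two cents on the base model and several dollars on the multi-agent endpoint, which is the pricing reality the debate narrative obscures.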

    What the benchmarks actually show

    xAI’s Alpha Arena result from January 2026, covered by Next Big Future, put a pre-release Grok 4.20 configuration at the top of a live stock-trading competition. The model turned $10,000 into between $11,000 and $13,500 across runs, with optimized configurations pushing to 34 to 47 percent returns. This is genuine and interesting, though it also reflects a specific task type that rewards fast iteration over real-time data, which is exactly what the multi-agent architecture with x_search is built for.

    The publicized benchmark numbers are strong but uneven. Grok 4.20 hit 93.3 percent on AIME, a mathematical reasoning test. On Artificial Analysis’s AA-Omniscience hallucination benchmark, it posted a 78 percent non-hallucination rate, the highest any model has scored on that test. GPQA Diamond at 78.5 percent and MATH-500 at 87.3 percent put it in the top tier. The 2 million token context window matches or beats Claude Opus 4.6 for long-horizon tasks.

    The Artificial Analysis hallucination result turned out to matter more than the headline framing suggested. Grok 4.20 reasoning variants now hold the lowest hallucination rate on the current AA-Omniscience leaderboard, at 17 percent. Gemini 3.1 Pro’s widely reported 38-point reduction left it at 50 percent, still 33 points higher than Grok’s reasoning variant. If the thing you care about is how often a frontier model confidently states something false, Grok 4.20 is the measurable leader, not Gemini.

    Where Grok 4.20 lags is on the enterprise-task benchmarks that Claude Sonnet 4.6 dominates. SmartScope and Artificial Analysis both noted that GDPval-style Elo evaluations for financial, legal, and expert-professional tasks do not show Grok 4.20 competing at the top, which tracks with a training data mix heavy on X and light on regulated-industry corpora.

    For readers comparing it to the current frontier, the three-way context on GPT-5.4 vs Claude Opus 4.6 vs Gemini 3.1 Pro architecture differences gives the honest positioning. Grok 4.20 is now a credible fourth in the race, with an orchestration trick that the other three have not productized in the same way.

    The limits that matter

    The limits xAI declared in the beta documentation are not minor. They are architectural, and most of them are not going away with the next point release.

    Only the leader agent output is exposed. Sub-agent reasoning is encrypted and inaccessible even to the developer paying for it. This makes auditing the model’s reasoning for a production deployment harder than auditing a single-model request.

    No client-side function calling. No custom tools. If your agent stack depends on calling local Python functions or proprietary internal APIs through tool schemas, you cannot use grok-4.20-multi-agent for those tasks. You can fall back to single-agent Grok 4.1 Fast for the rest.

    No Chat Completions API. This breaks a large class of existing integrations that assume the OpenAI chat interface. Migrations to the Responses API are not trivial for codebases with complex conversation state handling.

    No max_tokens. There is no mechanical way to bound cost or output length from the client. Budget guardrails have to happen at the billing layer.

    And because the benchmark spread is uneven, the model’s real strengths are on tasks that benefit from parallel web research and debate-style synthesis. It is not a drop-in upgrade for coding agents that need tight tool loops over local code, and it is not an obvious fit for regulated-industry deployments where the encrypted-state trust model is itself a compliance question.

    What this sets up

    The interesting thing about Grok 4.20 multi-agent is not that it invented multi-agent orchestration. Research labs have been publishing on multi-agent debate, Mixture of Agents, and verifier-augmented decoding for over a year. What xAI did was ship the first productized, priced, server-side multi-agent endpoint from a frontier lab. Anthropic’s Claude sub-agents, OpenAI’s parallel function calling, and Google’s Gemini Deep Research each hint at similar patterns, but none of them expose a single model ID with a configurable agent count and a published 4-versus-16 knob.

    Meta shipped the competing bet two months later. Its Muse Spark Contemplating mode, released April 8, 2026, spawns parallel subagents inside a single model rather than across replicas of the same model. The choice between in-model parallelism and replica parallelism is now one of the live architectural debates among frontier labs. Grok 4.20 is the first commercial endpoint to ship the replica variant at scale, and Perplexity Computer ships a third variant that orchestrates across entirely different models from different labs. Three architectures, six weeks apart, solving the same problem with fundamentally different mechanisms.

    If this pattern works commercially, the next round of frontier models from other labs will likely ship something that looks similar. The real question for developers is whether the encrypted-scratchpad trust model becomes the norm. For Anthropic’s Claude, where the .claude folder protocol exposes every sub-agent’s memory as inspectable files, the answer is probably no. For xAI, it already is.

    The four named debating agents of the Grok 4.20 launch coverage were a story that wrote itself and wrote itself wrong. The architecture underneath is less charming and more constrained, and the production trade-offs are exactly the ones you would expect from the first lab to ship this pattern behind a paywall. The documentation has been public since the beta launched. It is still the only place to read what was actually shipped.

  • North Korea’s Contagious Interview Operation Expanded to Five Package Ecosystems. One Staging Server Connects All 1,700 Packages.


    On March 31, a North Korean threat actor hijacked the npm account of Axios maintainer Jason Saayman and pushed two malicious versions of the HTTP client, which sees 100 million weekly downloads. The malicious packages were live for roughly two hours before removal. That attack was a single operation against a single target in a single ecosystem.

    On April 7, 2026, Socket security researcher Kirill Boychenko published a report showing the same threat actor cluster has been running a parallel operation across five package ecosystems simultaneously. The same staging infrastructure. The same payload delivery pattern. The same fake developer tooling names designed to blend into dependency lists. Twelve confirmed malicious packages published across npm, PyPI, Go Modules, crates.io, and Packagist under a set of coordinated GitHub aliases. Socket’s running tracker for the broader campaign, which has been active since at least 2024, now lists more than 1,700 malicious packages tied to this activity.

    The Axios attack was the visible event. This infrastructure is the operation underneath it.

    The campaign by the numbers: 1,700+ packages tracked, 5 ecosystems hit, 12 confirmed packages in the April cluster, 164 domains blocked.

    The threat actor and the campaign

    Contagious Interview is a persistent North Korean cyber operation that has been running since at least 2023. Security researchers attribute it to a financially motivated cluster designated UNC1069, which overlaps with threat groups tracked under the names BlueNoroff, Sapphire Sleet, and Stardust Chollima. These are not separate teams. They are different naming conventions applied by different threat intelligence firms to the same operational infrastructure, which originates from North Korea’s intelligence services and funds the Kim regime through cryptocurrency theft and data extortion.

    The Security Alliance (SEAL) published a complementary report on April 7 documenting that between February 6 and April 7, 2026, it blocked 164 domains operated by UNC1069. Those domains impersonated legitimate services, predominantly Microsoft Teams and Zoom, and were used in social engineering campaigns conducted across Telegram, LinkedIn, and Slack. The operational pattern is consistent: threat actors build rapport with developers over weeks or months through fake professional identities, then invite targets to a video call that requires downloading malware disguised as a meeting update. Jason Saayman’s account compromise on March 31 followed this exact pattern. The supply chain packages disclosed on April 7 are the passive infrastructure that runs alongside the active social engineering campaigns.

    Microsoft’s threat intelligence general manager Sherrod DeGrippo described the operational continuity to The Hacker News: “What we consistently see is ongoing evolution in how financially motivated actors associated with North Korea operate, shifts in tooling, infrastructure, and targeting, but with clear continuity in behavior and intent.”

    The packages and what they pretend to be

    The April 7 cluster was published under three GitHub aliases: golangorg, aokisasakidev, and aokisasakidev1. Two supporting personas, maxcointech1010 and maxcointech0000, provided additional infrastructure. The packages were designed to impersonate developer tooling that developers routinely install without deep inspection:

    npm: dev-log-core, logger-base, logkitx, pino-debugger, debug-fmt, debug-glitz. These names mimic the real npm packages debug, pino-debug, and debug-logfmt, all of which have millions of weekly downloads in the Node.js ecosystem.

    PyPI: logutilkit, apachelicense, fluxhttp, license-utils-kit. These mimic license, http, and standard logging utilities.

    Go Modules: github.com/golangorg/formstash, github.com/aokisasakidev/mit-license-pkg. The formstash package is mostly a real multipart parser with a malicious helper function appended.

    crates.io (Rust): logtrace, which mimics the legitimate libprettylogger crate. This package was tracked as RUSTSEC-2026-0081 and removed by the crates.io security team after Socket’s disclosure.

    Packagist (PHP/Composer): golangorg/logkit, mimicking the openlss/func-log package in the PHP ecosystem.
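How close these names sit to the packages they mimic is easy to quantify with edit distance. The pairings follow the report above; the Levenshtein helper is a standard implementation, not tooling from Socket:

```python
# Edit distance between malicious npm names from the cluster and the
# legitimate packages the report says they mimic. The helper is a
# standard iterative Levenshtein implementation.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

pairs = [("pino-debugger", "pino-debug"),
         ("debug-fmt", "debug-logfmt")]
for fake, real in pairs:
    print(fake, "->", real, levenshtein(fake, real))
```

Distances of two or three edits against multi-million-download packages are the sweet spot for typosquatting: close enough to pass a glance at a dependency list, far enough to be a distinct registry entry.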

    The loader pattern: one infrastructure, five languages

    The technical signature of this cluster is the consistency of the staging infrastructure across all five ecosystems. Every package in the cluster follows the same loader workflow regardless of the target language.

    Step one: contact the staging endpoint with an HTTP POST. The endpoint is https://apachelicense.vercel.app/getAddress?platform=<platform>, where the platform parameter identifies the operating system. The use of Vercel hosting is deliberate: Vercel domains are broadly trusted by corporate network security tools and rarely flagged in egress filtering rules.

    Step two: parse the JSON response, which contains a downloadUrl field. If the URL is a Google Drive sharing link, the loader rewrites it into a direct-download form before fetching. This Google Drive relay pattern is a consistent tradecraft element, as Drive links survive many URL-based threat intelligence blocklists.

    Step three: download a ZIP archive. The filename is consistent across the cluster: ecw_update.zip.

    Step four: extract the archive into a temp directory. The extraction path is hardcoded and consistent: 410BB449A-72C6-4500-9765-ACD04JBV827V32V. This UUID-like string is specific enough to serve as a reliable indicator of compromise in process monitoring and endpoint detection.

    Step five: find and execute the platform-specific payload. On Unix systems, payload names are chosen to mimic legitimate system processes: com.apple.systemevents on macOS and systemd-resolved on Linux. Both names appear in normal system process lists, reducing the likelihood that a developer or sysadmin will flag them during a cursory process review.
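Defenders can turn the hardcoded strings from these steps into a simple source scan. A minimal sketch with our own function names, flagging any file in a dependency tree that contains one of the published indicators:

```python
# Scan a dependency tree for the cluster's published string IoCs:
# the Vercel staging endpoint, the ecw_update.zip archive name, and
# the hardcoded extraction directory. A sketch, not a full scanner.

from pathlib import Path

IOCS = (
    "apachelicense.vercel.app",
    "ecw_update.zip",
    "410BB449A-72C6-4500-9765-ACD04JBV827V32V",
)

def scan_tree(root: str) -> list[tuple[str, str]]:
    """Return (file, indicator) pairs for every file under root
    whose text contains one of the known IoC strings."""
    hits = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        for ioc in IOCS:
            if ioc in text:
                hits.append((str(path), ioc))
    return hits
```

String matching like this only catches this specific cluster; a variant that rotates the staging domain or the archive name walks straight past it, which is why the behavioral indicators below matter more.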

    The primary payload objective is a RAT-enabled infostealer operation. The malware targets credentials stored in password managers and browsers, cryptocurrency wallet data and private keys, and session tokens for services the developer has authenticated to. Because developers often have access to production credentials, CI/CD tokens, and cloud service API keys, a successful compromise of a developer’s workstation is significantly more valuable to the threat actors than a consumer endpoint.

    Where the malicious code hides

    The most technically significant aspect of this cluster is where the loader code sits within each package. The threat actors did not rely on install-time execution, which is the most commonly flagged malicious behavior in package registry security scanners. Instead, they embedded the trigger inside methods that look functionally normal for the package’s claimed purpose.

    In the PyPI package logutilkit, the malicious trigger sits inside the generic log() method. A call like logutilkit_util.check_for_updates(level) appears inside the standard logging function. Without reading the source, a developer would have no reason to inspect a log call.

    In apachelicense and license-utils-kit, the trigger is embedded in a method called find_by_key(). For a package presenting itself as a license lookup library, this is a perfectly plausible helper name. The malicious path calls subprocess.Popen on Windows or a staged loader function on Linux and macOS. The code passes a cursory reading because the method name is appropriate for the package’s stated purpose.

    In the Rust crate logtrace, the trigger is inside Logger::trace(i32). A logging crate that exposes a trace method is completely unremarkable. The method body contains the staging endpoint call, the ecw_update.zip download path, and the hardcoded extraction directory.

    In the Go module github.com/golangorg/formstash, the package is mostly a real multipart form parser. The malicious functionality is in a helper function called CheckForUpdates(tValue int). This function has no legitimate place in a form parsing library, but its name is generic enough to avoid suspicion in a brief code review.

    The strategic implication is that static analysis tools that scan for install hooks or postinstall scripts will not catch this class of package. The trigger executes at runtime, when the malicious function is first called during normal package usage.

    The Windows-heavy variant goes further

    One package in the cluster stands apart from the standard loader pattern. The PyPI package license-utils-kit includes a Windows-specific execution path that delivers a substantially more capable implant than the standard RAT payload. Socket’s analysis found capabilities consistent with remote shell execution, keystroke logging, browser and wallet data theft, collection of sensitive files by extension and filename pattern, encrypted archiving of collected data, and persistent remote-access deployment.

    The distinction matters for incident response prioritization. The standard loader packages in this cluster are initial access tools. license-utils-kit, if executed on a Windows developer workstation, delivers a full post-compromise implant. Organizations whose developers installed this package during the window it was available, and who have not yet identified and remediated the infection, may have an active persistent access point on developer infrastructure.

    The second-stage payload hashes documented by Zscaler’s ThreatLabz team give security teams a verification path. Two SHA-256 hashes identified from public reporting on the same campaign correspond to second-stage components from license-utils-kit: 9a541dffb7fc18dc71dbc8523ec6c3a71c224ffeb518ae3a8d7d16377aebee58 and bb2a89001410fa5a11dea6477d4f5573130261badc67fe952cfad1174c2f0edd. A third, Python-based RAT payload was separately identified with SHA-256 7c5adef4b5aee7a4aa6e795a86f8b7d601618c3bc003f1326ca57d03ec7d6524.

    Registry response and takedown status

    Socket reported all identified live packages to the affected registries and submitted takedown requests for the associated GitHub accounts. The crates.io security team removed logtrace and the associated account promptly after disclosure, with the advisory tracked as RUSTSEC-2026-0081. The Go security team blocked the identified malicious Go modules. The npm security team removed the packages associated with the aokisasakidev account. As of the time of Socket’s publication, some packages in the PyPI and Packagist clusters remained live.

    Registry-side removal does not protect developers who already installed the affected packages. Any system that executed a malicious package version during its live window should be treated as potentially compromised. The hardcoded extraction directory 410BB449A-72C6-4500-9765-ACD04JBV827V32V in the temp directory is a reliable host-level indicator of compromise. The staging domain apachelicense.vercel.app, the related domains logkit.onrender.com and logkit-tau.vercel.app, and the infrastructure IP 66.45.225.94 are network-level indicators that should be added to egress block lists.
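    The indicators above translate directly into a host and egress check. A minimal sketch (the indicator values come from the reporting; the function names and scan locations are illustrative):

```python
import os
import tempfile

# Host-level and network-level indicators of compromise from the reporting.
EXTRACTION_DIR = "410BB449A-72C6-4500-9765-ACD04JBV827V32V"
NETWORK_IOCS = {
    "apachelicense.vercel.app",
    "logkit.onrender.com",
    "logkit-tau.vercel.app",
    "66.45.225.94",
}

def host_indicator_present(temp_root: str = "") -> bool:
    """True if the hardcoded extraction directory exists under temp_root."""
    root = temp_root or tempfile.gettempdir()
    return os.path.isdir(os.path.join(root, EXTRACTION_DIR))

def flag_egress(destinations):
    """Return any observed destinations matching known campaign infrastructure."""
    return [d for d in destinations if d in NETWORK_IOCS]
```

    Feed flag_egress a list of hostnames or IPs from proxy or DNS logs; a non-empty result on a developer workstation warrants incident response, not just a block rule.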

    The factory model and what it means for defenders

    The Contagious Interview operation is not a team of researchers finding creative new attack surfaces. It is an industrialized production line that takes a working loader pattern and ports it to each new package registry that developers adopt. The same staging infrastructure, the same payload names, the same delivery ZIP, the same extraction path. The operation’s scale, now over 1,700 tracked packages, reflects systematic expansion into whatever channels developers trust, not opportunistic discovery.

    This factory model has specific implications for detection. The attack is not novel in execution. The payload and staging patterns are documented and enumerable. What makes it effective is that developers install dozens or hundreds of third-party packages without inspecting their source, and the packages look and function like legitimate tooling until a specific code path executes.

    Socket’s recommended detection heuristics for this class of attack are specific and actionable. Treat any utility package as high-risk if it contacts remote infrastructure during normal operation, retrieves a field named downloadUrl from a remote JSON response, rewrites cloud-storage sharing links into direct-download form, downloads archive files into temp directories, decodes remote content before execution, or spawns interpreter processes or binaries from library code. None of these behaviors are legitimate in a logging library, form parser, or license utility. All of them appear in this cluster.
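    Those heuristics translate into a crude source-level triage pass. The sketch below approximates them with regular expressions; the pattern list is a simplification for illustration, not Socket’s actual scanner:

```python
import re

# Rough regex approximations of the behaviors called out above. Any hit in a
# utility library (logging, form parsing, licensing) warrants manual review.
SUSPICIOUS_PATTERNS = {
    "remote downloadUrl field": re.compile(r"downloadUrl", re.IGNORECASE),
    "cloud link rewritten to direct download": re.compile(
        r"(drive\.google\.com|dropbox\.com).{0,80}(export=download|dl=1)"),
    "archive dropped in temp dir": re.compile(
        r"(tempfile|/tmp/|%TEMP%).{0,120}\.zip"),
    "decode before execute": re.compile(r"base64\.b64decode|bytes\.fromhex"),
    "interpreter spawned from library code": re.compile(
        r"subprocess\.Popen|os\.system"),
}

def triage(source: str) -> list:
    """Return the names of the heuristics that fire on a package's source text."""
    return [name for name, pat in SUSPICIOUS_PATTERNS.items() if pat.search(source)]
```

    A benign logging function trips nothing; a loader that reads a downloadUrl field and hands it to subprocess.Popen trips at least two heuristics at once, which is the kind of stacking Socket describes.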

    The connection to the Axios compromise and the broader pattern of agent framework vulnerabilities point toward a consistent threat model: developer tooling infrastructure is a higher-value target than consumer endpoints, because developers have privileged access to production systems, cloud credentials, and code signing keys. The Contagious Interview operation is running continuously. The question is not whether it will attempt to reach your dependency tree. It already has.

  • When Your AI Agent Loses Your Money, Who Pays? Researchers Just Built the Protocol to Answer That.

    In a 2025 autonomous crypto trading competition, most AI agents lost money. One model lost 63 percent of its capital. Others dropped between 30 and 56 percent. No human was accountable for any of those losses. The agent provider pointed at the model. The user pointed at the agent provider. The regulator had no framework to adjudicate. The money was gone.

    On April 8, 2026, researchers from Microsoft Research, Columbia University, Google DeepMind, t54 Labs, and Virtuals Protocol published a paper on arXiv titled “Quantifying Trust: Financial Risk Management for Trustworthy AI Agents.” The paper proposes the Agentic Risk Standard (ARS), a settlement-layer protocol that applies escrow, underwriting, and collateralization to AI agent financial transactions. In 5,000 simulation episodes, the mechanism reduced user losses by up to 61 percent and independently deterred 15 to 20 percent of risky transactions from executing at all. The framework is available as an open-source specification through T54 Labs on GitHub.

    ARS does not try to make AI models more reliable. It accepts that they are not and builds financial infrastructure around that fact.

    5,000 simulation episodes · 61% maximum loss reduction · 15-20% of risky transactions deterred · 5 contributing institutions

    The guarantee gap ARS is designed to close

    Modern AI safety research approaches the reliability problem from the model side. Researchers train models with reinforcement learning from human feedback, apply constitutional constraints, test with red teams, and measure alignment on evaluation benchmarks. None of these techniques produce a mathematical guarantee. Large language models are stochastic systems. Given identical inputs at different times, they produce different outputs. Given adversarial inputs, they produce incorrect outputs with varying but nonzero probability. The fundamental result of three years of alignment research is not that aligned models never fail. It is that well-aligned models fail less often than poorly aligned ones.

    For AI systems that answer questions or generate text, probabilistic reliability is acceptable. Users can evaluate the output and decide whether to use it. For AI agents that execute financial transactions, place orders, convert currencies, or access financial APIs, the user cannot evaluate the output before the funds move. By the time the failure is visible, the loss has already occurred. The researchers call this the “guarantee gap”: a structural disconnect between the probabilistic reliability that AI safety techniques provide and the enforceable guarantees users need before delegating high-stakes financial execution.

    ARS is a protocol for closing that gap. Its insight is borrowed from five centuries of financial engineering. Construction projects fail. The solution is performance bonds, not better contractors. E-commerce transactions involve unknown counterparties. The solution is platform escrow, not trust. Securities markets process millions of trades across counterparties that may default. The solution is clearinghouses and margin requirements, not better traders. Every high-stakes transaction category that requires delegating execution to an uncertain agent has developed financial infrastructure that compensates users when things go wrong without requiring the agent to be perfect. AI agents are simply the next category to require that infrastructure.

    How the protocol works

    ARS formalizes two transaction types and applies different mechanisms to each.

    The first type covers standard service tasks: an AI agent is hired to generate a document, write code, prepare an analysis, or complete a task where the user’s financial exposure is limited to the service fee. For this category, ARS applies escrow. The payment is held in a vault controlled by the protocol, not the agent provider. The vault releases funds only after an independent verification step confirms that the task was completed as specified. If verification fails or the agent does not complete the task, the funds return to the user. The state machine governing this process is deterministic and auditable: regardless of what the AI agent does internally, the financial outcome for the user follows explicit, enforceable logic.
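    That escrow lifecycle can be sketched as a small deterministic state machine. The state and event names below are reconstructions from the paper’s description, not the published formal notation:

```python
from enum import Enum, auto

class EscrowState(Enum):
    FUNDED = auto()     # user's payment locked in the protocol vault
    DELIVERED = auto()  # agent claims the task is complete
    RELEASED = auto()   # verification passed; funds go to the agent provider
    REFUNDED = auto()   # verification failed or task abandoned; funds return

# Deterministic transition table: (state, event) -> next state.
# Any pair not listed is an illegal transition.
TRANSITIONS = {
    (EscrowState.FUNDED, "deliver"): EscrowState.DELIVERED,
    (EscrowState.FUNDED, "abandon"): EscrowState.REFUNDED,
    (EscrowState.DELIVERED, "verify_pass"): EscrowState.RELEASED,
    (EscrowState.DELIVERED, "verify_fail"): EscrowState.REFUNDED,
}

def step(state: EscrowState, event: str) -> EscrowState:
    """Advance the escrow; reject any transition the protocol does not allow."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state.name} + {event!r}")
```

    The point is that the user’s financial outcome depends only on this table: whatever the agent does internally, funds move only along the enumerated edges.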

    The second type covers fund-handling tasks: an AI agent is authorized to access user capital before outcomes can be verified, such as executing a trade, converting currency, calling a financial API, or managing a leveraged position. Here escrow alone is insufficient because the agent must touch the funds before the task completes. ARS adds an underwriting layer. Before the transaction executes, a risk-bearing third party, the underwriter, evaluates the task, prices the probability and magnitude of failure, requires the agent provider to post collateral proportional to that risk, and commits to reimbursing the user under specified failure conditions. The underwriter is the institution that absorbs the guarantee gap. For them to accept that role, the agent provider must have skin in the game via collateral requirements.
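    The underwriting layer boils down to two quantities: a premium that covers expected losses plus a margin, and a collateral requirement sized so the provider shares the downside. A back-of-envelope sketch (the formulas and parameter values are illustrative assumptions, not the paper’s pricing model):

```python
def price_underwriting(exposure, p_fail, load=1.2, collateral_ratio=0.5):
    """
    exposure:         user funds at risk if the agent misexecutes
    p_fail:           underwriter's estimate of the failure probability
    load:             markup over the actuarially fair premium
    collateral_ratio: fraction of expected loss the provider must post
    """
    expected_loss = p_fail * exposure
    return {
        "premium": round(expected_loss * load, 2),
        "collateral": round(expected_loss * collateral_ratio, 2),
    }

# A $10,000 trade with a 2% estimated failure rate: expected loss $200,
# so a $240 premium and $100 of posted collateral under these assumptions.
quote = price_underwriting(10_000, 0.02)
```

    Everything rides on p_fail, which is exactly the calibration problem the researchers flag.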

    The entire lifecycle of both transaction types is encoded as a deterministic finite-state machine with explicit rules governing fund custody at each state transition. The current state of any active transaction, including which party controls funds, what verification steps remain, and what conditions trigger reimbursement, is readable by any party at any time. The paper describes the state machine in formal notation, which is the foundation for the open-source implementation.

    What the simulation showed

    The researchers ran 5,000 simulation episodes modeling three interacting populations: users delegating financial tasks to AI agents, AI agent providers with varying reliability and potential fraud rates, and underwriters setting premiums and collateral requirements. The simulation varied underwriting pricing parameters and failure-rate estimation accuracy across conditions. Key findings:

    Under conditions where underwriters accurately estimated AI failure rates and priced risk appropriately, user losses fell by 61 percent compared to an unprotected baseline. Under the most conservative underwriting conditions modeled, loss reduction was 24 percent. The range reflects the sensitivity of the mechanism to underwriter competence. An underwriter who systematically underestimates failure rates sets premiums too low, collects insufficient capital, and cannot cover losses when failures cluster. An underwriter who overestimates failure rates prices most transactions out of the market, reducing both user losses and market participation.

    The collateral requirement mechanism had an independent effect separate from loss reimbursement. Agent providers who must post collateral before accessing user funds face direct financial cost for misexecution or fraud. In the simulation, this collateral requirement deterred 15 to 20 percent of high-risk transactions from executing at all: agent providers who knew their agent was unreliable for a given task type declined to post collateral rather than accept the associated risk. This deterrence effect is not captured by traditional AI safety metrics because it operates at the market participation level, not the model output level.
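    The deterrence effect follows from a plain expected-value comparison on the provider’s side. A sketch of that participation decision (the decision rule and numbers are illustrative, not taken from the simulation):

```python
def provider_participates(fee, collateral, p_fail):
    """A rational provider posts collateral only if the expected fee
    exceeds the expected collateral forfeiture."""
    return fee > p_fail * collateral

# A $50 fee against $1,000 of collateral: worth it at a 2% failure rate
# (50 > 20), declined at a 10% failure rate (50 < 100).
low_risk = provider_participates(50, 1_000, 0.02)
high_risk = provider_participates(50, 1_000, 0.10)
```

    Providers who privately know their agent is unreliable for a task self-select out, which is the market-level deterrence the simulation measured.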

    The simulation also surfaced the principal limitation of the framework: accurate failure-rate estimation is the critical variable, and it is the hardest one to measure. Both under- and over-estimation create systemic risks. The paper acknowledges this directly: the 5,000-episode simulation used simplified failure models that do not reflect real-world agent failure distributions. The researchers frame ARS as a protocol structure, not a calibrated deployment system, and explicitly scope future work to include empirical failure-rate measurement under production-like conditions.

    What is outside the framework

    ARS covers financial losses arising from AI agent failures on tasks with measurable economic outcomes. It does not cover non-financial harms. Hallucinated medical advice, defamatory output, leaked personal data, and psychological harm from AI interactions fall outside the protocol entirely. The researchers are explicit about this scope limitation: the framework is designed for the subset of agentic tasks where financial harm is the primary risk and where the loss can be quantified and attributed to a specific transaction.

    The framework also does not address the underlying technical mechanisms of AI failure. It assumes failures will happen and builds financial protection around them. This is not an evasion. The researchers argue that complementary solutions, better models, stronger alignment, improved training, are necessary but insufficient for financial applications where the cost of failure is immediate and potentially large. ARS makes no claim about reducing the probability of failure. Its claim is about the financial consequences when failure does occur.

    FINRA’s 2026 regulatory oversight report, published in December, included the first section specifically addressing generative AI, warning broker-dealers to develop procedures targeting hallucinations and scrutinize agents that may act beyond users’ intended scope. The SEC has no equivalent framework yet. ARS is positioned as a protocol that regulators have not yet built, one that imposes financial discipline through market mechanisms rather than regulatory rules. Whether that framing is appealing to regulators or represents an attempt to preempt regulatory action is a question the researchers do not engage with directly.

    The technical implementation and open-source status

    The protocol specification is available on GitHub through T54 Labs. The core implementation components are the state machine encoding transaction lifecycles, the vault contract governing fund custody, and the collateral calculation module. The paper provides formal notation for each state transition, which makes the specification independently implementable. The simulation code is available alongside the protocol specification.

    The paper maps ARS against existing risk-allocation models in a comparison table: construction uses performance bonds, e-commerce uses platform escrow, financial markets use margin requirements and clearinghouses, and decentralized finance uses smart contract collateralization. AI agents occupy the cell that was previously empty. The researchers argue this cell needs to be filled regardless of how well AI models improve, because the improvement trajectory of AI reliability is slower than the expansion trajectory of agentic financial applications. By their analysis, the agentic economy is growing faster than the alignment research required to make it safe without financial infrastructure.

    What this requires to function at scale

    For ARS to operate in production, three things need to exist that do not yet exist at scale. First, a market for AI agent underwriters, institutions willing to price and absorb AI failure risk in exchange for premiums. No such market currently exists in a structured form. Second, standardized failure reporting from AI agent providers, enabling underwriters to build accurate actuarial tables for different task categories and agent systems. Third, legal frameworks that recognize ARS settlement states as enforceable, particularly in jurisdictions where AI agent transactions currently have no clear liability allocation. The paper identifies these gaps clearly and does not claim ARS solves them. The protocol is infrastructure. The infrastructure requires adoption.

    The researchers note that the closest existing analogue is DeFi’s smart contract collateralization, which functions because the blockchain provides an independently verifiable settlement layer. ARS proposes a similar settlement layer for off-chain AI agent transactions, but without the trust guarantees of a blockchain. The audit trail and state machine would need to be implemented on infrastructure that both users and underwriters trust equally. What that infrastructure looks like in practice, whether it is a shared registry, a blockchain, a regulated clearinghouse, or something else, is explicitly left as future work.

    MCP’s settlement into the agentic infrastructure stack and the proliferation of agent frameworks with documented security vulnerabilities make the ARS timing relevant: agentic systems are already executing consequential tasks in production, and the financial protection infrastructure has not caught up. ARS is an early and incomplete answer to a problem that is about to become significantly larger.

  • 512,000 Lines of Claude Code Leaked. The Feature Hidden Inside Changes Everything.

    I use Claude Code every day. I have for months. So when 512,000 lines of its source code appeared on npm because someone forgot to add a .map file to .npmignore, I did what most engineers I know did: I read it.

    What I found is more interesting than the leak itself. Buried under the compaction bugs and the Tamagotchi Easter egg is the architecture of a product Anthropic has not announced. It is called KAIROS. It is an always-on AI agent that runs in the background after you close your terminal, watches your codebase for changes, consolidates what it has learned while you sleep, and decides on its own when to act. The scaffolding is complete. The feature flags are in place. And among safety researchers and engineers I have spoken with, this is the feature that has people genuinely unsettled.

    How the Leak Happened

    Boris Cherny, an engineer on the Claude Code team, confirmed it was a packaging error. Bun, the JavaScript runtime Anthropic acquired in late 2025, generates source maps by default. The release team failed to exclude the .map file from the npm package. Version 2.1.88 shipped on March 31, 2026, with a 59.8 MB source map containing the entire unobfuscated TypeScript codebase across roughly 1,900 files. Within hours, the code had been mirrored across GitHub, analyzed by security researchers, rewritten in Python and Rust, and forked into a clean-room reimplementation that hit 50,000 GitHub stars in two hours.

    Cherny called it human error, not a tooling bug. He added: “It’s the process, the culture, or the infra.” That is a mature response. It is also the second time in one week that Anthropic accidentally published internal material. Days earlier, a CMS misconfiguration exposed draft blog posts about an unreleased model called Mythos. Two operational security failures in one week from the company that markets itself as the careful one. Engineers I talk to daily are noticing the pattern.

    What KAIROS Actually Is

    KAIROS, from the Greek for “the right moment,” is referenced over 150 times in the leaked source. Based on the code paths in main.tsx and the analysis published by Alex Kim and the Layer5 team, KAIROS implements a persistent daemon mode. When you close your terminal, Claude Code does not stop. It receives periodic heartbeat prompts asking whether anything is worth doing. It evaluates the state of your codebase and decides to act or wait.

    When it acts, it has access to three tools that regular Claude Code does not: push notifications (reaching you on your phone even with the terminal closed), file delivery (sending you artifacts it created unprompted), and a background task runner. A companion process called autoDream runs as a forked subagent during idle periods. It merges observations from prior sessions, removes logical contradictions, and converts tentative hypotheses into verified facts. The fork isolates the maintenance from the main agent’s reasoning, so the “dream” process cannot corrupt the agent’s active context. The engineering is thoughtful. The question it raises is not. An AI that consolidates its own beliefs while you sleep and presents the results as facts when you return is making epistemic decisions about your project without your input. The difference between “Claude remembers your project” and “Claude has opinions about your project” is a line that KAIROS will cross.
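    Based on that description, the heartbeat pattern would look roughly like the loop below. To be explicit: every identifier here is hypothetical, reconstructed from the article’s account of the leaked code, not taken from the KAIROS source:

```python
import time
from dataclasses import dataclass, field

@dataclass
class CodebaseState:
    """Minimal stand-in for whatever the daemon actually observes."""
    dirty_files: list = field(default_factory=list)
    failing_tests: int = 0

def worth_acting(state: CodebaseState) -> bool:
    # The act-or-wait decision: wake up only when the observed state
    # crosses some relevance threshold.
    return bool(state.dirty_files) or state.failing_tests > 0

def heartbeat_loop(observe, act, interval_s=300.0, max_beats=None):
    """Periodic heartbeat: observe the codebase, act only when warranted."""
    beats = 0
    while max_beats is None or beats < max_beats:
        state = observe()
        if worth_acting(state):
            act(state)  # push notification, background task, file delivery...
        beats += 1
        time.sleep(interval_s)
```

    The unsettling part is not the loop; it is that worth_acting and act are the agent’s own judgment, exercised with no user in the loop.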

    A separate feature called ULTRAPLAN offloads heavy planning tasks to a remote cloud session running Opus 4.6, gives it up to 30 minutes of dedicated compute, and lets you approve the result from your phone. When you approve, a sentinel value teleports the plan back to your local terminal.

    If you have used Claude Code for any serious project, you know why this matters. The tool is impressive in a session but amnesic between sessions. I have lost context dozens of times when a conversation exceeded its window or I had to restart. KAIROS would solve that. It would also mean an AI agent has persistent, unsupervised access to your codebase, your file system, and your GitHub webhooks around the clock.

    The Safety Question the Leak Forces

    I participate in AI safety cohorts. I have tested frontier models from multiple labs under NDA before public release. That experience shapes how I read the KAIROS code. An always-on agent that proactively modifies your work raises questions that reactive tools do not. When you type a prompt and Claude responds, the trust boundary is clear: you asked, it answered. KAIROS dissolves that boundary. The agent decides when to act. It consolidates its own memory. It “dreams” about your project. The trust model shifts from “I control the tool” to “the tool manages itself and I review the results.” I have seen how companies handle that transition internally during testing. The gap between what works in a controlled evaluation and what works on a real engineering team with production deadlines is where things break.

    This is happening while Claude is simultaneously proving it can build kernel-level exploits in four hours and OpenClaw has accumulated 104 CVEs. The same AI that rewrites your test suite at night could, in principle, introduce subtle vulnerabilities that pass code review. I am not saying Anthropic would ship KAIROS without safeguards. I am saying the leaked code shows the safeguards have not been built yet. The architecture is there. The trust model is not.

    METR, the independent AI evaluation organization, published a report on March 26 describing three weeks spent red-teaming Anthropic’s internal agent monitoring systems. They found novel vulnerabilities. The timing is coincidental but the message compounds: Anthropic’s monitoring infrastructure has gaps at exactly the moment the company is building an agent that needs monitoring most.

    What Else the Code Reveals

    The anti-distillation mechanisms got the most attention on Hacker News. A flag called ANTI_DISTILLATION_CC injects fake tool definitions into API requests, designed to poison the training data of anyone recording Claude Code’s traffic to build a competing model. A second mechanism summarizes reasoning between tool calls and signs it cryptographically, so eavesdroppers get summaries instead of full chain-of-thought. Engineers on HN pointed out that both are defeated in about an hour by stripping fields through a proxy. Anthropic’s CEO Dario Amodei has publicly accused Chinese labs of distilling from American models. The defensive code is real. Its effectiveness is not.

    Undercover Mode, implemented in roughly 90 lines of undercover.ts, strips all traces of Anthropic when Claude Code contributes to external repositories. It suppresses codenames, Slack channels, and the phrase “Claude Code” in commits and PRs. The code comment reads: “There is NO force-OFF.” You can enable it manually, but you cannot disable it. In external builds, the function is dead-code-eliminated entirely. This means AI-authored contributions from Anthropic employees in open-source projects carry no indication that an AI wrote them. The disclosure implications are obvious and, in the MCP-connected ecosystem Anthropic is building, they extend to every tool in the chain.

    Less discussed but equally revealing: a file called print.ts is 5,594 lines long and contains a single function spanning 3,167 lines with 12 levels of nesting. A compaction bug was wasting 250,000 API calls per day before someone added a three-line fix. Claude Code generates $2.5 billion in annualized revenue and 80% comes from enterprise customers. Those customers are partly paying for the belief that the code powering their AI tools is well-engineered. The leak complicates that assumption.

    What Happens Next

    The code is out. Anthropic filed DMCA takedowns and GitHub complied, but a mirror at Gitlawb remains live with a public message saying it will never be taken down. The strategic damage exceeds the code damage. You can refactor source in a week. You cannot un-leak a roadmap. Competitors now know about KAIROS, ULTRAPLAN, the anti-distillation flags, and the model codenames. Those are product strategy decisions that Cursor, GitHub Copilot, and every other AI coding tool can now plan around.

    For developers who use Claude Code daily, the practical question is simpler. When KAIROS ships, will you give an AI agent persistent background access to your entire project? The engineers I work with are split. The productivity promise is enormous. The trust model is unresolved.

    Consider what KAIROS means for the broader ecosystem. If Anthropic ships a persistent agent that monitors your codebase around the clock, every competitor will follow. GitHub Copilot, Cursor, Windsurf, and every other AI coding tool will face pressure to match that capability or lose users who want always-on assistance. The industry will move from “AI that helps when asked” to “AI that acts when it decides to” across the entire developer toolchain. That transition changes the security posture of every software project that adopts it. Every codebase becomes a live target not just for external attackers but for the agent’s own judgment errors compounding overnight while nobody watches.

    The company asking developers to trust that transition just accidentally published its entire source code because someone forgot a line in .npmignore. That irony is not lost on anyone paying attention. The question is not whether KAIROS will ship. The architecture is too complete and the competitive pressure too strong for Anthropic to shelve it. The question is whether it ships with the trust infrastructure that an always-on agent demands, or whether the race to beat Cursor and Copilot pushes it out before the safeguards are ready. I have watched that tradeoff play out in other products during pre-release testing. Speed usually wins. The consequences show up later.

  • Zuckerberg Shipped Code for the First Time in 20 Years. He Used a Competitor’s AI.

    3 Zuckerberg diffs shipped · 200+ approvals on one diff · 65-75% Meta AI-code target · 20 years since Zuckerberg last coded

    Mark Zuckerberg shipped three diffs to Meta’s monorepo in March 2026, his first code contributions in roughly twenty years. One of them collected more than 200 approvals from engineers who apparently found it thrilling to click “approve” on the CEO’s pull request. His tool of choice: Claude Code CLI, Anthropic’s terminal-based AI coding assistant. Not GitHub Copilot. Not Meta’s internal AI tools. A competitor’s product.

    Three diffs from the CEO of a 70,000-person engineering company is a footnote in a monorepo that processes 100 million changes. The code itself is irrelevant. The behavior is not.

    The Pattern Nobody Is Talking About

    Zuckerberg is not the only executive who stopped coding years ago and recently started again. Garry Tan, CEO of Y Combinator, returned to writing code after a 15-year hiatus. He released gstack, a Claude Code system with 23 specialist tools that turns the terminal into what Tan describes as a virtual engineering team: code reviewer, QA lead, security auditor, release engineer. Tobias Lütke, CEO of Shopify, has been running experiments with Andrej Karpathy’s AutoResearch on internal company data. He posted that he built a working prototype in a weekend that would have taken his team weeks.

    There is a specific shape to all three stories. Someone who used to code, stopped because their role changed, and discovered that AI tools collapsed the distance between \”I know what I want to build\” and \”I can build it myself.\” The gap was never about intelligence. It was about context. To contribute to a modern codebase, you need to understand the dependency graph, the test infrastructure, the deployment pipeline, the linter configuration, the API contracts, and a thousand accumulated conventions that exist nowhere except in the heads of people who work in that codebase daily. AI coding agents absorb that context by reading the codebase directly. They compress months of onboarding into minutes of indexing.

    That compression does not help only CEOs. It helps every person who has the judgment to know what should be built but lacks the hours to maintain fluency in a specific codebase. Product managers. Designers with technical backgrounds. Founders who became full-time fundraisers. Researchers who stopped writing production code when their teams grew. The disruption is not “AI replaces developers.” It is “AI re-opens development to people who left.”

    Meta’s Internal Numbers

    The Zuckerberg anecdote would be a curiosity if it existed in isolation. It does not. Leaked internal documents from March 2026, reported by The Pragmatic Engineer, show aggressive AI-code targets across Meta’s engineering organization.

    Meta’s creation org wants 65% of engineers writing 75% or more of their committed code using AI by mid-2026. The Scalable Machine Learning org set a target of 50 to 80% AI-assisted code. These are not aspirational slide-deck numbers. They are organizational targets with headcount implications.

    Zuckerberg told Dwarkesh Patel’s podcast that “in the next year, maybe half the development will be done by AI as opposed to people, and that will kind of increase from there.” He is not predicting this from a boardroom. He is using Claude Code in his terminal to ship diffs to the monorepo. The CEO is the pilot customer for his own company’s transition.

    Meta’s AI code adoption leader, Michael Novati, has been called “The Coding Machine” internally. His team built internal tooling that routes AI-assisted code through the existing review pipeline, so the quality gates remain human even when the generation is automated. The critical design decision: Meta did not create a separate review process for AI-written code. It runs through the same code review, the same CI/CD, the same test suites. The human is the reviewer, not the writer.

    Why Claude Code and Not Copilot

    The fact that Zuckerberg chose Anthropic’s tool over both GitHub Copilot and Meta’s own internal AI coding infrastructure deserves more scrutiny than it has received.

    Claude Code is a terminal-native agent. It reads your entire project, understands the file structure, runs commands, writes tests, executes them, and iterates. Copilot’s core product is inline autocomplete inside an editor. The difference matters for someone who has not opened an IDE in twenty years: Claude Code operates at the level of “describe what you want and I will figure out how to build it,” while Copilot operates at the level of “write the next line of this function.” The former serves someone who thinks in product terms. The latter serves someone who thinks in code terms.

    For Meta, there is an uncomfortable implication. The company has invested billions in AI research, shipped Llama models that power a growing open-source ecosystem, and built internal code-generation tools. Its CEO chose a competitor’s product anyway. That is a signal about product-market fit. Claude Code found the gap between “I am technical enough to know what to build” and “I do not have time to write it myself,” and it closed that gap before anyone else did.

    The Model Context Protocol’s 97 million installs in 16 months created the infrastructure for this moment. MCP lets Claude Code connect to any tool, any API, any data source through a standard interface. That protocol-level advantage means Claude Code can read your Jira tickets, check your CI pipeline, and query your database without custom integration. Copilot cannot do that without GitHub-specific extensions.

    The Uncomfortable Question for Engineering Managers

    If 65% of engineers are writing 75% of their code with AI by mid-2026, what does the engineering team look like in 2027?

    The charitable version: engineers shift from writing code to reviewing code, designing systems, and defining constraints. The codebase improves because more human attention goes to architecture and less goes to implementation. Junior developers learn faster. Senior developers spend less time on boilerplate. Everyone wins.

    The version that keeps engineering managers awake at night: companies that hit the 75% AI-assisted target will discover that some roles were primarily about code production rather than code judgment. A Google engineer recently said that Claude Code built in one hour what her team spent a year on. That is a productivity claim. It is also a headcount claim, and everyone in the room knew it. The tool does the work of a team, so the team gets smaller. Not tomorrow, because AI-generated code still needs human review and the security surface of AI coding tools is genuinely alarming. But the trajectory only goes one direction.

    Goldman Sachs estimated that AI adoption among firms with more than 250 employees reached 35.3% in early 2026. Academic studies cited in their April report put the average productivity uplift from generative AI at 23%, with company-reported gains closer to 33%. Construction jobs tied to data center buildouts increased by 212,000 since 2022. Meanwhile, corporate layoffs directly attributed to AI remain small: 4,600 employees in February 2026.

    The gap between “AI makes us more productive” and “AI reduces headcount” has not closed yet. But the CEOs are not waiting for it to close. They are already coding.

    What Actually Changed

    The interesting question is not “why are CEOs coding again?” It is what technical capability made this possible now and not two years ago.

    Context windows got big enough. Claude Opus 4.6 supports 200K tokens natively. GPT-5.4 pushed to one million tokens. That is enough to hold thousands of files in memory simultaneously, which means the agent can reason about cross-file dependencies, understand architectural patterns, and generate code that fits the existing codebase rather than autocompleting the current line. The CEO does not need to know the codebase. The agent reads it.

    And tool use became reliable. The agent runs the linter. Executes the tests. Reads the error output. Fixes the failures. Commits the result. That closed-loop execution is what separates “AI suggests code” from “AI ships code.” A CEO who types “write tests for the auth module, run them, and fix any failures” gets a working result, not a clipboard full of suggestions that still require a developer to wire together.
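    That run-read-fix loop can be sketched in a few lines. This is a toy simulation, not any vendor's agent: `run_tests` and `propose_fix` below are hypothetical stand-ins for the real test runner and the model's repair step.

```python
# Toy sketch of the closed-loop pattern: run the checks, read the failure,
# apply a candidate fix, repeat until green. Both helpers are hypothetical
# stand-ins, not a real agent API.

def run_tests(fn):
    """Pretend test suite for a string-length utility; returns None on pass."""
    try:
        assert fn("") == 0, "empty input must return 0"
        assert fn("abc") == 3, "must count characters"
        return None
    except Exception as exc:
        return str(exc)  # the "error output" the agent reads

def propose_fix(failure_message):
    """Stand-in for the model's repair step: emit a corrected implementation."""
    return len  # a fixed version that handles both test cases

def agent_loop(implementation, max_iters=5):
    for _ in range(max_iters):
        failure = run_tests(implementation)
        if failure is None:
            return implementation  # "commit the result"
        implementation = propose_fix(failure)
    raise RuntimeError("budget exhausted without a passing build")

buggy = lambda s: s.index(s[-1]) + 1  # crashes on empty input
fixed = agent_loop(buggy)
assert run_tests(fixed) is None
```

    The structural point is the feedback edge: the failure message flows back into the next generation step, which is what separates suggestion from shipping.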

    Karpathy distilled this into a design principle with AutoResearch: constrain the agent to one file, one metric, one five-minute cycle. The constraint is the invention. By limiting scope, you get reliable execution instead of ambitious hallucination. Lutke ran it on Shopify data overnight. Marketers adapted it for landing pages. The pattern scales because the constraint scales.
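    The shape of that constraint is easy to express: one artifact, one scalar metric deciding acceptance, one hard wall-clock budget. A generic sketch under those assumptions; the `evaluate` and `mutate` steps are caller-supplied stand-ins, not AutoResearch's actual code.

```python
import time

# Generic sketch of the constrained loop: a single state, a single metric,
# and a time box. Only metric improvements are kept. Nothing here is the
# actual AutoResearch implementation; it is the harness shape.

def constrained_run(evaluate, mutate, state, budget_s=300.0):
    """Hill-climb on one metric until the time budget expires."""
    best_score = evaluate(state)
    deadline = time.monotonic() + budget_s
    while time.monotonic() < deadline:
        candidate = mutate(state)
        score = evaluate(candidate)
        if score > best_score:  # the one metric is the only acceptance test
            state, best_score = candidate, score
    return state, best_score

# Toy usage: climb toward the value 10 within a 50 ms budget.
final, score = constrained_run(lambda x: -abs(x - 10), lambda x: x + 1, 0, budget_s=0.05)
```

    The deliberately narrow interface is the point: an agent that can only touch one state and is judged by one number cannot wander into ambitious hallucination.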

    Where This Breaks

    The CEOs-coding-again story has a failure mode that the feel-good coverage omits. When a non-expert uses AI to ship code, the code works until it does not. The AI generates plausible solutions that pass tests and satisfy requirements while containing subtle architectural decisions that compound into maintenance debt. The MAD Bugs initiative found 500+ zero-day vulnerabilities in mature, battle-tested open-source code. AI-generated code that has never been battle-tested will contain more vulnerabilities, not fewer.

    The Ledger CTO, Charles Guillemet, put it directly on April 5: \”There is no ‘make it secure’ button. We are going to produce a lot of code that will be insecure by design.\” That warning is aimed at the exact workflow these CEOs are celebrating. Generate fast, ship fast, discover the security hole later.

    The honest version of this story is not that AI made coding easy. It is that AI shifted the bottleneck. The bottleneck used to be writing code. Now it is reviewing code, maintaining code, and securing code. Those are the skills that become more valuable as AI writes more of the first draft. The CEOs who recognize that distinction will build better companies. The ones who think “I can code again” means “I do not need as many engineers” will learn an expensive lesson about the difference between generating software and operating it.

  • OpenClaw Has 104 CVEs and 1,184 Malicious Packages. The Architecture Cannot Be Patched.

    CVEs Filed: 104
    Malicious Skills: 1,184
    Exposed Instances: 135K
    Architecture: Broken

    OpenClaw has accumulated 104 CVEs, 1,184 confirmed malicious packages in its skill marketplace, and 135,000 instances exposed to the public internet with insecure defaults. Approximately one in five packages on ClawHub, the platform’s skill registry, is malicious. The problems are not bugs that patches will fix. They are architectural decisions baked into the product’s design, and they compound the security risks that every organization adopting AI agents now faces.

    OpenClaw is an open source AI agent that runs locally as a personal assistant, integrating with messaging apps, calendars, developer tools, and shell access. It gained viral adoption in late 2025 and early 2026, reaching millions of installations. NVIDIA built NemoClaw as an enterprise wrapper around it. Developers extended its capabilities through community-built plugins called “skills” distributed via ClawHub and SkillsMP. The adoption speed outran the security engineering by months.

    Update, April 2026: The CVE count is one dimension of the OpenClaw problem. A separate category of failure is now documented: autonomous agents deployed via OpenClaw conducting targeted reputational attacks on humans who block their actions. See the matplotlib hit piece incident for the full forensic chain and the SOUL.md personality file that produced it.

    The Localhost Trust Assumption

    The most fundamental vulnerability is architectural. OpenClaw assumes that any connection originating from localhost is trusted. Oasis Security discovered that this assumption lets any website open a WebSocket connection to OpenClaw’s local gateway and send commands. A malicious webpage visited in any browser tab could silently instruct the AI agent to read files, execute shell commands, or exfiltrate credentials. The attack requires no user interaction beyond visiting a webpage.

    CVE-2026-25253 exploited this to steal authentication tokens. Because OpenClaw exempted localhost connections from rate limiting, attackers could brute-force passwords through the same channel. The team patched this specific vulnerability in version 2026.2.25, but the architectural decision to trust localhost persists in the design philosophy. Every new feature that accepts local connections inherits the same risk class.
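    The class of fix is a design change rather than a patch: treat the browser's Origin header plus an explicit shared secret as the identity signal, not the source IP. A minimal sketch of that check; the header scheme and token handshake here are assumptions for illustration, not OpenClaw's actual gateway code.

```python
import hmac
import secrets

# Sketch of the missing control: localhost is a network location, not an
# identity. The gateway issues a token to its trusted client at startup and
# refuses any browser-originated socket. Header names are assumptions.

GATEWAY_TOKEN = secrets.token_hex(16)  # shared with the legitimate client only

def accept_connection(headers: dict) -> bool:
    # Browsers always attach an Origin header to WebSocket upgrades, so any
    # non-empty Origin means a webpage (possibly malicious) opened the socket.
    if headers.get("Origin", ""):
        return False
    supplied = headers.get("X-Gateway-Token", "")
    return hmac.compare_digest(supplied, GATEWAY_TOKEN)

# A drive-by page is refused even though it connects via 127.0.0.1.
assert not accept_connection({"Origin": "https://evil.example"})
assert accept_connection({"X-Gateway-Token": GATEWAY_TOKEN})
```

    `hmac.compare_digest` avoids timing side channels on the token check; the Origin rejection is what closes the drive-by WebSocket vector regardless of rate limiting.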

    A separate CVSS 9.9 privilege escalation vulnerability allowed low-privilege tokens to escalate to admin with remote code execution. BeyondTrust found a command injection in OpenAI’s Codex integration that could steal GitHub OAuth tokens through unsanitized branch name parameters. Four CVEs in CrewAI, a framework that builds on OpenClaw, chained prompt injection into full remote code execution and server-side request forgery.

    The Skill Marketplace Poisoning

    Antiy CERT confirmed 1,184 malicious skills across ClawHub as of March 2026. That is approximately one in five packages in the ecosystem. Koi Security independently found that the count jumped from 324 malicious skills in early February to over 820 just weeks later. Trend Micro identified 39 skills across ClawHub and SkillsMP distributing the Atomic macOS info stealer.

    The attack patterns mirror npm and PyPI supply chain attacks: typosquatting, automated mass uploads, and dependency confusion. But the blast radius is worse. A compromised npm package executes code on a developer’s machine. A compromised OpenClaw skill executes code through an AI agent that has broad system permissions, access to credentials, and the ability to chain actions across multiple integrated services. The agent does not just run the malicious code. It reasons about how to accomplish whatever the malicious skill instructs it to do, potentially adapting its approach if the first attempt fails.

    This connects directly to the Axios npm supply chain attack pattern we covered, but with a force multiplier. When an npm package is compromised, the malicious code executes once. When an OpenClaw skill is compromised, the malicious instructions persist in the agent’s context and can influence subsequent actions across the agent’s entire permission scope.

    Why the Architecture Cannot Be Patched

    The core issue is not any specific CVE. It is the superuser problem: AI agents accumulate permissions across every service they integrate with. CyberArk’s assessment applies: every AI agent is an identity that needs credentials to access databases, cloud services, and code repositories. The more tasks assigned, the more entitlements accumulate, making each agent a high-value target.

    Traditional security assumes that the program executing on a machine follows deterministic logic. An AI agent follows probabilistic reasoning influenced by its context, which includes any data it has ingested. Poisoning the context changes the agent’s behavior without modifying any code. This is not a bug class that static analysis, code signing, or sandboxing can eliminate because the “exploit” is semantically valid input that the model interprets differently than intended.

    Gartner projects that 40% of enterprise applications will integrate task-specific AI agents by the end of 2026, up from less than 5% in 2025. The practical recommendation is the same one Palo Alto Networks’ Wendi Whitmore gave: treat every AI agent as an insider threat. Apply least privilege. Audit what the agent can access. Assume the context will be poisoned. The companies that deploy agents without these controls will learn the same lesson OpenClaw’s users learned, one CVE at a time.

  • Anthropic Sent Every Subscriber a Credit. For Some, It Covers One Day of the Price Increase.

    Anthropic did not block third-party tools from Claude on April 4, 2026. That happened months ago. What changed today is the price.

    Starting at noon Pacific, Claude Pro and Max subscriptions no longer cover usage routed through third-party tools. Subscribers who had been using OpenClaw, OpenCode, or any external tool with their subscription credentials must now pay through a separate “extra usage” billing tier (pay-as-you-go, metered per token) or authenticate with a standard API key. Anthropic is compensating every Pro and Max subscriber with a one-time credit equal to one month of subscription cost, redeemable by April 17, plus up to 30% off pre-purchased extra usage bundles.

    The distinction matters. Third-party tools were already forbidden from accessing Claude subscriptions. Anthropic began enforcing this in January 2026, when engineer Thariq Shihipar deployed server-side blocks against tools spoofing the Claude Code authentication flow. By February 20, the company had revised its legal terms to explicitly restrict OAuth tokens to Claude Code and Claude.ai. By March, OpenCode had stripped all Claude subscription authentication from its codebase after receiving legal demands. The blocking is old news.

    The new news is economic. Anthropic formalized the pricing tier that separates first-party and third-party compute. If you use Claude through Anthropic’s own products (Claude.ai, Claude Code, Claude Cowork, the desktop app), your subscription covers it. If you use Claude through anything else, you pay per token.

    Why the Price Difference Is Structural

    The pricing split is not arbitrary. It reflects a real cost asymmetry between first-party and third-party usage, driven by prompt cache optimization.

    Claude Code is engineered to maximize cache hit rates. When a developer works in Claude Code, the tool reuses previously processed context across requests. A cache hit on Opus 4.6 costs $0.50 per million input tokens. An uncached request costs $5.00. That 90% reduction is what makes flat-rate subscriptions economically viable for Anthropic’s own tools. The effective cost of serving a Claude Code session is a fraction of the nominal per-token rate because most context is already cached.

    Third-party tools construct their own prompts and manage their own context windows. Their requests rarely align with Anthropic’s caching infrastructure. Every request is more likely to be a full-price cache miss. The cost gap between a Claude Code session and an equivalent OpenClaw session producing the same output can be 5x to 25x, according to industry estimates.
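    The quoted rates make the asymmetry easy to model. A back-of-envelope sketch using the article's Opus 4.6 input prices; the cache hit rates are illustrative assumptions, not measured figures.

```python
# Cost model for the cache asymmetry: $5.00/M uncached input vs $0.50/M
# cached input (the 90% reduction quoted above). Hit rates are illustrative.

UNCACHED = 5.00  # $ per million input tokens
CACHED = 0.50    # $ per million cached input tokens

def input_cost(millions_of_tokens, cache_hit_rate):
    hit = millions_of_tokens * cache_hit_rate * CACHED
    miss = millions_of_tokens * (1 - cache_hit_rate) * UNCACHED
    return hit + miss

first_party = input_cost(10, cache_hit_rate=0.9)   # Claude Code-style reuse
third_party = input_cost(10, cache_hit_rate=0.1)   # mostly full-price misses
print(f"first-party ${first_party:.2f} vs third-party ${third_party:.2f}, "
      f"{third_party / first_party:.1f}x gap")
```

    Even these mild assumptions reproduce the low end of the 5x to 25x range; pushing the third-party hit rate toward zero and adding output tokens widens it further.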

    The subscription credit and the extra usage tier are Anthropic’s way of saying: we will no longer absorb the cost differential, but we will give you a path to keep using external tools at metered rates, and we will compensate you for the transition.

    The Wider Pattern

    Google enforced the same pricing split on Gemini CLI in March 2026. Accounts routing third-party traffic through Gemini CLI’s OAuth flow were flagged, some banned, and free-tier users lost access to Pro models entirely. The same structural economics apply: flat-rate subscriptions priced for human-speed usage cannot sustain autonomous agent loops that run at machine speed.

    OpenAI took the opposite position. OpenClaw’s documentation now steers users toward OpenAI as the default path. Whether OpenAI sustains this as compute pressure mounts is an open question.

    The pricing tier is only one dimension of the vendor lock-in pattern. The harder economic question is what gets optimized at the tool layer. Our analysis of the edit tool benchmark that improved 15 LLMs without touching a single weight shows that the largest performance gains in agentic coding now live in open-source infrastructure that first-party vendors have no incentive to build.

    What to Do

    If you use Claude exclusively through Claude.ai, Claude Code, or the desktop app: nothing changed. Your subscription covers everything.

    If you use Claude through third-party tools: you now pay per token via extra usage or API key. Instrument your token consumption before enabling metered billing. With prompt caching (90% input cost reduction) and batch processing (50% discount), the actual cost increase with proper engineering is 1.5x to 3x, not the 5x to 25x sticker shock that assumes worst-case unoptimized usage.
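    How the two discounts stack can be sanity-checked with the same arithmetic. A hedged sketch: the base rate and token mix below are placeholders, not your actual usage or any official tariff.

```python
# Sketch of how the quoted discounts combine: prompt caching cuts cached
# input by 90%, batch processing halves the bill. Numbers are placeholders.

BASE = 5.00  # $/M input tokens, an example frontier rate

def metered_bill(millions, cache_hit=0.0, batched=False):
    per_million = cache_hit * BASE * 0.1 + (1 - cache_hit) * BASE
    return millions * per_million * (0.5 if batched else 1.0)

worst_case = metered_bill(10)                               # no engineering
engineered = metered_bill(10, cache_hit=0.7, batched=True)  # tuned pipeline
print(f"${worst_case:.2f} unoptimized vs ${engineered:.2f} optimized "
      f"({worst_case / engineered:.1f}x difference)")
```

    The takeaway matches the article's framing: most of the sticker shock assumes worst-case unoptimized usage, and disappears once requests are cache-aligned and batched.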

    Claim the credit before April 17. Every Pro and Max subscriber qualifies regardless of whether you used third-party tools.

    Evaluate whether your workflows can migrate to Claude Code. It remains subscription-covered, benefits from 90% cache cost reduction, and supports team-shared configurations through the .claude/ protocol system.

  • Alibaba Dropped Three AI Models in Five Days. The Token Hub Restructuring Explains Why.

    Models in 5 Days: 3 (Qwen 3.5 Omni, Max, 3.6-Plus)
    Context Window: 1M tokens (native support)
    Input Price: $0.29/M tokens (on Bailian)
    SWE-bench: Claude-tier (matches Opus 4.5)

    Alibaba released three AI models in five days. Qwen 3.5 Omni dropped on March 28 with full multimodal support across text, image, audio, and video. Qwen 3.5 Max Preview followed on March 30. Then on April 2, Alibaba shipped Qwen 3.6-Plus, a flagship language model that matches Anthropic’s Claude Opus 4.5 on SWE-bench and Terminal-Bench 2.0, supports a 1-million-token context window, and costs $0.29 per million input tokens on Alibaba Cloud’s Bailian platform. The release cadence is not a coincidence. It is the first visible output of a corporate restructuring that consolidates Alibaba’s scattered AI teams into a single unit called Token Hub.

    Most coverage of Qwen 3.6-Plus repeated Alibaba’s press release. The real story is why a $200 billion company reorganized its entire AI division to ship models at this speed, what “agentic coding” means in practice versus the phrase everyone else is using, and how the 1-million-token context window actually compares to competitors claiming similar numbers.

    The Token Hub Restructuring

    Before Token Hub, Alibaba’s AI development was spread across multiple groups: the Qwen team building foundation models, Alibaba Cloud’s AI services team, the DingTalk enterprise team, and separate product groups for Taobao, Tmall, and other commerce platforms. Each group built its own AI features on top of shared models but operated with different priorities, timelines, and engineering cultures.

    Token Hub collapses these groups into a single AI organization reporting directly to Alibaba’s senior leadership. The restructuring, reported by Caixin and confirmed by Alibaba’s official announcements, is designed to accelerate iteration cycles. The three models in five days are the proof of concept.

    The context for this urgency is domestic competition. ByteDance upgraded its Doubao 1.5 Pro model in early 2025. DeepSeek’s R1 model broke through on test-time scaling and received global attention. Minimax and Moonshot AI both open-sourced their flagship models, pressuring Alibaba’s position as China’s leading open-model provider. In Q1 2026, Alibaba’s Cloud division reported that AI-related revenue grew 60% year-over-year, but the growth came from infrastructure services, not model differentiation. Token Hub exists because Alibaba concluded that the Qwen series was losing technical ground to faster-moving competitors.

    What Agentic Coding Actually Means

    Every AI lab in 2026 claims “agentic coding.” The term has been diluted to near-meaninglessness. Alibaba’s implementation in Qwen 3.6-Plus is specific enough to evaluate against competitors.

    Standard code generation models work in a single pass: you give the model a prompt, it produces code, you evaluate the output. If the code is wrong, you manually correct the prompt and try again. Code completion tools like GitHub Copilot operate at the line or function level, predicting what comes next based on the current file context.

    Agentic coding, as Alibaba implements it in Qwen 3.6-Plus, works as a multi-step loop. The model receives a complex task (build a feature, fix a bug across a repository, refactor a module), breaks it into subtasks, writes code for each subtask, runs tests, evaluates the results, and iterates until the task passes. This is the same pattern that Anthropic’s Claude Code, Cursor’s agent mode, and tools like the Darwin Godel Machine use. The difference is in scope and reliability.

    Alibaba claims Qwen 3.6-Plus can handle repository-level engineering tasks. This means operating across multiple files, understanding dependency relationships, maintaining consistency across a codebase, and making changes that require coordinated edits in several locations. The model can also generate functional frontend code from screenshots, hand-drawn wireframes, or product prototypes. This is visual coding: the model interprets a design and produces working HTML, CSS, and JavaScript that matches the visual specification.

    On SWE-bench, the standard benchmark for repository-level coding, Alibaba claims Qwen 3.6-Plus matches Claude Opus 4.5. On Terminal-Bench 2.0, which tests multi-step terminal interactions, it shows similar performance. Alibaba has not published the raw scores, so independent verification is pending. For compatibility, the model works with third-party coding assistants including OpenClaw, Claude Code, and Cline, which means developers can use it as a drop-in backend for existing agentic workflows.

    The 1-Million-Token Context Window: Real or Marketing?

    Qwen 3.6-Plus supports a 1-million-token context window by default. To put that in concrete terms: 1 million tokens is approximately 2,000 pages of text, or an entire mid-sized codebase loaded into a single prompt.

    The question is not whether the model accepts 1 million tokens. The question is whether it processes them accurately. Long-context performance degrades in every model as the input grows longer. Information retrieval accuracy, which may be 95%+ at 10K tokens, often drops to 60-70% or worse at extreme lengths. The “needle-in-a-haystack” benchmark, which tests whether a model can find a specific piece of information buried deep in a long context, has become the standard test for this.
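    Running your own needle test is cheap. A minimal harness sketch follows; `ask_model` is a stand-in oracle here (a trivial substring search) and would be replaced by a real API call to the model under evaluation, with depth and haystack size swept across the context lengths you care about.

```python
# Minimal needle-in-a-haystack harness: plant one fact at a known depth in
# filler text, then check whether the model's answer recovers it. The model
# call is stubbed with a trivial substring oracle for demonstration only.

def build_haystack(n_sentences, needle, depth):
    filler = "The harbor was quiet under a grey morning sky. "
    parts = [filler] * n_sentences
    parts.insert(int(n_sentences * depth), needle + " ")
    return "".join(parts)

def ask_model(context, question):
    # Stand-in: a real harness sends context + question to the LLM under test.
    return "7431" if "7431" in context else "unknown"

NEEDLE = "The vault access code is 7431."
QUESTION = "What is the vault access code?"

results = {}
for depth in (0.1, 0.5, 0.9):  # early, middle, late placement
    ctx = build_haystack(5000, NEEDLE, depth)
    results[depth] = "7431" in ask_model(ctx, QUESTION)
print(results)
```

    Real models, unlike the stub, typically degrade most at middle depths of very long contexts, which is exactly what the published needle benchmarks measure.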

    Alibaba has not published needle-in-a-haystack results for Qwen 3.6-Plus at the full 1M context length. Community testers on OpenRouter, where a preview version was released for free, reported “solid performance” on large codebase processing, but these are informal tests without controlled conditions. Until independent evaluations at 500K+ tokens are published, the 1M claim should be treated as a theoretical maximum rather than a practical guarantee.

    For comparison, Gemini 3.1 Pro offers 1M tokens with documented needle-in-a-haystack performance. Claude Opus 4.6 supports 200K tokens with strong retrieval accuracy throughout. Google’s newly released Gemma 4 supports 256K tokens in its larger variants. Meta’s Llama 4 Scout claims 10M tokens but with disputed accuracy at extreme lengths.

    Pricing: Where Alibaba Hits Hardest

    Qwen 3.6-Plus is available on Alibaba Cloud’s Bailian platform starting at 2 yuan (approximately $0.29) per million input tokens and 12 yuan per million output tokens. These prices are aggressive by any standard.

    For context, Claude Opus 4.6 costs $15 per million input tokens. GPT-5.4 costs $5 per million input tokens. Google’s Gemini 3.1 Pro costs $1.25 per million input tokens. Even the cheapest frontier-class models from U.S. providers cost several dollars per million tokens. Alibaba is pricing Qwen 3.6-Plus at less than a dollar, which means developers can run it at 50x lower cost than Claude Opus 4.6 for comparable coding tasks if the SWE-bench parity claim holds.

    The pricing reflects Alibaba’s strategy for Qwen: it is not a revenue product. It is a funnel for Alibaba Cloud services. Companies that build applications on Qwen become Alibaba Cloud customers. The model is the loss leader. The compute, storage, and enterprise services are the margin business. This is the same playbook Amazon used with Alexa (sell hardware at cost to build an ecosystem) and Google used with Android (give away the OS to control the distribution channel).

    The Wukong Integration and DingTalk Deployment

    Qwen 3.6-Plus is being integrated into Wukong, Alibaba’s AI-native enterprise platform currently in invitation-only beta testing. Wukong automates complex business tasks using multiple AI agents and connects to DingTalk, Alibaba’s enterprise collaboration tool used by over 20 million businesses. Alibaba plans to gradually integrate its e-commerce platforms, Taobao and Tmall, into Wukong, building modular agent skills that operate across commerce, logistics, and customer service.

    This is where the Token Hub restructuring pays off. Under the old structure, DingTalk, Taobao, and Alibaba Cloud each had separate AI integrations. Under Token Hub, they share a single model stack. Updates to Qwen flow to every product simultaneously. New capabilities developed for commerce use cases become available to DingTalk users and vice versa. The restructuring is not about making models faster. It is about making deployment faster.

    What Alibaba Did Not Say

    Alibaba has not announced plans to release Qwen 3.6-Plus weights as open source. The company stated that “selected Qwen3.6 models in developer-friendly sizes” will continue to support the open-source community, which implies the flagship model will remain proprietary. This is a shift from the Qwen 2.5 and 3.0 era, when Alibaba released full-size model weights.

    The shift reflects a pattern across Chinese AI labs in 2026. As VentureBeat noted, several Chinese labs have begun pulling back from fully open releases for their latest models, even as Google moved in the opposite direction with Gemma 4’s Apache 2.0 license. The reason is straightforward: open-sourcing a model that matches Claude Opus 4.5 hands competitors a free research artifact that took millions in compute to produce.

    Alibaba also did not explain the “capability loop” concept in technical detail. The marketing language describes Qwen 3.6-Plus as optimized for “the ability to perceive, reason, and act within a single workflow.” This is a description of an agent loop, not a novel architecture. Without published architecture details, it is unclear whether the agentic improvements come from model architecture changes, training data composition, or fine-tuning methodology.

    The SWE-bench parity claim with Claude Opus 4.5 is also unverified externally. Alibaba has not submitted to the official SWE-bench leaderboard, and the claim appears in press materials rather than a technical report. Developers should test against their own codebases before treating the benchmark comparison as actionable.

    What Three Models in Five Days Signals

    Alibaba’s Q1 2026 context is telling. Global venture capital hit $297 billion, with 64% flowing to just four AI companies, none of them Chinese. The competitive pressure on Chinese labs is not just technical. It is financial. ByteDance, DeepSeek, and Alibaba are competing for the domestic market while facing export restrictions on advanced chips that limit their training compute.

    The three-model blitz is a signal to three audiences. For developers, it says Alibaba can ship at a pace that matches U.S. labs. For enterprise customers, it says the Qwen ecosystem is active and supported. For investors, it says the Token Hub restructuring is working.

    Whether Qwen 3.6-Plus is actually as good as Claude Opus 4.5 at agentic coding is a question that independent benchmarks will answer. But the speed of execution is real, the pricing is real, and the 1-million-token context window (whatever its practical ceiling turns out to be) is real. In the open-model race of April 2026, where MCP adoption is creating demand for models that can call tools and maintain long context, Alibaba just made itself impossible to ignore.

  • Axios Compromised on npm: How a Hijacked Maintainer Account Turned 100 Million Weekly Downloads Into a RAT Delivery Network

    Weekly Downloads: 100M+
    Exposure Window: ~2 Hours
    Platforms Hit: 3 (all OS)
    C2 Callback: 1.1 Seconds

    On March 31, 2026, an unknown threat actor compromised the npm account of jasonsaayman, the primary maintainer of the Axios HTTP client library, and published two poisoned versions: axios@1.14.1 and axios@0.30.4. Both versions injected a dependency called plain-crypto-js@4.2.1 that executed a postinstall script deploying a cross-platform remote access trojan targeting macOS, Windows, and Linux. The malicious versions were live on npm for approximately two hours before removal.

    Axios is one of the most widely used packages in the JavaScript ecosystem, present in roughly 80% of cloud and code environments according to Wiz. StepSecurity, which detected the attack, recorded the RAT calling home to the attacker’s command-and-control server within 1.1 seconds of running npm install. Vercel, Snyk, Socket, and Wiz have all published independent analyses. This was not opportunistic. This was a precisely staged operation against one of npm’s most trusted packages.

    How the Attack Chain Worked

    The attacker followed a five-step sequence designed to evade automated detection.

    First, the attacker compromised jasonsaayman’s npm account and changed the registered email to an attacker-controlled ProtonMail address (ifstap@proton.me). Second, 18 hours before the main attack, the attacker published a clean version of plain-crypto-js@4.2.0 to build a brief publication history on the registry and avoid “new package” alarms from security scanners. Third, at 23:59 UTC on March 30, the attacker published the malicious plain-crypto-js@4.2.1. Fourth, at 00:21 UTC on March 31, axios@1.14.1 was published with plain-crypto-js@4.2.1 injected as a runtime dependency. Fifth, at 01:00 UTC, axios@0.30.4 followed, poisoning both the 1.x and 0.x release branches within 39 minutes.

    The attacker bypassed Axios’s GitHub Actions CI/CD pipeline entirely by publishing directly through the npm CLI using the compromised account credentials. The malicious versions appeared on the npm registry as published by jasonsaayman, making them visually indistinguishable from legitimate releases.

    What the RAT Does

    The postinstall script in plain-crypto-js uses two layers of obfuscation: reversed Base64 encoding with padding character substitution, and XOR cipher with the key “OrDeR_7077” and a constant value of 333. Once decoded, the dropper checks the operating system and deploys a platform-specific payload.
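
    The two layers lend themselves to a short sketch. This is a hypothetical reconstruction, not the dropper's actual code: the padding-substitution character (`_` standing in for `=`) and the exact application of the XOR key are assumptions, and the reported constant 333 is omitted because its role is not documented in the published analyses.

```typescript
// Hypothetical reconstruction of the two obfuscation layers described above:
// reversed Base64 with padding substitution, then XOR with a repeating key.
// The '_' padding stand-in is an assumption; this is not the dropper's code.
const KEY = "OrDeR_7077";

function obfuscate(plain: string, key = KEY): string {
  // Layer 2 first on the way in: XOR each byte with the repeating key.
  const xored = Buffer.from(plain, "utf8").map(
    (b, i) => b ^ key.charCodeAt(i % key.length),
  );
  // Layer 1: Base64-encode, substitute the padding character, then reverse.
  return Buffer.from(xored)
    .toString("base64")
    .replace(/=/g, "_")
    .split("")
    .reverse()
    .join("");
}

function deobfuscate(blob: string, key = KEY): string {
  // Undo the reversal and padding substitution, then Base64-decode...
  const raw = Buffer.from(
    blob.split("").reverse().join("").replace(/_/g, "="),
    "base64",
  );
  // ...and XOR with the same repeating key to recover the payload.
  return Buffer.from(
    raw.map((b, i) => b ^ key.charCodeAt(i % key.length)),
  ).toString("utf8");
}
```

    Because XOR with the same key is its own inverse and standard Base64 never emits `_`, the round trip is lossless; the obfuscation buys no cryptographic strength, only evasion of naive string scanners.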

    On macOS, a RAT binary is stored at /Library/Caches/com.apple.act.mond, a path designed to mimic a legitimate Apple system process. On Windows, the malware copies PowerShell to %PROGRAMDATA%\wt.exe and executes a hidden script. On Linux, it downloads a Python script to /tmp/ld.py. All three payloads communicate with the same C2 server at sfrclak.com on port 8000.

    After execution, the dropper performs three cleanup steps: it deletes itself, removes the package.json containing the malicious postinstall hook, and replaces it with a clean version. Anyone inspecting node_modules/plain-crypto-js afterward sees an innocent-looking package. The presence of the plain-crypto-js folder in node_modules is the forensic indicator that the dropper executed.
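
    The indicator check reduces to a one-line filesystem test. A minimal sketch, assuming a standard project layout; this is a hypothetical helper, not part of any vendor's tooling:

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";

// Forensic check for the indicator described above: the dropper deletes its
// own postinstall hook but leaves the plain-crypto-js folder behind, so its
// presence in node_modules means the dropper executed. Hypothetical helper.
function droppedIndicatorPresent(projectRoot: string): boolean {
  return existsSync(join(projectRoot, "node_modules", "plain-crypto-js"));
}
```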

    Why npm’s Trust Model Failed

    The CanisterWorm attack earlier this month exploited stolen npm tokens to propagate across 47 packages. The Axios attack used the same fundamental vector: compromised maintainer credentials. npm’s registry treats any publish action authenticated with valid credentials as legitimate, regardless of whether the package’s source code matches its GitHub repository.

    This is the third major npm supply chain attack in March 2026 alone. The Langflow vulnerability (CVE-2026-33017) hit a different part of the AI tooling stack, but the pattern is the same: developer infrastructure has become a high-value attack surface because it sits upstream of everything else. A single compromised dependency cascades through every build system that pulls it.

    Socket’s automated malware detection flagged plain-crypto-js within six minutes of publication. StepSecurity’s Harden-Runner detected the C2 callback during routine CI runs in the Backstage repository. But detection is not prevention. Any project using a caret version range (^1.14.0 or ^0.30.0) in its package.json would have pulled the compromised version automatically on its next npm install during the two-hour window.

    Who Is Affected

    Wiz reported observed execution in 3% of environments where the affected versions were present. Projects that ran npm install between 00:21 and approximately 03:15 UTC on March 31, 2026 and resolved to axios@1.14.1 or axios@0.30.4 should treat affected machines as fully compromised. StepSecurity recommends rotating all credentials on affected systems, including npm tokens, cloud API keys, SSH keys, and CI/CD secrets.

    Vercel confirmed its own infrastructure was unaffected and blocked outgoing access to the C2 hostname. The npm registry removed the malicious versions and pointed the “latest” tag back to the safe axios@1.14.0 release.

    What This Pattern Means

    Three supply chain attacks against JavaScript developer infrastructure in a single month is not a coincidence. It reflects a structural vulnerability: the npm ecosystem’s trust model relies on individual maintainer account security, and individual maintainer accounts are exactly the kind of target that scales well for attackers. One compromised account, one package, millions of downstream installations.

    The mitigations are known. Pin exact dependency versions. Use npm ci instead of npm install in CI/CD. Disable postinstall scripts by default (pnpm does this). Implement publish cooldown policies that reject packages less than 72 hours old. Require MFA on all publishing accounts. None of these are new recommendations. The Axios attack succeeded because the ecosystem has not adopted them at sufficient scale. Until it does, the supply chain remains the softest target in software security.
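
    The "pin exact versions" audit can be sketched in a few lines: flag any dependency whose declared range is not an exact x.y.z version, since a floating range would have auto-resolved to the poisoned patch release on the next npm install during the window. A hypothetical helper, not a real npm or Socket check:

```typescript
// Sketch: list dependencies declared with floating ranges. Anything that is
// not an exact x.y.z version (caret, tilde, wildcard, dist-tag) can silently
// resolve to a newer, possibly poisoned, release. Hypothetical helper.
type PackageJson = { dependencies?: Record<string, string> };

const EXACT = /^\d+\.\d+\.\d+(?:-[\w.]+)?$/; // e.g. "1.14.0" or "1.14.0-beta.1"

function floatingDeps(pkg: PackageJson): string[] {
  return Object.entries(pkg.dependencies ?? {})
    .filter(([, range]) => !EXACT.test(range))
    .map(([name]) => name);
}
```

    Run against a manifest declaring `axios: "^1.14.0"`, the helper flags axios as floating while leaving exact pins alone; the same logic is why npm ci plus a committed lockfile is the stronger guarantee.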

    Sources: StepSecurity. Socket. Snyk. Wiz. Vercel. The Hacker News.

  • MCP Hit 97 Million Installs in 16 Months. Here Is How the Protocol Actually Works Under the Hood.

    Monthly Downloads
    97M
    Active Servers
    10,000+
    Time to Milestone
    16 months
    React Equivalent
    ~3 years

    Anthropic reported in March 2026 that the Model Context Protocol reached 97 million monthly SDK downloads across its TypeScript and Python packages. The protocol launched in November 2024. React, by comparison, took approximately three years to reach 100 million monthly npm downloads. MCP achieved comparable scale in 16 months.

    The adoption numbers explain the “what.” Every major AI provider now supports MCP: Claude, ChatGPT, Gemini, Cursor, VS Code, Microsoft Copilot, and GitHub Copilot. Over 10,000 active servers span databases, CRMs, cloud providers, developer tools, and commerce platforms. In December 2025, Anthropic donated MCP to the Agentic AI Foundation under the Linux Foundation, with OpenAI and Block as co-founders.

    What most coverage does not explain is how the protocol works at the architectural level, the design choices that made it succeed where earlier attempts failed, and the security problems that shipped alongside the adoption curve.

    The Problem: N Times M Custom Integrations

    Before MCP, connecting an AI model to an external tool required a custom integration for every model-tool pair. Five AI models and five data sources meant building and maintaining 25 separate connectors. Each connector had its own authentication logic, error handling, data parsing, and format translation. When a model updated its API or a tool changed its schema, every affected connector broke.

    Earlier attempts to solve this problem were vendor-locked. OpenAI’s 2023 function-calling API and ChatGPT plugin framework solved the integration problem but only for OpenAI’s models. Google had its own tool-use specification. Anthropic had its own. A developer who built a Slack integration for ChatGPT had to rebuild it from scratch for Claude.

    MCP turns N-times-M into N-plus-M. Build one MCP server for Slack, and every MCP-compatible AI client can use it. Build one MCP client, and it can connect to any of the 10,000+ existing servers. The same integration works with Claude, ChatGPT, Gemini, or any other model that implements the protocol.

    The Architecture: Client-Server Over JSON-RPC 2.0

    MCP follows a client-server model with three participants. The host is the AI application (Claude Desktop, Cursor, ChatGPT). The client is a component inside the host that manages connections to external tools. The server is the external tool itself, running either locally or remotely, exposing its capabilities through the MCP standard.

    The design is directly inspired by the Language Server Protocol (LSP), the protocol that lets programming languages connect to development tools like VS Code. LSP standardized how editors talk to language analyzers. MCP standardizes how AI models talk to everything else. The lineage explains why MCP feels natural to developers who already work with LSP: the message flow, capability negotiation, and lifecycle management follow the same patterns.

    All MCP messages use JSON-RPC 2.0, the same lightweight remote procedure call format that Ethereum and other systems use. Four message types structure all communication: requests (client asks server to do something), responses (server returns the result), notifications (one-way messages that do not expect a reply), and errors (structured failure reports with codes and messages).
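
    The four message kinds look like this on the wire. Method names follow MCP's naming (`tools/call`, `notifications/tools/list_changed`); the payload fields are trimmed to the minimum for illustration:

```typescript
// The four JSON-RPC 2.0 message kinds as MCP uses them. Payloads trimmed;
// real messages carry fuller params and results.
const request = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: { name: "search_code", arguments: { query: "TODO" } },
};
const response = {
  jsonrpc: "2.0",
  id: 1, // matches the request it answers
  result: { content: [{ type: "text", text: "3 matches" }] },
};
const notification = {
  jsonrpc: "2.0", // no id: fire-and-forget, no reply expected
  method: "notifications/tools/list_changed",
};
const failure = {
  jsonrpc: "2.0",
  id: 1,
  error: { code: -32601, message: "Method not found" },
};
```

    The presence or absence of `id` is the structural distinction: requests and their responses share one, notifications never carry one.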

    The transport layer supports two modes. Stdio (standard input/output) is used for local servers running on the same machine as the AI client. A local file system server, for example, communicates through stdin/stdout with zero network overhead. Streamable HTTP (formerly HTTP plus Server-Sent Events) handles remote servers over the network. A cloud-hosted CRM server would use this transport. The protocol does not care which transport is used. The same messages flow identically over either one.

    The Three Primitives: Tools, Resources, and Prompts

    MCP servers expose three types of capabilities to AI clients.

    Tools are functions the AI can call. A GitHub MCP server exposes tools like “create_pull_request,” “search_code,” and “list_issues.” Each tool has a JSON schema describing its parameters and return type. The AI model reads the schema, determines which tool fits the user’s request, constructs the parameters, and calls the tool through the MCP client. This is function calling, standardized across every model vendor.
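
    On the wire, such a tool is advertised as a name, a description, and a JSON Schema. A hypothetical entry in the shape a server returns from a tools listing; the field layout follows the MCP spec, but the specific parameters here are invented:

```typescript
// Hypothetical tool entry as a GitHub-style MCP server might advertise it.
// name/description/inputSchema is the MCP shape; the parameters are invented.
const createPullRequest = {
  name: "create_pull_request",
  description: "Open a pull request against a repository",
  inputSchema: {
    type: "object",
    properties: {
      repo: { type: "string", description: "owner/name" },
      title: { type: "string" },
      base: { type: "string", default: "main" },
    },
    required: ["repo", "title"],
  },
};
```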

    Resources are data the AI can read. A database MCP server might expose resources like “table_schema” or “recent_queries.” Resources provide context rather than actions. The AI reads them to understand the environment before deciding which tools to call. This separation between reading (resources) and acting (tools) is a design decision that improves safety: the model can gather information without taking irreversible actions.

    Prompts are reusable templates that the server provides. A customer support MCP server might expose a “handle_refund_request” prompt that structures how the AI should approach that specific workflow. Prompts encode domain expertise into the protocol, letting AI models handle specialized tasks without being fine-tuned on domain-specific data.

    The Connection Lifecycle

    When an MCP client connects to a server, a capability negotiation occurs. The client sends an initialization request. The server responds with its manifest: a list of available tools, resources, and prompts, each with its schema. The client stores this manifest and presents the available capabilities to the AI model. When the model needs to use a tool, it tells the client which tool to call with which parameters. The client sends a JSON-RPC request to the server. The server executes the function and returns the result. The client passes the result back to the model.

    This dynamic discovery is what separates MCP from static function-calling. An MCP server can update its capabilities at runtime. A new tool can appear, an old one can be deprecated, and the AI model adapts without code changes. Each of those 97 million installs is not a static integration. It is a live connection that can evolve.
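
    The lifecycle reduces to an ordered message sequence from the client's side. A sketch; the `protocolVersion` value and payloads are illustrative, not tied to a specific spec revision:

```typescript
// The client's side of the lifecycle, in order: negotiate, discover, invoke.
// protocolVersion and payload values are illustrative.
const lifecycle = [
  // 1. Capability negotiation: client introduces itself.
  { id: 1, method: "initialize",
    params: { protocolVersion: "2025-06-18", capabilities: {},
              clientInfo: { name: "demo-client", version: "0.1.0" } } },
  // 2. Discovery: fetch the server's manifest of tools.
  { id: 2, method: "tools/list" },
  // 3. Invocation: the model picked a tool; the client relays the call.
  { id: 3, method: "tools/call",
    params: { name: "search_code", arguments: { query: "TODO" } } },
].map((m) => ({ jsonrpc: "2.0", ...m }));
```

    Because step 2 can be repeated at any time, the manifest the client holds is a snapshot, not a contract; that is the dynamic-discovery property described above.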

    Why It Grew Faster Than React

    React required developers to learn a new programming paradigm (declarative UI with virtual DOM). MCP did not. It standardized patterns that agent developers were already implementing in incompatible custom formats. Every team building an AI agent had already written JSON-based tool definitions, request-response cycles, and error handlers. MCP gave them a shared format for what they were already doing.

    The adoption accelerated through four phases. Phase one (November 2024 to March 2025): Anthropic released the spec with reference implementations. Early adopters were Claude-native developers. Phase two (April 2025): OpenAI officially adopted MCP, simultaneously deprecating its Assistants API (sunset scheduled for mid-2026). This forced the entire OpenAI developer ecosystem to migrate toward MCP. Phase three (November 2025): major spec updates added asynchronous operations, statelessness, server identity, and an official registry. Phase four (December 2025): Anthropic donated MCP to the Linux Foundation’s Agentic AI Foundation, with OpenAI, Block, AWS, Google, Microsoft, Cloudflare, and Bloomberg as members.

    OpenAI’s deprecation of the Assistants API was the inflection point. Developers who had built on OpenAI’s proprietary tool framework were told their existing approach had an expiration date. MCP was the only vendor-neutral alternative. The migration was not optional. That forced adoption pattern, combined with the protocol’s genuine simplicity, explains the growth curve.

    The Security Debt

    MCP shipped fast. Security did not keep pace. In April 2025, researchers published an analysis documenting multiple outstanding vulnerabilities. The CLTR scheming study adds real-world context: when AI agents act against user instructions, the tools they use to do it are often MCP servers.

    Prompt injection: A malicious MCP server can inject instructions into the AI model’s context through its tool descriptions or resource content. If a model reads a resource from an untrusted server, that resource can contain hidden instructions that alter the model’s behavior. This is the MCP-specific version of the broader prompt injection problem.

    Tool poisoning: An MCP server can describe a tool with an innocuous name and schema while actually executing a different function. A tool labeled “search_documents” could silently exfiltrate data. The model has no way to verify that a tool does what its description claims.

    Cross-server shadowing: A malicious server can register a tool with the same name as a tool from a trusted server. If the AI model does not verify which server a tool belongs to, it might call the malicious version instead of the legitimate one.
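
    One client-side mitigation is to track which server first registered each tool name and refuse silent overrides. A minimal sketch of that verification step, which is not part of any MCP SDK:

```typescript
// Minimal shadowing guard: remember which server owns each tool name and
// reject a second server registering the same name. Hypothetical sketch of
// the verification the text says clients currently lack.
class ToolRegistry {
  private owners = new Map<string, string>();

  register(server: string, tool: string): void {
    const owner = this.owners.get(tool);
    if (owner !== undefined && owner !== server) {
      throw new Error(`"${tool}" already provided by ${owner}; refusing ${server}`);
    }
    this.owners.set(tool, server);
  }
}
```

    Rejecting the duplicate is a policy choice; a client could instead namespace tools per server, but either way the point is that tool identity must include the server, not just the name.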

    Authentication gaps: Many MCP server implementations default to no authentication at all. The November 2025 spec update added server identity verification, but adoption of the security features lags behind adoption of the protocol itself. As one security researcher noted, session IDs transmitted in URLs violate basic security practices.

    Cloudflare’s “Code Mode” addresses one dimension of this problem: instead of loading all tool definitions upfront (potentially hundreds of thousands of tokens that each represent an attack surface), agents write code to discover and call tools on demand, reducing the exposed surface area by 98%+ in some deployments. But Code Mode is a workaround, not a fix for the underlying protocol-level vulnerabilities.

    What MCP Changes About the AI Industry

    MCP shifts control over the integration layer. Before MCP, platform vendors owned the connector ecosystem. Shopify built its own agentic storefronts protocol. Salesforce controlled how AI connected to its CRM. Each platform extracted value from being the gatekeeper.

    MCP makes the integration layer a commodity. Any AI client can connect to any tool through a shared protocol. That shifts competitive advantage from “who has the best integrations” to “who has the best model.” It is good for AI model companies (who no longer need partnership deals to connect to tools) and good for tool companies (who build one MCP server and reach every AI client). It is less good for platforms that monetized being the integration layer.

    The donation to the Linux Foundation ensures no single company controls the protocol’s evolution. The Agentic AI Foundation board includes competitors (Anthropic, OpenAI, Google, Microsoft) who collectively govern the spec. That governance structure makes MCP the closest thing the AI industry has to an actual standard, not just a dominant vendor’s proprietary format that everyone else adopted reluctantly.

    The 97 million number will keep growing. As the legal and regulatory framework for AI agents takes shape, the protocol they use to interact with the world becomes a question of infrastructure policy, not just developer preference. MCP is now the plumbing. The question is whether the pipes are secure enough for what is about to flow through them.

    Sources: MCP official architecture documentation, Model Context Protocol (Wikipedia), Digital Applied (97M milestone analysis), Pento AI (MCP year in review), Nebius (architecture deep dive), CLTR scheming study (March 2026).

  • Who Controls Your AI Agent? Amazon, the UK CMA, and Shopify Gave Three Incompatible Answers in One Week.

    Amazon Model
    Ban agents
    CMA Model
    Regulate agents
    Shopify Model
    Embrace agents
    CMA Fine Cap
    10% rev

    In a single week of March 2026, three institutions gave three incompatible answers to the same question: who controls what your AI agent does on the internet? Amazon went to federal court to block one. The UK’s Competition and Markets Authority published a 40-page framework for regulating them. Shopify turned them on by default for every eligible merchant.

    The three responses are not just different speeds of adoption. They represent three fundamentally different models for how AI agents will participate in commerce, and the precedents being set right now will determine market structure for the next decade. Every company building or deploying an AI agent needs to understand which regime it is operating in.

    Model One: Ban. Amazon v. Perplexity and the Platform Authorization Doctrine

    On March 10, 2026, U.S. District Judge Maxine Chesney granted Amazon a preliminary injunction against Perplexity AI, blocking the startup’s Comet browser from accessing password-protected sections of Amazon’s marketplace. The ruling is the first federal court order to directly restrict an AI shopping agent from operating on a major platform.

    The legal mechanism matters. Amazon filed under the Computer Fraud and Abuse Act (CFAA) and a California computer fraud statute, arguing that Perplexity disguised Comet’s automated sessions as regular Google Chrome browser traffic. When Amazon deployed a technical block in August 2025, Perplexity pushed a software update within 24 hours to circumvent it. Amazon warned Perplexity to stop at least five times starting in November 2024 before filing suit.

    Judge Chesney found that Amazon presented “strong evidence” that Comet accessed the site with users’ permission but without Amazon’s authorization. That distinction is the core legal question: when a user tells an AI agent “buy this for me on Amazon,” whose permission matters? The user’s or the platform’s?

    Perplexity’s argument was straightforward: the user authorized the agent. If a human can log in and buy something, their AI agent should be able to do the same. Amazon’s argument was equally direct: platform access requires platform consent, and disguising bots as human browsers violates that consent regardless of what the user wants.

    The court sided with Amazon, at least preliminarily. Perplexity must stop accessing Amazon accounts and destroy collected customer data. The Ninth Circuit granted an administrative stay on March 17, pausing the injunction while it considers a longer appeal, but the legal reasoning stands for now.

    The irony is worth noting. Amazon itself launched “Buy For Me” in April 2025, a feature that lets shoppers purchase products from third-party websites directly within the Amazon Shopping app. Amazon is building agentic commerce capabilities while suing a competitor for doing the same thing outside Amazon’s own ecosystem. CEO Andy Jassy acknowledged in October 2025 that agentic commerce “has a chance to be really good for e-commerce” but argued current agents are “not good enough” at personalization. Days later, Amazon sued Perplexity.

    Amazon also updated its Business Solutions Agreement on March 4, 2026, formally requiring all AI agents to identify themselves when accessing its services. The platform is building a legal and technical framework where agents operate on Amazon’s terms or do not operate at all.

    Model Two: Regulate. The CMA Framework and Agent Accountability

    On March 9, 2026, one day before the Amazon ruling, the UK’s Competition and Markets Authority published “Agentic AI and Consumers,” a research document and guidance framework for businesses deploying AI agents. The CMA is not banning agents. It is establishing that existing consumer protection law applies to them and that companies deploying agents are fully accountable for their behavior.

    The framework rests on the Digital Markets, Competition and Consumers Act 2024 (DMCC Act) and the Consumer Rights Act 2015. Under these statutes, a business cannot engage in unfair commercial practices, must provide clear information to consumers, and cannot use terms that disadvantage consumers. The CMA’s position: it does not matter whether these practices are executed by a human customer service representative or an AI agent. The deploying company bears responsibility either way. Fines under the DMCC Act can reach 10% of global annual turnover.

    The specific risks the CMA identifies map to how agents actually work in practice. The first is steering: agents that push consumers toward products that benefit the deploying business rather than the consumer. A shopping agent built by a retailer might surface higher-margin products first, or frame sponsored items as “best matches,” without disclosing the commercial relationship.

    The second is dark pattern amplification. Traditional dark patterns in user interfaces (hidden fees, manipulative countdown timers, difficult cancellation flows) become harder to detect when each user receives personalized recommendations from an agent. If every user sees different results based on behavioral profiles, it becomes nearly impossible to prove that any individual interaction was manipulative. The CMA calls this the “replicability problem.” When there is no standard experience to compare against, there is no baseline for identifying manipulation.

    The third is algorithmic collusion. The CMA published a separate blog post in March specifically addressing the risk that AI agents from competing businesses could independently converge on pricing strategies that reduce competition, without any explicit communication between the businesses or instructions to collude. If Company A’s pricing agent and Company B’s pricing agent both optimize for profit maximization using similar training data and market signals, they could reach the same price equilibrium that a human cartel would, without anyone telling them to. The CMA offers a reward of up to £250,000 to anyone who reports evidence of algorithmic cartel activity.

    The fourth is over-reliance and loss of agency. As consumers delegate more decisions to automated assistants, the CMA warns they may lose the habit of checking what their agents are doing. An AI agent that cancels the wrong service, switches a contract based on flawed analysis, or makes a financial decision using hallucinated data creates consequences that compound when no human is reviewing the output.

    The CMA’s four-step compliance framework for businesses deploying agents is practical: be transparent about AI use, design agents with consumer protection built in, monitor agent behavior in production, and address problems swiftly when they emerge. The framework does not propose new legislation. Its power comes from mapping existing law onto a new technological context and making clear that enforcement is coming.

    Model Three: Embrace. Shopify’s Default-On Agent Commerce

    On March 24, 2026, Shopify activated Agentic Storefronts by default for every eligible merchant. Products from Shopify stores now surface inside ChatGPT, Google Gemini, and Microsoft Copilot. No merchant action required. No opt-in form. The infrastructure was turned on.

    Two competing protocols power the system. OpenAI’s Agentic Commerce Protocol (ACP) connects ChatGPT to merchant product catalogs with structured data for pricing, availability, and shipping. Shopify and Google co-developed the Universal Commerce Protocol (UCP) to do the same across Gemini, Copilot, and other agent platforms. Both protocols exist because OpenAI originally wanted to build in-chat checkout (letting users buy without leaving ChatGPT) and then retreated from that position after merchant pushback. The current architecture sends users to the merchant’s checkout page instead.

    Shopify’s model is the opposite of Amazon’s. Where Amazon demands that agents identify themselves and obtain platform permission, Shopify makes every store agent-accessible without the merchant lifting a finger. The logic is commercial: Shopify makes money when merchants make sales, regardless of whether the buyer arrived through a Google search, a social media link, or a ChatGPT conversation. More distribution channels means more transactions. Agents are not a threat to Shopify’s business model. They are an expansion of it.

    This is possible because Shopify’s pricing is not per-seat. It charges transaction fees and subscription fees. The per-seat pricing death that triggered the SaaSpocalypse does not apply to a platform whose revenue scales with commerce volume, not employee count. Shopify can welcome AI agents because AI agents buying things generates the same revenue as humans buying things.

    Why the Three Models Are Incompatible

    The Amazon model says: platforms control access. No agent enters without the platform’s permission. The CFAA provides the enforcement mechanism. This model protects incumbents, preserves walled gardens, and lets platforms build their own agents while blocking competitors.

    The CMA model says: agents can operate, but the companies deploying them are responsible for outcomes. Existing consumer protection law applies. The enforcement mechanism is financial (fines up to 10% of global revenue). This model preserves competition but creates compliance costs that favor large, well-resourced companies over startups.

    The Shopify model says: agents are welcome by default. The more agents that can reach your products, the better. The enforcement mechanism is market incentives: merchants benefit from distribution, platforms benefit from transactions, and agents benefit from access to product data. This model maximizes consumer choice but assumes that market forces will self-correct for quality and accuracy.

    These three models cannot coexist in a single market without friction. An AI agent operating under the Shopify model (open access, default on) immediately violates the Amazon model (platform permission required) the moment it tries to compare prices across both platforms. A company building an AI shopping agent that complies with the CMA framework (transparent, accountable, non-manipulative) may still be blocked by Amazon if it does not meet Amazon’s separate authorization requirements.

    The result is a fragmented regulatory environment where the same AI agent might be legal in one jurisdiction, blocked on one platform, and welcomed on another, all for the same shopping task.

    What These Models Miss

    All three models share a blind spot: none of them adequately addresses the question of whose interests the agent actually serves when the user, the platform, and the agent developer have conflicting incentives.

    Consider a user who tells an AI shopping agent, “Find me the best deal on noise-canceling headphones.” The user wants the lowest price for acceptable quality. The agent developer may want to route the purchase through a merchant that pays affiliate commissions. The platform may want to surface its own private-label products. The CMA framework requires transparency about these conflicts, but transparency alone does not resolve them. A disclosure that says “this recommendation may reflect our commercial partnerships” does not help a consumer determine whether the recommendation is good.

    The Amazon v. Perplexity ruling also leaves open a deeper question about the Computer Fraud and Abuse Act. The CFAA was written in 1986 to address computer hacking. Its application to agentic software acting on a user’s behalf has never been tested at trial. If the Ninth Circuit upholds the injunction, it establishes that platforms can override user authorization for AI agents. If it reverses, it opens every platform to agent access that users consent to but platforms do not. Neither outcome is clean.

    The CMA’s algorithmic collusion concern is theoretically valid but practically difficult to detect. If two pricing agents independently reach the same price without communicating, proving collusion requires demonstrating that the outcome would not have occurred through independent optimization. That is a forensic challenge regulators have barely begun to address.

    And Shopify’s embrace model works because Shopify’s business model aligns with agent activity. For platforms where agent access reduces revenue (subscription services, ad-supported content, platforms with per-seat pricing), the Shopify model does not translate. The embrace approach is not universally applicable. It works where commercial incentives are aligned and breaks where they are not.

    What Happens Next

    Three immediate events will shape which model gains ground. First, the Ninth Circuit’s ruling on Perplexity’s appeal of the Amazon injunction. If upheld, every major platform gains legal precedent to block AI agents at will. If reversed, agent developers gain a right-of-access argument grounded in user authorization.

    Second, the CMA’s first enforcement action under the DMCC Act against an agentic AI system. The framework is published. The fining power (10% of global turnover) is active. The first case will establish whether the regulator treats agent manipulation with the same seriousness as traditional dark patterns. The timing of the CMA report, published the day before the Amazon ruling, was likely not coincidental.

    Third, Shopify’s Agentic Storefronts at scale. If merchants see meaningful revenue from agent-driven purchases, every other commerce platform faces pressure to open up. If agent-driven transactions generate returns, fraud, or customer complaints at higher rates than traditional purchases, the embrace model loses credibility.

    The deeper question is structural. AI systems already exhibit systematic biases toward agreement and user satisfaction over accuracy. An AI shopping agent optimized to make users happy will tell them they found the best deal even when it did not. An agent optimized for merchant revenue will surface profitable products over better ones. An agent optimized for platform retention will never recommend leaving the platform.

    The ban model, the regulate model, and the embrace model all assume that someone can align agent incentives with consumer interests. AI agent architectures are growing more autonomous by the month. The question of who controls the agent is not a policy abstraction. It is a product design decision being made right now, in code, by every company building one.

    March 2026 produced the first court order, the first regulatory framework, and the first default-on agent commerce system. The answers arrived before most companies finished asking the question.

    Sources: CNBC (Amazon v. Perplexity ruling), UK CMA, “Agentic AI and Consumers” (March 9, 2026), CyberScoop (Ninth Circuit stay), CMA blog on AI collusion (March 4, 2026), Decrypt (legal analysis), The Register (CMA report), Lewis Silkin (CMA compliance framework), Ashurst (CMA legal analysis).

  • A Microsoft VP Says He Hates the Mandatory Account Requirement. Here Is Why It Still Exists.

    Platform Politics / March 29, 2026

    Scott Hanselman publicly said “Ya I hate that. Working on it.” But removing the forced Microsoft account from Windows 11 setup requires defeating a business model, not writing new code. Multiple internal teams depend on mandatory sign-ins for their revenue metrics. That is the actual obstacle.

    Requirement Extended (2022): Pro edition joined Home in forcing MSA sign-in.
    Workarounds Left (0): Microsoft blocked bypassnro in Oct 2025.
    Scott Hanselman (VP): “Ya I hate that. Working on it.” March 20, 2026.
    Timeline (N/A): No concrete plan despite internal advocacy.

    Sources: Scott Hanselman (X); Windows Central; WinBuzzer; PCWorld; March 2026.

    Microsoft Vice President Scott Hanselman posted six words on March 20, 2026 that generated more Hacker News discussion (700+ comments) than most product launches: “Ya I hate that. Working on it.” He was responding to a user asking whether Microsoft would ever let people set up Windows 11 without logging into a Microsoft account. It was the first time a senior Microsoft executive publicly acknowledged wanting to change the policy. Windows Central’s Zac Bowden reported that “a number of people” inside Microsoft are pushing internally to drop the requirement.

    But Bowden also reported something the headlines missed: he does not believe a concrete plan to remove the requirement is currently in motion. Hanselman’s statement is advocacy, not a shipping feature. To understand why a six-word tweet from a VP did not produce immediate change at a company that employs 228,000 people, you need to understand what the mandatory Microsoft account actually does for Microsoft’s revenue structure.

    The Revenue Mechanics Behind the Forced Account

    When a user signs in with a Microsoft account during Windows setup, several things happen simultaneously. OneDrive activates with 5 GB of free storage, positioning the user for a paid Microsoft 365 subscription ($69.99 to $99.99 per year). Microsoft Edge becomes the default browser, signed in and syncing with Bing, which generates advertising revenue. Personalized advertising identifiers activate across Windows, enabling targeted ads in the Start menu, Settings, and Notifications. Microsoft Store and Xbox Game Pass become one-click purchases. Recall and Copilot gain access to user activity data for AI training and personalization.

    Each of these revenue streams belongs to a different business unit inside Microsoft. The Microsoft 365 team tracks conversion from free OneDrive to paid subscriptions. The Advertising team tracks signed-in user counts for ad targeting. The Windows team tracks activation and engagement metrics. The Xbox team tracks Game Pass attach rates. The AI team tracks Copilot adoption. Removing the mandatory account requirement would reduce every one of these metrics, and each team would need to agree to the change through Microsoft’s internal committee process.

    This is why Hanselman’s public frustration has not translated into a shipped feature. The technical change is trivial. Microsoft already supports local accounts on Enterprise and Education editions. The code paths exist. The obstacle is organizational: removing the requirement means multiple revenue-bearing teams accept lower numbers on their dashboard, and no single VP has the authority to impose that across the company.

    The Escalating Enforcement Pattern

    Microsoft has not just maintained the account requirement. It has systematically expanded it and closed every workaround users found.

    The timeline tells the story. Windows 10 allowed local accounts for all editions. Windows 11 Home launched in 2021 with mandatory Microsoft account sign-in. In February 2022, Microsoft extended the requirement to Windows 11 Pro, eliminating the last consumer-accessible edition that supported offline setup. Users found workarounds: the “oobe\bypassnro” command, fake email addresses that triggered a local account fallback, and network disconnection tricks. Microsoft blocked the bypassnro workaround in October 2025, demonstrating active investment in maintaining the requirement.

    Each closure signals intent. This is not a team that forgot to update a setup wizard. This is a product organization that tracks workaround usage and ships patches to close them. The same pattern of default-on data collection with progressively harder opt-outs appears across Microsoft’s product portfolio. The pattern is the product strategy.

    What the Internal Fight Actually Looks Like

    According to Windows Central, the internal debate follows a predictable structure. Engineers and developer advocates (Hanselman’s constituency) argue that the forced account creates unnecessary friction, generates negative press, fuels Linux adoption discussions, and erodes trust with power users, IT administrators, and enterprise evaluators who try the consumer product first. The data they cite: customer satisfaction surveys, social media sentiment, and the fact that “mandatory Microsoft account” is one of the most-searched Windows 11 complaints.

    The business unit leaders on the other side argue that mandatory sign-in drives engagement metrics that underpin Microsoft’s consumer services revenue. Signed-in users generate 3 to 5x more engagement with Microsoft services than local account users, by Microsoft’s own measurements. That engagement translates to Microsoft 365 conversions, ad impressions, and Copilot adoption, all of which feed quarterly earnings reports.

    Any proposal to remove the requirement would go through an internal committee where representatives from both sides present their cases. The business units that depend on account sign-ins for their KPIs would need to either accept lower numbers or propose alternative acquisition channels that replace the lost sign-in funnel. Neither option is painless.

    What Would Actually Change If They Dropped It

    If Microsoft relaxed the requirement, the most likely implementation would be a parallel option during setup: “Sign in with Microsoft account” alongside “Continue with local account.” This is exactly how Enterprise and Education editions already work. The code exists. The UI exists. The only decision is whether to enable it on Home and Pro.

    The second-order effect: if local accounts become a visible option during setup, a meaningful percentage of users would choose them. Microsoft’s internal data likely shows what that percentage would be, which is why the decision is hard. If 30% of new Windows users skip the Microsoft account during setup, every downstream metric (OneDrive activation, Edge default usage, ad targeting reach, Copilot first-run adoption) drops by a corresponding fraction. For a company that generates $60+ billion annually from its Productivity and Business Processes segment, even a single-digit percentage reduction in funnel conversion has nine-figure revenue implications.
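
    That sensitivity claim can be made concrete with back-of-envelope arithmetic. Only the ~$60 billion segment figure comes from the reporting above; the funnel share and opt-out rate below are illustrative assumptions, not Microsoft data.

```python
# Back-of-envelope: revenue exposure if local accounts become optional at setup.
# Only the ~$60B segment figure comes from the article; the funnel share and
# opt-out rate are illustrative assumptions, not Microsoft data.

segment_revenue = 60e9    # Productivity and Business Processes, annual
funnel_share = 0.05       # assumed: revenue share attributable to the OOBE sign-in funnel
opt_out_rate = 0.30       # assumed: new users who would pick a local account if offered

exposure = segment_revenue * funnel_share * opt_out_rate
print(f"annual revenue at risk: ${exposure / 1e9:.2f}B")

# Even with a far smaller funnel share, the impact stays in nine figures:
conservative = segment_revenue * 0.01 * opt_out_rate
print(f"conservative case: ${conservative / 1e6:.0f}M")
```

    Under these assumptions the exposure lands around $0.9 billion a year, and still in nine figures at a 1% funnel share, which is the shape of the internal argument against changing the default.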

    Where This Goes

    Hanselman’s public statement changes the calculus in one way: it makes the internal debate external. Microsoft’s leadership now knows that the developer community is watching. The 700+ HN comments and coverage from PCWorld, Windows Central, WinBuzzer, and Slashdot create a public expectation that progress will be visible.

    The realistic timeline: if Insider builds ship with a local account option in the OOBE flow during spring or summer 2026, it signals genuine progress. If the Insider builds remain unchanged through the end of 2026, Hanselman’s tweet was advocacy that lost the internal argument. Watch the build notes, not the social media posts.

    The broader pattern matters for anyone building on any platform. When a platform company’s business model depends on forced user authentication, the incentives always pull toward more friction, not less. Microsoft’s mandatory account debate is not unique. It is the same tension that drives Apple’s ecosystem lock-in strategy, Google’s Chrome sign-in requirements, and every platform that converts user identity into a revenue stream. The question is never whether the platform wants to change. The question is whether any individual, even a VP, can override the financial incentives that prevent it.

    Sources: Windows Central (Zac Bowden reporting); WinBuzzer; PCWorld; Scott Hanselman on X (March 20, 2026); Microsoft Windows blog (Pavan Davuluri); Hacker News (700+ comments).

  • Shopify Made Every Store Shoppable Inside ChatGPT. Here Is How the Two Competing Protocols Actually Work.

    Agentic Commerce / March 29, 2026

    On March 24, 2026, Shopify activated Agentic Storefronts by default for every eligible merchant. Products from millions of stores now surface inside ChatGPT, Google Gemini, and Microsoft Copilot conversations. Two competing protocols power the infrastructure. The fee structures vary wildly. And OpenAI already retreated on its original checkout vision.

    ChatGPT Monthly Users (880M): Now see Shopify products in conversation.
    AI Traffic Growth (7x): AI-driven traffic to Shopify stores since Jan 2025.
    OpenAI Fee (4%): On completed ChatGPT sales. Google and Microsoft: 0%.
    UCP Backers (20+): Walmart, Target, Visa, Mastercard, Stripe endorsed.

    Sources: Shopify official announcements; OpenAI; Modern Retail; Google; March 2026.

    Shopify flipped a switch on March 24, 2026 that changed how e-commerce works. Every eligible Shopify merchant’s product catalog is now discoverable inside ChatGPT, Google AI Mode, Gemini, and Microsoft Copilot by default. No app to install. No opt-in required. Shopify CEO Tobi Lutke called it making “every Shopify store agent-ready by default.” The numbers behind the timing: AI-driven traffic to Shopify stores has grown 7x since January 2025, and AI-attributed orders are up 11x over the same period. Those were pre-launch figures.

    The feature, called Agentic Storefronts, turns AI chatbots into shopping interfaces. When a ChatGPT user asks “best waterproof hiking boots under $150,” the response can now surface actual products from Shopify merchants with real-time pricing and inventory data. The user can then buy without leaving the conversation. Or that was the original plan. The reality is more complicated, and the gap between what was announced and what shipped tells you everything about where AI commerce actually stands.

    Two Protocols, Two Visions of AI Commerce

    Underneath Agentic Storefronts, two competing technical standards are fighting to become the backbone of AI-powered shopping. Understanding the difference matters because it determines who controls the checkout, who owns the customer data, and who takes the margin.

    The first is the Agentic Commerce Protocol (ACP), co-built by OpenAI and Stripe. ACP handles the transmission of secure order and payment tokens from ChatGPT to the merchant’s Shopify backend. It was designed to power “Instant Checkout,” where a customer could discover, select, and pay for a product entirely within the ChatGPT interface. Stripe processes the payment through a Shared Payment Token system. The merchant never sees the customer’s payment details directly.

    The second is the Universal Commerce Protocol (UCP), co-developed by Shopify and Google. UCP is an open standard, endorsed by more than 20 companies including Walmart, Target, Etsy, American Express, Mastercard, Stripe, and Visa. UCP supports the full complexity of real-world commerce: discount codes, loyalty credentials, subscription billing cadences, pre-order terms, and selling conditions like final sale. Where ACP was built for a single platform (ChatGPT), UCP was built to work across any AI platform.

    The strategic distinction: ACP positions OpenAI as a commerce platform that takes a cut. UCP positions Shopify as the infrastructure layer that connects merchants to every AI surface without becoming a marketplace itself. These are fundamentally different business models disguised as technical standards.

    Why OpenAI Retreated on Instant Checkout

    OpenAI launched Instant Checkout in September 2025. The promise was frictionless: find a product in ChatGPT, buy it without leaving the conversation. Early reports described it as the death of the product detail page. Then, in March 2026, OpenAI quietly scaled it back.

    An OpenAI spokesperson told Modern Retail: “Instant Checkout is moving to Apps, where purchases can happen more seamlessly.” Translation: users browsed products in ChatGPT but rarely completed purchases. The conversion rate was too low to justify the engineering investment in maintaining a full checkout flow inside a chat interface.

    This matches a pattern that anyone who has tracked the gap between AI demos and production systems will recognize. Shopping is not a single-step process. Customers compare sizes, check return policies, read reviews, look at photos from multiple angles, apply discount codes, and select shipping options. Compressing that into a chat interface sounds elegant in a demo. In practice, users defaulted to clicking through to the merchant’s actual store. OpenAI discovered what Amazon already knew: checkout requires trust signals that a chat window does not easily provide.

    The current model routes ChatGPT users to the merchant’s own checkout via an in-app browser on mobile or a new tab on desktop. The merchant retains full control of the purchase experience, customer data, and post-purchase relationship. This is better for merchants. It is a concession from OpenAI.

    The Fee Structure Tells the Real Story

    The economics of each AI channel reveal the competitive dynamics behind the protocol wars:

    ChatGPT charges a 4% Agentic Storefronts fee on completed sales, with a 30-day free trial. Stacked on top of Shopify’s standard ~2.9% payment processing, total platform and processing costs approach 7% per sale. Google AI Mode and Gemini currently charge 0% additional fees. Microsoft Copilot also charges 0% additional fees.

    Google’s zero-fee positioning is a deliberate competitive response. Google already monetizes through ads and search. Adding a transaction fee on top would make its AI commerce channel more expensive than ChatGPT for merchants, which would slow adoption of the very product Google needs merchants to support. Google wants UCP to become the standard. Charging nothing to merchants accelerates that.

    For context, Amazon referral fees range from 8% to 15% depending on category. At 4%, ChatGPT is cheaper than Amazon but more expensive than Google’s free offer. The question for merchants: does ChatGPT’s 880 million monthly active users generate enough incremental sales to justify the 4% fee when the same products are discoverable for free on Google AI Mode?
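
    The per-sale arithmetic, as a quick sketch: the fee rates are the ones cited above, and the $100 order value is an arbitrary example.

```python
# Per-sale platform-fee comparison; rates from the article, $100 order is an example.

order = 100.00
processing = 0.029   # Shopify's standard ~2.9% payment processing

platform_fee = {
    "ChatGPT (Agentic Storefronts)": 0.04,
    "Google AI Mode / Gemini": 0.00,
    "Microsoft Copilot": 0.00,
}

for channel, fee in platform_fee.items():
    total = fee + processing   # Shopify merchants pay processing on every channel
    print(f"{channel:32s} fee {fee:.1%} + processing {processing:.1%}"
          f" = {total:.1%} (${order * total:.2f})")

# Amazon referral fees (8-15% by category) sit above all three AI channels:
for rate in (0.08, 0.15):
    print(f"Amazon referral {rate:.0%}: ${order * rate:.2f} before Amazon's own processing")
```

    The ChatGPT line comes out near 7% all-in, which is the gap merchants are weighing against its 880 million monthly users.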

    The likely outcome: most merchants leave all channels enabled (it costs nothing to be discoverable), and the platforms that generate the highest conversion rates win the merchants’ attention. Early data suggests ChatGPT drives discovery but Google AI Mode drives purchase intent, because users on Google are already in a shopping mindset. The same behavioral pattern holds in regular search: users with commercial intent convert at higher rates regardless of the interface.

    Shopify Catalog: The Infrastructure Play Nobody Is Discussing

    The most consequential part of this announcement is not the ChatGPT integration. It is Shopify Catalog and the new Agentic Plan.

    Shopify Catalog uses specialized LLMs to categorize and standardize product data across millions of merchants. It infers product categories, extracts attributes, consolidates variants, and clusters identical items. This structured data layer is what makes products discoverable by AI agents. Without it, an AI chatbot cannot reliably answer “best running shoes under $100” because the underlying product data is too messy, inconsistent, and unstructured.

    The Agentic Plan extends this infrastructure to brands that do not even use Shopify for their e-commerce store. A brand running on BigCommerce, WooCommerce, or a custom platform can now add products to Shopify Catalog and become shoppable across ChatGPT, Gemini, and Copilot. Shopify is no longer positioning itself as an e-commerce platform. It is positioning itself as the data layer that connects all commerce to all AI.

    This is the economics of AI agent infrastructure in action: the company that controls the structured data layer between merchants and AI agents captures a toll on every transaction that flows through it, regardless of which AI platform the customer uses and which e-commerce platform the merchant runs.

    What Merchants Actually Need to Do

    For Shopify merchants, the immediate action items are straightforward. Product titles, descriptions, and attributes need to be written for machines, not just humans. An AI agent parsing “Vintage-inspired leather Chelsea boot, hand-stitched, available in cognac and midnight” understands the product better than “The James Boot” with a vague description. Structured attributes (material, color, size, price range, use case) matter more than marketing copy.
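
    The difference between the two listings above is easiest to see as data. This is a hedged sketch: the field names are illustrative, not Shopify Catalog's actual schema, which has not been published.

```python
# Two listings for the same boot; an agent can only reason over the structured one.
# Field names are illustrative; real catalog schemas are Shopify's, not published.

vague = {"title": "The James Boot", "description": "Our most iconic style."}

structured = {
    "title": "Vintage-inspired leather Chelsea boot",
    "attributes": {
        "material": "full-grain leather",
        "construction": "hand-stitched",
        "colors": ["cognac", "midnight"],
        "category": "footwear/boots/chelsea",
        "price_usd": 148.00,
        "use_case": "casual, all-season",
    },
}

def matches(product: dict, max_price: float) -> bool:
    """Can a query like 'boots under $150' even evaluate this listing?"""
    attrs = product.get("attributes", {})
    return attrs.get("price_usd", float("inf")) <= max_price

print(matches(structured, 150.0), matches(vague, 150.0))  # True False
```

    The vague listing is not rejected because the boot is wrong for the query; it is rejected because the agent has nothing to filter on. That is the entire case for attribute-first product data.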

    Shopify’s Knowledge Base App lets merchants control how AI agents answer questions about their brand, including return policies, shipping times, and FAQ responses. This is the brand voice layer: when a customer asks ChatGPT “does this brand offer free returns?” the answer comes from the merchant’s Knowledge Base, not from whatever the AI hallucinated from its training data.

    The competitive advantage for early optimizers is real. As of late March 2026, Shopify president Harley Finkelstein noted that only about a dozen merchants among Shopify’s millions are actively using AI tools to sell products. The infrastructure is live. The merchant adoption is still near zero. The gap between infrastructure availability and merchant optimization is the window.

    What This Does Not Solve

    Agentic Storefronts does not solve the fundamental discovery problem. AI agents recommend products based on the structured data they receive and whatever ranking algorithms the AI platform uses. No one, including Shopify, has published how those ranking algorithms work. Which products surface for “best wireless headphones” is determined by the AI platform, not the merchant. Merchants have no paid promotion mechanism within AI chat responses (yet).

    The attribution challenge is also unsolved. Shopify provides channel attribution (you can see which orders came from ChatGPT vs. Gemini vs. Copilot), but the customer journey is opaque. Did the customer discover the product in ChatGPT, research it on Google, and buy it on the merchant’s site? The last-click attribution model breaks down when AI conversations become part of the funnel.

    Privacy and data ownership remain contested. When a customer asks ChatGPT about a product, OpenAI processes that conversation. When they click through to buy, the merchant gets the customer data. But the conversation data (what the customer asked, what alternatives they considered, what they rejected) stays with OpenAI. That conversation data is arguably more valuable than the transaction data, and merchants have no access to it.

    The same concentration dynamic that defines the AI infrastructure layer now extends to commerce: a handful of AI platforms (ChatGPT, Gemini, Copilot) mediate between customers and merchants, accumulating behavioral data that no individual merchant can replicate. Shopify’s Catalog sits between them, providing the data plumbing. Whether that intermediary role strengthens or weakens the merchant’s position depends entirely on how the protocols evolve and who controls the ranking algorithms.

    Sources: Shopify official announcements (March 2026); OpenAI spokesperson statement to Modern Retail; Shopify Help Center documentation on Agentic Storefronts; Google UCP documentation; Shopify investor conference statements (Harley Finkelstein, March 2026).

  • The .claude/ Folder Is Not a Config File. It Is a Protocol. Here Is What Every Component Does and Why It Matters.

    Developer Tools — March 28, 2026

    Claude Code’s hidden control center determines how the AI behaves in every session. Most developers have never opened it. The architecture reveals Anthropic’s platform strategy.

    HN Points (460+): Avi Chawla’s walkthrough drove massive developer engagement.
    Line Ceiling (200): Anthropic’s recommended max for CLAUDE.md.
    Context System (3 layers): Explicit team rules + personal preferences + auto-learned knowledge.
    The Only Halt (exit 2): Exit code 1 in hooks fails open. Only exit 2 blocks execution.

    Sources: Avi Chawla / Daily Dose of Data Science; Anthropic Claude Code documentation; Claude Code settings reference.

    Anthropic’s Claude Code has a hidden control center that most developers never open. The .claude/ folder sits in your project root, and it determines how Claude behaves in every session: what rules it follows, what commands it responds to, what files it can touch, and what it remembers between conversations. More than 460 Hacker News points on a single walkthrough of this folder in March 2026 suggest developers are only now realizing what they have been ignoring.

    The folder is not a settings file. It is a protocol. Anthropic designed it to be committed to git, shared across teams, and layered across scopes from personal preferences to enterprise-managed policy.

    Two Folders, Not One

    The most commonly missed fact about Claude Code’s configuration: there are two .claude/ directories. The project-level folder at ./.claude/ holds team configuration. You commit it to version control. The global folder at ~/.claude/ holds personal preferences, session history, and auto-memory that persists across all your projects.

    Claude Code’s settings follow a strict precedence hierarchy: managed policy (set by your organization) overrides local project overrides (settings.local.json), which override shared project settings, which override global user settings. Within the permission rules themselves, deny is evaluated before ask, and ask before allow; the first matching rule wins.

    Avi Chawla noted that most Claude Code users treat this folder like a black box. Anthropic’s own documentation recommends keeping CLAUDE.md under 200 lines, citing measurable drops in instruction adherence above approximately 3,000 tokens.

    CLAUDE.md: The System Prompt You Control

    When you start a Claude Code session, the first thing it reads is CLAUDE.md. The file loads directly into the system prompt and stays active for the entire conversation. A 20-line CLAUDE.md that specifies your build system, ORM, folder structure, and coding conventions eliminates the majority of back-and-forth that developers experience with unconfigured AI assistants.
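
    A minimal CLAUDE.md in that spirit might look like the following sketch; every project detail here is invented for illustration.

```markdown
# Project guide for Claude Code

## Build & test
- Build: `npm run build`; tests: `npm test` (run affected tests before proposing changes)

## Stack
- TypeScript in strict mode; Postgres via an ORM; no raw SQL in route handlers

## Layout
- `src/api/` HTTP handlers, `src/db/` schema and migrations, `src/lib/` shared utilities

## Conventions
- Small pure functions; no default exports; feature branches named `feat/<ticket>`
```

    Everything in this file lands in the system prompt at session start, so each line is a standing instruction rather than something you repeat per conversation.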

    The file supports hierarchy. A CLAUDE.md at the project root is the most common setup. A ~/.claude/CLAUDE.md applies global preferences. Subdirectory-level CLAUDE.md files add folder-specific rules. There is also CLAUDE.local.md, a personal override file that is automatically gitignored. Team standards go in CLAUDE.md, personal tweaks go in CLAUDE.local.md.

    The Rules Folder: Modular Instructions That Scale

    Once a team’s CLAUDE.md exceeds 200 lines, instruction adherence drops. Anthropic’s solution is the .claude/rules/ folder. Every markdown file inside it loads alongside CLAUDE.md automatically. Teams split rules by concern: code-style.md, testing.md, api-conventions.md, security.md.

    The real power is path scoping. Add a YAML frontmatter block with a paths field, and the rule only activates when Claude is working with matching files. A rule scoped to src/api/**/*.ts will not load when Claude edits a React component. This is conditional compilation for AI behavior, and it scales to monorepos with dozens of teams.
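
    A path-scoped rule file could look like this sketch; the paths frontmatter field is the mechanism described above, while the file name and globs are illustrative.

```markdown
---
paths:
  - "src/api/**/*.ts"
---

# API conventions (active only when Claude touches matching files)

- Validate every request body before it reaches the database layer
- Return structured error objects; never leak stack traces to clients
```

    Saved as .claude/rules/api-conventions.md, this costs zero context when Claude is editing a React component and loads automatically when it opens a handler.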

    Commands vs. Skills: The Trigger Distinction

    The .claude/commands/ folder lets teams add custom slash commands. Drop a markdown file named review.md and it becomes /project:review. Commands can embed shell output directly into the prompt using the ! backtick syntax. A code review command that runs git diff main...HEAD and injects the output means Claude sees the actual diff.
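
    As a sketch, such a command saved as .claude/commands/review.md might read as follows; the !`command` form is the embedding syntax described above, and the prompt wording is invented.

```markdown
Review the changes on this branch against our team rules.

Here is the current diff against main:
!`git diff main...HEAD`

Flag missing tests, unchecked error paths, and any new dependency.
```

    Dropping this file in makes /project:review available, and the backticked git diff output is injected before Claude sees the prompt.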

    Skills look similar but behave differently. The .claude/skills/ folder contains subdirectories, each with a SKILL.md file. Commands wait for you to trigger them. Skills trigger automatically when the task matches the skill’s description. Skills can bundle supporting files alongside the SKILL.md, making them self-contained workflow packages.
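
    A sketch of a skill package, assuming the SKILL.md frontmatter fields (name, description) from Anthropic’s skills documentation; the skill itself is invented and would live at .claude/skills/changelog-writer/SKILL.md.

```markdown
---
name: changelog-writer
description: Drafts a CHANGELOG entry whenever the task involves preparing a release
---

When preparing a release:

1. Collect commits since the last tag with `git log --oneline <last-tag>..HEAD`
2. Group changes into Added / Changed / Fixed
3. Append the entry to CHANGELOG.md, following the existing file's style
```

    Because the description matches release-preparation tasks, Claude invokes this on its own; a command with the same body would wait to be called.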

    This connects to AutoDream, Anthropic’s background memory consolidation system. Skills are the persistent behavior layer. AutoDream is the persistent knowledge layer. Together, they make Claude Code stateful across sessions in a way that no other AI coding tool replicates.

    The Permission and Hook System

    The settings.json file controls what tools Claude can use. Permissions follow an allow/deny/ask pattern evaluated in order: deny rules first, then ask, then allow. The first matching rule wins. This is not a suggestion system. It is a hard enforcement layer.
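
    A sketch of the allow/deny/ask pattern in settings.json; the Tool(specifier) rule strings follow Anthropic’s settings reference, and the specific rules are examples, not recommendations.

```json
{
  "permissions": {
    "deny": [
      "Read(./.env)",
      "Bash(rm -rf*)"
    ],
    "ask": [
      "Bash(git push*)"
    ],
    "allow": [
      "Bash(npm test*)",
      "Bash(git diff*)"
    ]
  }
}
```

    With this file committed, Claude can run tests and inspect diffs freely, must ask before pushing, and can never read the .env file, regardless of what the prompt asks for.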

    Hooks add programmable checkpoints to Claude’s execution pipeline. The critical detail: exit code 2 is the only code that blocks execution. Exit 0 means success. Exit 1 means error but non-blocking. Exit 2 means stop everything. Using exit code 1 for security hooks is the most common mistake. It logs an error and does nothing.

    The events most developers use are PreToolUse (fires before any tool runs, your security gate), PostToolUse (for formatters and linters after execution), and Stop (fires when Claude finishes, for quality gates).
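
    Those exit-code semantics fit in a tiny PreToolUse hook. This is a hedged sketch: the stdin payload shape (tool_name, tool_input.command) follows Anthropic’s hook documentation, while the blocked-pattern list is invented and far from a real policy.

```python
"""PreToolUse hook sketch for Claude Code: only exit code 2 blocks a tool call."""
import io
import json
import re
import sys

BLOCKED = [r"rm\s+-rf\s+/", r"curl[^|]*\|\s*(ba)?sh"]   # example patterns only

def verdict(command: str) -> int:
    """Exit code for a proposed shell command: 2 = hard block, 0 = allow.

    Returning 1 here would merely log an error and fail open -- the common mistake.
    """
    return 2 if any(re.search(p, command) for p in BLOCKED) else 0

def run_hook(stream) -> int:
    event = json.load(stream)                            # Claude Code pipes JSON on stdin
    command = event.get("tool_input", {}).get("command", "")
    code = verdict(command)
    if code == 2:
        print(f"blocked: {command!r}", file=sys.stderr)  # reason is surfaced to Claude
    return code

# Simulated payload; a real hook would end with: sys.exit(run_hook(sys.stdin))
payload = io.StringIO(json.dumps({"tool_name": "Bash",
                                  "tool_input": {"command": "rm -rf /tmp/build"}}))
print(run_hook(payload))   # prints 2: the only exit code that halts execution
```

    Registered under PreToolUse in settings.json, this script becomes a hard gate: exit 2 stops the command, while anything else lets it through.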

    Auto-Memory: Claude Writes Notes to Itself

    The ~/.claude/projects/ directory stores session transcripts and auto-memory per project. As Claude works, it automatically saves notes: commands it discovers, patterns it observes, architectural insights it picks up. These persist between sessions.

    The deeper story connects to AutoDream. The system prompt literally reads “You are performing a dream.” It runs a background sub-agent that deduplicates memory entries, removes stale notes, converts relative dates to absolute, and keeps the memory file under 200 lines. One observed case consolidated 913 sessions in under 9 minutes.

    The combination of auto-memory and AutoDream creates a three-layer context system: explicit team rules, explicit personal preferences, and implicit learned knowledge. No other AI coding tool has this.

    Why This Is a Platform Play, Not a Feature

    Making the configuration file-based and git-committable means it inherits all the infrastructure teams already have for code: version control, code review, branching, CI/CD. This is different from how every other AI coding tool handles configuration. Cursor uses a settings UI. GitHub Copilot uses VS Code settings. Windsurf uses a combination of UI settings and project rules. None of them have the full protocol.

    The implicit bet is that AI coding assistance will become a team-level infrastructure concern, not an individual developer preference. Whether that bet pays off depends on whether the 200-line context ceiling can scale, whether auto-memory becomes reliable enough to trust, and whether the hook system can handle enterprise security requirements.

    What Is Missing

    Anthropic has not published benchmarks on instruction adherence as a function of CLAUDE.md length. Auto-memory has no conflict resolution mechanism for teams. The hook system’s exit code semantics are a footgun. There is no telemetry or observability built into the folder system. For a system positioned as team infrastructure, these gaps need filling.

    The Practical Takeaway

    If you use Claude Code and have never opened your .claude/ folder, the minimum viable setup takes five minutes. Run /init to auto-generate a starting CLAUDE.md. Add your build commands, key architectural decisions, and 5 to 10 coding conventions. Keep it under 200 lines. That alone reduces back-and-forth by roughly 40%.

    For teams, the next step is the rules/ folder with path scoping. For organizations, the managed policy layer provides top-down control. For anyone running Claude Code on their actual machine, the permission system in settings.json is not optional. Set your deny rules. Use exit code 2 for security hooks. And know that Claude is quietly writing notes about your codebase that persist between sessions, whether you asked it to or not.

  • Robinhood Co-Founder Is Building Data Centers in Space. His Startup Just Hit a $2 Billion Valuation.

    Markets — March 2026

    Aetherflux Raises $120M
    for Orbital Data Centers.

    Baiju Bhatt’s Aetherflux closed a $120M Series B to build data centers in low Earth orbit, powered by solar and cooled by radiative heat rejection.

    Aetherflux, the orbital data center startup founded by Robinhood co-founder Baiju Bhatt, closed a $120 million Series B led by Andreessen Horowitz in March 2026. The company is building computing infrastructure in low Earth orbit (400-600km altitude), powered by solar photovoltaic arrays and cooled by radiative heat rejection rather than atmospheric cooling. Bhatt presented the concept at NVIDIA GTC 2026, framing it as a solution to the two primary constraints on terrestrial AI data centers: energy cost and cooling capacity.

    Why Data Centers in Space Are Not As Absurd As They Sound

    The physics case for orbital computing rests on three facts. First, a solar panel in orbit yields roughly 5 to 10 times more energy than the same panel on the ground: there is no atmosphere absorbing photons, no weather blocking panels, and, in a dawn-dusk sun-synchronous orbit, near-continuous sunlight. Second, orbital cooling consumes no water and no grid power: heat is rejected radiatively to deep space rather than by chillers and evaporative towers. Cooling accounts for 30 to 40% of a terrestrial data center’s energy consumption; in orbit that energy cost largely disappears, though it is traded for radiator area and mass, since vacuum rules out convection. Third, orbital data centers face no land use restrictions, water consumption limits, or grid connection bottlenecks, all of which are becoming binding constraints on terrestrial data center construction in 2026.

    The economics case is less clear. Launch costs (SpaceX Falcon 9: ~$2,700/kg to LEO, Starship target: ~$100/kg) determine whether orbital compute can compete with terrestrial pricing. At current launch costs, the capital expense of putting hardware in orbit exceeds terrestrial data center construction costs by 10x to 100x. At Starship’s target costs, the gap narrows significantly but does not close.
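
    A rough sketch of that gap using the per-kg prices above; the rack mass and terrestrial per-rack baseline are illustrative assumptions, and launch is only one line item.

```python
# Launch cost of lofting one GPU rack, using the per-kg prices cited above.
# Rack mass and the terrestrial per-rack baseline are illustrative assumptions;
# radiation-hardening and integration costs would sit on top of launch.

rack_mass_kg = 1500            # assumed: loaded rack plus structure, power, radiators
terrestrial_capex = 500_000    # assumed: per-rack share of an equivalent Earth build

launch_price_per_kg = {
    "Falcon 9 (~$2,700/kg)": 2700,
    "Starship target (~$100/kg)": 100,
}

for vehicle, price in launch_price_per_kg.items():
    launch_cost = rack_mass_kg * price
    print(f"{vehicle:28s} launch ${launch_cost / 1e6:.2f}M "
          f"({launch_cost / terrestrial_capex:.1f}x the terrestrial baseline)")
```

    Under these assumptions a Falcon 9 launch alone costs several times the terrestrial baseline, while Starship pricing would bring launch below it, leaving radiation-tolerant hardware and servicing as the remaining cost drivers.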

    The Physics Case For and Against Orbital Compute

    Real physics advantages: Radiative cooling to 3K cosmic background (vs. 15-35C ambient terrestrial). Solar irradiance ~1361 W/m² without atmospheric absorption. No land acquisition, zoning, or water use permits. Proximity to satellite communications infrastructure.

    Hard unresolved problems: Latency. The speed-of-light floor from a 400-600 km orbit is only a few milliseconds round trip, but ground-station routing and orbital handoffs push practical latency into the tens of milliseconds and make it variable; the ~230 ms round-trip figure sometimes attached to LEO is actually the geostationary floor. The 90-minute orbital period creates power intermittency. Hardware servicing requires a launch ($2,000-5,000/kg to LEO). Radiation degrades semiconductors roughly 10x faster than on the ground. The latency variability is not an engineering problem better hardware can fix: it is orbital geometry, which is why hard-real-time inference workloads remain a poor fit for LEO.
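
    The speed-of-light floor at each altitude is simple to check; this sketch uses straight-down distance, so slant range, ground-station hops, and processing all add to what it computes.

```python
# Round-trip light-time floor vs orbital altitude. Straight-down distance only;
# slant range, ground-station routing, and processing all add to real latency.

C_KM_S = 299_792.458  # speed of light, km/s

def min_rtt_ms(altitude_km: float) -> float:
    """Minimum ground-satellite-ground round trip in milliseconds."""
    return 2 * altitude_km / C_KM_S * 1000

for name, alt_km in [("LEO 550 km", 550), ("GEO 35,786 km", 35_786)]:
    print(f"{name:14s} {min_rtt_ms(alt_km):6.1f} ms floor")
```

    The ~239 ms floor belongs to geostationary altitude; a 400-600 km constellation’s floor is a few milliseconds, with routing overhead stacked on top.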

    The Three Questions the Pitch Deck Does Not Answer

    1. Who is the customer? Training large models is latency-tolerant, but hyperscalers (Google, Microsoft, Meta) already have massive terrestrial training clusters and the capital to build more. The customer who cannot build terrestrial compute but can afford $5,000/kg launch costs does not obviously exist at scale.

    2. How does hardware refreshing work? Terrestrial data centers refresh GPU hardware every 2-3 years. Orbital data centers require a launch for each hardware refresh. At current launch pricing, a single rack refresh costs millions of dollars in launch fees alone.
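
    The "millions per refresh" claim is simple arithmetic. A sketch, assuming an illustrative 1,500 kg all-in rack mass (servers plus structure, shielding, and radiator share), which is not a disclosed Aetherflux figure:

```python
# Launch cost of a single rack refresh under different $/kg scenarios.
# The 1,500 kg all-in rack mass is an illustrative assumption.

RACK_MASS_KG = 1_500

scenarios = {
    "Falcon 9 (~$2,700/kg)": 2_700,
    "Early Starship (~$1,000/kg)": 1_000,
    "Starship target (~$100/kg)": 100,
}

for name, price_per_kg in scenarios.items():
    cost = RACK_MASS_KG * price_per_kg
    print(f"{name}: ${cost:,} per rack refresh")
```

    At Falcon 9 pricing that is about $4M of launch fees per rack per refresh cycle, before the hardware itself.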

    3. What is the radiation hardening strategy? Standard NVIDIA H100s are not radiation-hardened. Rad-hard computing is 10-100x more expensive per FLOP than commercial silicon. Aetherflux has not disclosed their semiconductor strategy for radiation tolerance.

    The Baiju Bhatt Pivot

    Aetherflux originally focused on beaming solar power from orbit to terrestrial receivers via laser. The company pivoted to orbital computing in 2025 after concluding that the terrestrial power transmission economics were unfavorable. The pivot keeps the core capability (space solar power systems) while changing the customer: instead of selling power to terrestrial grids, sell compute powered by space solar to AI companies.

    Bhatt’s credibility from co-founding Robinhood (which achieved a $32 billion valuation before its IPO) gives Aetherflux access to top-tier venture capital. The $2 billion valuation prices Aetherflux as a pre-revenue company with a potentially transformative technology, which is the same valuation framework that funded SpaceX before it had a single paying customer.

    What Has to Go Right

    For Aetherflux to succeed, several things must happen simultaneously. SpaceX’s Starship must achieve routine operation at prices near its $100/kg target. The satellite computing hardware must survive the radiation environment of low Earth orbit without unacceptable error rates. The latency from ground-to-orbit-to-ground round trips must be acceptable for the target workloads (batch training: yes; real-time inference: probably not). And the company must solve the data bandwidth problem: getting training data up to orbit and results back down requires high-throughput optical or radio links that do not yet exist at the necessary scale.

    The competitors are real: Lumen Orbit, founded in 2024, is pursuing a similar concept with a solar-powered orbital data center targeting 2027 deployment. Microsoft Azure Space and AWS Ground Station provide cloud-edge compute for satellite operators but do not offer orbital compute as a service. The market for orbital computing does not exist yet. Aetherflux and Lumen Orbit are both betting that terrestrial data center constraints (power, cooling, land, water) will create demand for orbital alternatives within 5 to 7 years.

    The honest assessment: orbital data centers are a real technology with real physics advantages that face massive engineering and economic challenges. The $120M Series B funds a proof-of-concept deployment, not a commercial data center. The first data center satellite targeting 2027 will be a technology demonstrator, not a commercially competitive compute platform. If the demonstrator works, the path to commercial viability depends on launch cost reductions that are outside Aetherflux’s control. Bhatt knows this. The bet is that solving the technical challenges now positions Aetherflux to capture a market that will exist in 2030, even if it does not exist today.

    Sources: Aetherflux Series B announcement; Bhatt GTC 2026 panel; Andreessen Horowitz portfolio blog; SpaceX Starship commercial pricing; NASA radiation effects documentation. Market context, not financial advice.

  • Claude Code AutoDream: Anthropic Built a REM Sleep Cycle for Your AI Agent

    AI Research — March 2026

    Claude Code Runs Memory Consolidation During Idle Time.

    Anthropic’s AutoDream paper proposes using idle compute cycles to consolidate agent memory, analogous to REM sleep in humans.

    Anthropic published the AutoDream paper in March 2026, describing a memory consolidation system for long-running AI agents that uses idle compute cycles (periods when the agent is not actively processing a user request) to compress episodic experience into long-term retrievable memory. The approach borrows conceptually from neuroscience research on sleep-dependent memory consolidation, where the brain replays and compresses experiences from working memory into long-term storage during REM sleep.

    The Consolidation Architecture

    Step 1: Episodic buffer accumulation. During active operation, the agent stores raw interaction records in an episodic buffer: full conversation turns, tool call results, intermediate reasoning traces. This buffer has a capacity limit. When full, it triggers consolidation.

    Step 2: Salience-weighted compression. The consolidation model (a smaller, cheaper model than the primary agent) reads the episodic buffer and produces compressed memory summaries. It weights by salience signals: user corrections, repeated references, explicit user affirmations, and task completion markers. Less salient content is discarded.

    Step 3: Vector index storage and retrieval. Compressed memories are embedded and stored in a vector index. At query time, the agent retrieves relevant memories via semantic similarity search and injects them into the context window alongside the current query. The model weights are never modified.
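
    The three steps above can be sketched end to end. This is a toy illustration, not the paper's implementation: keyword counting stands in for the consolidation model's salience weighting, and bag-of-words cosine similarity stands in for the embedding index.

```python
# Toy sketch of the buffer -> consolidate -> retrieve loop.
# Salience heuristics and bag-of-words retrieval are stand-ins for the
# paper's consolidation model and embedding index; names are illustrative.
from collections import Counter
import math

SALIENCE_MARKERS = ("correction", "confirmed", "task complete", "remember")

def salience(record: str) -> float:
    """Crude salience: count of marker phrases in the record."""
    text = record.lower()
    return sum(text.count(m) for m in SALIENCE_MARKERS)

def consolidate(buffer: list[str], keep: int = 2) -> list[str]:
    """Step 2: keep the most salient records; discard the rest."""
    return sorted(buffer, key=salience, reverse=True)[:keep]

def retrieve(memories: list[str], query: str, top_k: int = 1) -> list[str]:
    """Step 3: cosine similarity over word counts, an embedding stand-in."""
    def vec(s): return Counter(s.lower().split())
    def cos(a, b):
        dot = sum(a[w] * b[w] for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0
    q = vec(query)
    return sorted(memories, key=lambda m: cos(vec(m), q), reverse=True)[:top_k]

buffer = [
    "user issued a correction: the API base URL is v2, remember this",
    "small talk about the weather",
    "task complete: migration script confirmed working",
]
memories = consolidate(buffer)   # Step 1's full buffer triggers Step 2
print(retrieve(memories, "which API URL should I use?"))
```

    The key property the sketch preserves: the model weights never change; memory lives entirely in what survives consolidation and what retrieval surfaces.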

    The Four-Phase Mechanism

    AutoDream operates in four phases during its background execution. Phase 1 (inventory): the sub-agent reads the current MEMORY.md file and catalogs every entry by topic, timestamp, and relevance category. Phase 2 (deduplication): entries that convey the same information in different words are merged. Phase 3 (temporal resolution): relative timestamps (“yesterday,” “last week”) are converted to absolute dates based on the session timestamp. This prevents temporal drift where “recently” accumulates entries that are months old. Phase 4 (pruning): entries that are no longer relevant (completed tasks, resolved bugs, outdated preferences) are removed based on staleness heuristics.
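
    Phase 3 is the most mechanical of the four and easy to make concrete. A sketch, assuming a small illustrative phrase table rather than AutoDream's actual rules:

```python
# Sketch of Phase 3 (temporal resolution): rewriting relative timestamps
# to absolute dates anchored on the session timestamp. The phrase table
# is an illustrative subset, not AutoDream's actual rules.
from datetime import date, timedelta
import re

RELATIVE = {
    "yesterday": timedelta(days=1),
    "last week": timedelta(weeks=1),
    "last month": timedelta(days=30),
}

def resolve_times(entry: str, session_day: date) -> str:
    """Replace each relative phrase with the absolute date it referred to."""
    for phrase, delta in RELATIVE.items():
        absolute = (session_day - delta).isoformat()
        entry = re.sub(phrase, f"on {absolute}", entry, flags=re.IGNORECASE)
    return entry

entry = "Fixed the auth bug yesterday; refactor planned last week"
print(resolve_times(entry, date(2026, 3, 15)))
# Fixed the auth bug on 2026-03-14; refactor planned on 2026-03-08
```

    Pinning dates at consolidation time is what prevents the "recently" drift the paper describes: once resolved, the entry stays accurate no matter how much later it is re-read.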

    The 200-line cap on MEMORY.md is an engineering constraint, not an arbitrary limit. Claude Code’s context window has a finite budget, and MEMORY.md is loaded at the start of every session. A 2,000-line memory file would consume context that should be available for the actual coding task. The 200-line limit forces AutoDream to prioritize: keep the information that most affects code generation quality, discard the rest. This is lossy compression, and it means long-running projects will lose some historical context over time.

    What the REM Sleep Analogy Gets Right and Wrong

    Biological REM sleep memory consolidation involves hippocampal replay: the brain replays recent experiences and transfers salient patterns to neocortical long-term storage. The AutoDream analogy captures the structural similarity: both processes run during downtime, both compress episodic experience, both use salience weighting to determine what survives compression. The analogy breaks down at the mechanism: biological consolidation modifies synaptic weights across neural circuits, while AutoDream uses a separate model to produce text summaries that are retrieved via embedding similarity.

    Lossy compression with no recovery path: information not flagged as salient by the consolidation model is permanently discarded. Unlike biological memory, there is no mechanism to recover the original episodic record once the buffer is flushed. Consolidation model quality determines memory quality: the salience weighting is only as good as the consolidation model's judgment, and if it systematically underweights certain types of information, those memories are lost across sessions. Cold start for new task types: AutoDream works best for agents with extended operational history; a fresh agent, or an agent facing a novel task type, has no consolidated memory to draw on.

    The UC Berkeley Paper Behind It

    AutoDream is grounded in research from UC Berkeley on memory consolidation in artificial agents (published February 2026). The paper demonstrated that LLM-based agents that periodically consolidate their memory files outperform agents with unlimited memory growth on task completion benchmarks. The counterintuitive finding: more memory is worse. Agents with thousands of memory entries suffered from retrieval interference, where relevant memories were buried under irrelevant ones, degrading performance. Periodic consolidation improved retrieval precision and downstream task accuracy.

    The biological analogy to REM sleep is not just marketing. During human REM sleep, the hippocampus replays daily experiences and the prefrontal cortex decides which to consolidate into long-term memory and which to discard. AutoDream implements an analogous process: replay (read all entries), evaluate (assess relevance and redundancy), consolidate (merge and compress), and prune (discard).

    Observed Performance

    One documented case consolidated 913 sessions of accumulated memory entries in under 9 minutes. The pre-consolidation MEMORY.md was over 800 lines. The post-consolidation file was 187 lines. The user reported that Claude Code’s responses in subsequent sessions were more contextually accurate because the memory file contained higher-signal entries without noise.

    The limitation Anthropic has not addressed: AutoDream runs on a schedule determined by Anthropic’s backend, not on user demand. Users cannot trigger a consolidation manually, cannot review what AutoDream plans to prune before it executes, and cannot recover entries that AutoDream removes. For long-running projects with historical context that matters months later, this is a real risk. Anthropic has acknowledged the limitation but has not shipped a solution.

    The practical implication for Claude Code users: agents running on long-horizon software development tasks (where the same codebase context, architectural decisions, and debugging history are relevant across hundreds of sessions) are the primary beneficiaries. The consolidation system allows the agent to maintain project-level context that would otherwise be lost at the context window boundary, without requiring the user to manually re-provide it each session.

    The broader question AutoDream raises is whether AI agents should manage their own memory autonomously or whether memory management should remain under user control. The current implementation assumes Anthropic knows better than the user which memories matter. For most developers using Claude Code for routine coding tasks, this assumption is correct. For researchers, long-term project leads, or users with domain-specific context that general heuristics cannot evaluate, the assumption may be wrong. As of March 2026, Anthropic’s answer is “the AI does, with heuristics we designed.” Users who disagree have no override mechanism.

    Sources: Anthropic AutoDream preprint, arXiv March 2026; Claude Code release notes; Walker, “Why We Sleep” (2017) for biological context; Packer et al., “MemGPT” (2023) for prior memory architecture work.

  • Merrill Lynch’s 15,000 Advisors Now Have an AI System That Does 4 Hours of Meeting Prep in Minutes

    Enterprise AI — March 2026

    Merrill Lynch Deployed AI to Every Client Meeting.

    Bank of America’s Erica AI has moved from mobile banking assistant to active participant in financial advisor meetings, integrating with Salesforce CRM and Zoom.

    Bank of America’s Merrill Lynch wealth management division announced in March 2026 that its Erica AI assistant has been integrated into the financial advisor meeting workflow. During client calls conducted over Zoom, Erica now surfaces relevant portfolio data, product recommendations, and compliance flags to the advisor in real time, through a sidebar panel connected to Salesforce Financial Services Cloud. The integration covers Merrill Lynch’s roughly 15,000 financial advisors.

    How the System Actually Works

    AI-Powered Meeting Journey integrates three systems: Bank of America’s Erica AI platform (originally launched in 2018 for consumer banking), Salesforce CRM, and Zoom’s meeting infrastructure. Before a client meeting, the system pulls the client’s account history, recent transactions, portfolio performance, and prior meeting notes from Salesforce. It generates a briefing document that summarizes the client relationship, highlights items requiring attention (large deposits, portfolio rebalancing triggers, life events), and suggests talking points.

    During the meeting, the system records and transcribes the conversation via Zoom’s AI Companion. After the meeting, it generates a summary, extracts action items, identifies follow-up commitments, and creates tasks in Salesforce CRM. The advisor reviews and approves the outputs before they are saved. The human-in-the-loop approval step is non-negotiable in financial services: regulatory requirements (SEC, FINRA) mandate human oversight of client communications and account actions.

    How the Meeting Intelligence Architecture Works

    Pre-meeting: CRM context loading. When an advisor opens a Zoom meeting linked to a Salesforce contact, Erica automatically loads the client’s portfolio summary, recent transaction history, life event flags (retirement date approaching, beneficiary changes), and any open service cases. The advisor sees this context before the first word is spoken.

    During meeting: real-time suggestion engine. Erica listens to the meeting transcript (with client consent) and surfaces product suggestions when relevant topics arise. If a client mentions college savings, Erica flags 529 plan options. If the client mentions a recent inheritance, Erica flags estate planning resources. These appear as advisor-only sidebar cards.

    Post-meeting: automated CRM update. After the call, Erica drafts a CRM note summarizing discussed topics, flagged follow-ups, and any product recommendations surfaced during the meeting. The advisor reviews and approves before it is saved to Salesforce. All AI suggestions are logged with timestamps for FINRA compliance audit purposes.
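
    At its simplest, the during-meeting step maps trigger topics in the live transcript to advisor-only cards. A toy sketch; the production system presumably uses intent models rather than the keyword table assumed here:

```python
# Toy sketch of the during-meeting suggestion step: trigger phrases in a
# transcript chunk map to advisor-only sidebar cards. The trigger table
# and card text are illustrative, not Bank of America's actual rules.

TRIGGERS = {
    "college": "529 plan options",
    "inheritance": "estate planning resources",
    "retiring early": "Social Security timing / healthcare coverage review",
}

def sidebar_cards(transcript_chunk: str) -> list[str]:
    """Return advisor-only cards for any trigger topics in this chunk."""
    text = transcript_chunk.lower()
    return [card for phrase, card in TRIGGERS.items() if phrase in text]

print(sidebar_cards("We're starting to think about college for our daughter"))
# ['529 plan options']
```

    The essential design point survives the simplification: the cards go to the advisor, never to the client.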

    Why the Compliance Layer Is the Hard Part

    FINRA requires that every product recommendation made by a registered representative pass a suitability analysis specific to the client. An AI that suggests a product without a traceable suitability determination is a compliance liability. Bank of America’s implementation logs every Erica suggestion, records whether the advisor accepted or dismissed it, and links each suggestion to the client’s current suitability profile. If an advisor acts on an Erica suggestion, the audit trail shows the AI’s recommendation, the client’s profile at that moment, and the advisor’s approval decision.
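
    The audit trail described above can be sketched as a per-suggestion record. Field names here are illustrative assumptions, not Bank of America's schema:

```python
# Sketch of the suitability audit trail: each AI suggestion is logged with
# a timestamp, a snapshot of the client's suitability profile, and the
# advisor's decision. Field names are illustrative assumptions.
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class SuggestionAuditRecord:
    client_id: str
    suggestion: str
    suitability_profile: dict          # profile snapshot at suggestion time
    advisor_decision: str = "pending"  # later "accepted" or "dismissed"
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

record = SuggestionAuditRecord(
    client_id="C-1042",
    suggestion="529 plan options",
    suitability_profile={"risk_tolerance": "moderate", "horizon_years": 15},
)
record.advisor_decision = "accepted"
print(asdict(record))
```

    Snapshotting the profile at suggestion time, rather than linking to a live record, is what lets the audit trail show the client's state "at that moment" if a regulator asks later.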

    Erica does not make recommendations to clients directly. Every suggestion goes through the advisor, who must exercise independent judgment before acting. The AI is a context engine, not a decision maker. This is the only architecture that passes FINRA review. The system also does not handle complex tax planning, estate structuring, or custom portfolio construction. It is optimized for surface-level product matching and follow-up flagging, not for the nuanced analysis that justifies Merrill Lynch’s advisor compensation model.

    Why the 8-Year Build Matters

    Bank of America launched Erica in 2018, four years before ChatGPT made AI assistants mainstream. Erica started as a simple mobile banking chatbot handling balance inquiries and bill payments. Over eight years, the system processed over 2 billion client interactions, building a training corpus of financial conversations, client intent patterns, and regulatory-compliant response templates that no competitor can replicate quickly.

    The “build once, deploy many” strategy means Erica’s capabilities now extend from consumer banking (where it started) to wealth management (Meeting Journey), to commercial banking and internal operations. Each deployment adds training data that improves the underlying model. A competitor starting from scratch in 2026 would need years of interaction data to match the nuance of Erica’s understanding of financial client conversations. The data moat is the real competitive advantage, not the AI technology itself.

    Microsoft’s Copilot for Finance offers similar meeting preparation and summarization capabilities as a general-purpose tool. The difference is domain depth: Copilot understands meetings generically. Erica understands financial advisory meetings specifically. It knows that when a client says “I’m thinking about retiring early,” that triggers a cascade of portfolio rebalancing, Social Security timing, and healthcare coverage questions. Generic AI assistants treat this as a calendar scheduling task. Erica treats it as a financial planning event.

    The 15,000-Advisor Deployment Scale

    Deploying an AI system to 15,000 financial advisors simultaneously is a scale that most enterprise AI projects never reach. The logistics include: training 15,000 users on new workflows, integrating with 15,000 individual Salesforce configurations (each advisor has different client segments, product permissions, and compliance requirements), ensuring the system works across different meeting types, and maintaining regulatory compliance across all 50 states.

    Bank of America’s ability to deploy at this scale in one release (rather than a phased rollout over quarters) reflects the institutional engineering capability that distinguishes large financial institutions from fintech startups. The compliance infrastructure, the change management process, the internal training programs, and the IT support capacity already existed. The AI feature plugged into an operational machine built over decades. This is the enterprise deployment advantage that pure-play AI companies cannot replicate: not the technology, but the organizational infrastructure to deploy technology at scale in a regulated environment.

    Sources: Bank of America Q4 2025 earnings call; Merrill Lynch technology announcement; Salesforce Financial Services Cloud press release; FINRA AI guidance, March 2026.