98.4% of Claude Code Is Operational Infrastructure. A New arXiv Paper Maps All of It.

98.4 percent of Claude Code is not the AI. It is the operational infrastructure around it. That figure comes from a formal source-code analysis of the 512,000-line TypeScript codebase, published on arXiv on April 14, 2026 by Jiacheng Liu, Xiaohan Zhao, Xinyi Shang, and Zhiqiang Shen of VILA Lab. The paper, “Dive into Claude Code: The Design Space of Today’s and Future AI Agent Systems,” maps every major subsystem and traces each one back to a design decision. The finding reframes what it means to pick an agentic coding tool. You are not choosing an AI. You are choosing one orchestration layer over another.

The paper arrived four weeks after a misconfigured npm release accidentally shipped Claude Code’s source maps to anyone who installed it. The VILA Lab researchers did not rely on those leaked artifacts. Their analysis is reproducible from the public TypeScript source. But the leak exposed something the formal paper confirms from a different angle: 44 unreleased features behind compile-time feature flags, including an autonomous daemon mode codenamed KAIROS and a background planning system called ULTRAPLAN. The product you install is a small subset of what the engineering investment actually covers.

The While Loop and the 98.4 Percent

The core of Claude Code is a while loop: call the model, run whatever tools the model requests, feed results back, repeat until the model produces a response with no tool calls. That loop fits in a few lines of TypeScript. The paper’s contribution is tracing where the other 98.4 percent lives.
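That loop can be sketched in a few lines. This is an illustrative reconstruction, not the actual Claude Code source: `callModel` stands in for the model API client and `runTool` for the tool dispatcher, and the interfaces are assumptions.

```typescript
// Minimal sketch of the core agent loop (assumed interfaces, not the
// real Claude Code implementation).
type ToolCall = { name: string; input: unknown };
type ModelTurn = { text: string; toolCalls: ToolCall[] };

async function agentLoop(
  callModel: (history: string[]) => Promise<ModelTurn>,
  runTool: (call: ToolCall) => Promise<string>,
  prompt: string,
): Promise<string> {
  const history: string[] = [prompt];
  while (true) {
    const turn = await callModel(history);
    history.push(turn.text);
    if (turn.toolCalls.length === 0) return turn.text; // no tool calls: done
    for (const call of turn.toolCalls) {
      history.push(await runTool(call)); // feed results back, repeat
    }
  }
}
```

Everything the paper maps, permissions, compaction, persistence, hooks, wraps around this loop rather than replacing it.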

The 512,000 lines decompose into seven layered components: User, Interfaces, Agent Loop, Permission System, Tools, State and Persistence, and Execution Environment. The Agent Loop manages the turn-by-turn interaction with the model API. The Permission System decides what the loop is allowed to do. The Tools layer defines what actions are possible. State and Persistence handles what survives between turns and between sessions. Each layer answers a specific design question, and the paper identifies five human values embedded across them: human decision authority, safety and security, reliable execution, capability amplification, and contextual adaptability. Those values yield thirteen design principles, and the principles in turn yield specific implementation choices. That cascade is what makes the analysis useful.

The 17 Percent Comprehension Decline

Embedded in the paper’s discussion of limitations is a finding that deserves its own moment. Early evidence cited in the analysis shows developers working in AI-assisted conditions score 17 percent lower on code comprehension tests than developers working without assistance. The paper does not claim this is caused by Claude Code specifically. It is a concern about the architectural pattern of delegating cognition to an agent that operates under context constraints the developer cannot fully see.

The logic is mechanical. The compaction pipeline is lossy by design. Subagent boundaries prevent any single agent from holding a coherent global view of a large repository. The agent frequently operates without full codebase awareness, and the developer increasingly operates through the agent rather than through the code. Short-term velocity goes up. Long-term mastery goes down. The paper treats this as an open design question, not a verdict, but it is the first time a production analysis has named the tradeoff in a citable way.

The Permission System: Seven Modes and an ML Classifier

Most coverage of Claude Code’s safety model stops at “it asks before doing dangerous things.” The actual system has seven distinct permission modes and a machine-learning classifier that participates in decisions when enabled.

The seven modes run from fully manual, where the user approves every action, to fully automatic, where pre-configured rules handle all decisions without interruption. The default sits in the middle: read-only operations auto-approve, write operations require confirmation, shell execution of unfamiliar commands requires explicit consent. The deny-first gate means anything not explicitly permitted is blocked by default.
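The default-mode decision order can be sketched as a small function. The rule shapes and names (`registered`, `allowlist`, the `Action` fields) are assumptions for illustration, not the real Claude Code types; only the ordering, deny-first, then explicit allows, then the read/write/shell distinctions, follows the paper's description.

```typescript
// Hedged sketch of the default-mode, deny-first permission gate.
type Decision = "allow" | "ask" | "deny";
type Action = { tool: string; mutates: boolean; isShell: boolean };

function decide(action: Action, registered: Set<string>, allowlist: Set<string>): Decision {
  if (!registered.has(action.tool)) return "deny"; // deny-first: anything unknown is blocked
  if (allowlist.has(action.tool)) return "allow";  // explicitly permitted by a configured rule
  if (action.isShell) return "ask";                // unfamiliar shell commands need explicit consent
  if (!action.mutates) return "allow";             // read-only operations auto-approve
  return "ask";                                    // write operations require confirmation
}
```

The other six modes would amount to different orderings or short-circuits of these same checks.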

The ML-based classifier, called yoloClassifier internally, engages when the TRANSCRIPT_CLASSIFIER feature flag is enabled. It loads three prompt resources at runtime: a base system prompt, an external permissions template, and, for Anthropic-internal users, a separate internal template. The classifier reads the conversation transcript and the proposed tool call and returns a risk assessment. A denial is not a hard stop. The model receives the denial reason and is expected to propose a safer alternative on the next loop iteration.

The most uncomfortable finding about permissions is the 93 percent approval rate. Users approve approximately 93 percent of permission prompts without inspection. This produces approval fatigue: the safety architecture relies on human decision authority, but when approval becomes a habitual reflex, the authority is functionally absent. When Anthropic’s own user research surfaced this pattern, the engineering response was to restructure permission boundaries so fewer prompts appear, rather than add more warnings. Fewer, higher-stakes prompts outperform frequent low-stakes ones. That is a defensible choice. It is also an admission that human-in-the-loop safety models degrade under load.

PreToolUse hooks can modify permission decisions before the user dialog appears. PermissionRequest hooks can resolve decisions asynchronously in coordinator mode. This extensibility means permission logic is not sealed inside the orchestration layer. External code can observe and react to permission events, which is useful for enterprise policy enforcement and also creates an attack surface the paper flags as a pre-trust initialization risk in early security analyses.

Five-Layer Compaction: Context as the Bottleneck

Claude Code’s context window holds the system prompt, CLAUDE.md hierarchy, auto memory, conversation history, file reads, command outputs, tool results, and subagent summaries. In a multi-hour coding session with extensive file operations, this fills the available window. The compaction pipeline is what keeps the agent functional as sessions grow.

The five layers apply cheapest-first. Layer one is the Tool Result Budget: oversized tool outputs, such as file reads spanning thousands of lines or long grep results, are trimmed to a configured maximum before they enter the conversation history. Lossless for the session’s task completion, zero API cost.

Layer two is Snip Compact: older messages are discarded wholesale to free context space. No summarization, no analysis, just deletion. Information loss is high but cost is zero. Gated behind the HISTORY_SNIP feature flag and primarily used in headless sessions where the user is not watching the terminal.

Layer three is Microcompact: individual tool results within the cached prompt prefix are cleared selectively. This preserves recent context while freeing space from older operations. Cache-aware logic pins tool results that fall within the cached prefix region, because clearing them would invalidate downstream cache entries and produce a net cost increase rather than a savings.

Layer four is full context summarization via an API call: the model reads the conversation history and produces a compressed summary that replaces the original content. Preserves semantic information at the cost of one additional API call. Auto-compaction triggers at approximately 92 percent context window usage.

Layer five is context collapse, a last resort for sessions where a single large file or tool output refills the context immediately after each summarization. After a few failed compaction attempts, the system stops and surfaces an error rather than entering an infinite summarization loop. This is a deliberate engineering choice. An agent that silently loops on context management produces costs without work. An agent that surfaces an error preserves user awareness.
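The cheapest-first ordering across the layers can be sketched as a dispatcher. The ~92 percent trigger comes from the paper; the layer internals, cost values, and the `Layer` shape are illustrative assumptions.

```typescript
// Sketch of a cheapest-first compaction dispatcher over abstract layers.
type Layer = { name: string; cost: number; apply: (used: number) => number };

function compact(
  used: number,
  capacity: number,
  layers: Layer[],
): { used: number; applied: string[] } {
  const applied: string[] = [];
  // Try layers in ascending cost order until usage drops below the trigger.
  for (const layer of [...layers].sort((a, b) => a.cost - b.cost)) {
    if (used / capacity < 0.92) break; // auto-compaction trigger (~92%)
    used = layer.apply(used);
    applied.push(layer.name);
  }
  // Layer five in spirit: if nothing freed enough space, surface an error
  // rather than looping on summarization forever.
  if (used / capacity >= 0.92) throw new Error("context collapse: compaction failed");
  return { used, applied };
}
```

The key property is that an expensive API-backed summarization only runs after the free trims have failed to bring usage under the threshold.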

Beyond compaction, several subsystem choices address context scarcity. CLAUDE.md files load lazily: nested-directory instruction files load only when the agent reads files in those directories. MCP tool schemas are deferred by default, with only tool names loaded at session start and full schemas pulled on demand via ToolSearch. Subagents return only summary text to the parent session, not their full conversation history. Each choice answers the same question: what needs to be in context right now, versus what can be reconstructed on demand?
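The lazy CLAUDE.md pattern can be sketched as follows. The file layout matches the paper's description (nested instruction files load only when files under that directory are touched); the function shape, the `loaded` cache, and the injected `readFileSync` are assumptions for illustration.

```typescript
// Sketch of lazy CLAUDE.md loading: walk up from a touched file and load
// any not-yet-seen CLAUDE.md along the way. Root-level handling is elided.
function lazyInstructions(
  readFileSync: (path: string) => string | null,
  loaded: Map<string, string>,
  touchedFile: string,
): Map<string, string> {
  const dirs = touchedFile.split("/").slice(0, -1); // drop the filename
  for (let i = 1; i <= dirs.length; i++) {
    const candidate = dirs.slice(0, i).join("/") + "/CLAUDE.md";
    if (!loaded.has(candidate)) {
      const content = readFileSync(candidate);
      if (content !== null) loaded.set(candidate, content); // enters context only now
    }
  }
  return loaded;
}
```

A session that never touches a directory never pays the context cost of its instruction file.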

The Scaffolding Thesis: What the 98.4 Percent Means

The paper makes a direct claim: “as models converge in capability, the scaffolding becomes the differentiator.” The 98.4 percent figure is the quantitative argument. If 98.4 percent of the system is not model weights, then 98.4 percent of the engineering investment is in orchestration, context management, permissions, and persistence. Two agents running identical models on identical tasks can produce substantially different results if their scaffolding differs.

The paper tests this claim by comparing Claude Code against OpenClaw, an independent open-source agent system that answers many of the same architectural questions from a different deployment context. Claude Code uses per-action safety classification through deny-first gates and ML classifiers. OpenClaw uses perimeter-level access control. Claude Code manages context through a five-layer compaction pipeline. OpenClaw uses gateway-wide capability registration. Both are defensible designs. They reflect different deployment realities: a local CLI tool with single-repository scope versus an embedded runtime within a gateway control plane.

The paper cites an Anthropic internal survey of 132 engineers and researchers that found approximately 27 percent of Claude Code-assisted tasks were work the user would not have attempted without the tool. Not faster. Not cheaper. Not attempted at all. This suggests the architecture enables qualitatively new workflows, which is a different claim than productivity improvement. Whether that pattern holds across a broader population than 132 Anthropic employees is an open question the paper notes.

Four Extensibility Mechanisms and the Pre-Trust Initialization Risk

Claude Code’s extensibility has four mechanisms: MCP, plugins, skills, and hooks. Each injects into the agent loop at a different point, and each injection point is also a potential attack surface.

MCP is the primary external tool integration path. Servers are configured from four scopes: project, user, local, and enterprise, with additional plugin and claude.ai servers merged at runtime from services/mcp/config.ts. The MCP client supports multiple transport types including stdio, SSE, HTTP, WebSocket, SDK, and IDE-specific variants. Connected servers contribute tool definitions as MCPTool objects. The tool search mechanism means MCP tool names appear in context at session start, with full schemas loaded only when the agent needs a specific tool.
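The deferred-schema mechanism can be sketched as a name registry that fetches on first use. The `ToolSearch` class name mirrors the mechanism the paper describes, but this shape, and the `fetchSchema` callback, are illustrative assumptions.

```typescript
// Sketch of deferred MCP tool schemas: names in context at session start,
// full schemas pulled on demand and cached.
type MCPTool = { name: string; schema?: object };

class ToolSearch {
  private tools = new Map<string, MCPTool>();

  constructor(names: string[]) {
    for (const name of names) this.tools.set(name, { name }); // names only
  }

  async resolve(name: string, fetchSchema: (n: string) => Promise<object>): Promise<MCPTool> {
    const tool = this.tools.get(name);
    if (!tool) throw new Error(`unknown tool: ${name}`);
    if (!tool.schema) tool.schema = await fetchSchema(name); // fetch once, cache
    return tool;
  }
}
```

With dozens of connected servers, keeping only names resident can save thousands of context tokens per session.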

Plugins extend the tool registry and context setup at initialization. Skills are markdown files loaded into context that define specific capabilities, with the agent reading skill descriptions at session start but loading full content on demand. Hooks intercept the agent loop at designated points: PreToolUse, PostToolUse, PermissionRequest, and PermissionDenied. External code registered at these hooks can modify behavior, enforce policies, or trigger side effects.

The paper flags a specific early security finding about initialization order: hooks and MCP servers initialized before the deny-first safety pipeline was fully engaged, creating a window where extension code ran in a pre-trust state. This has been addressed in subsequent releases, but the principle generalizes. Every injection point is a potential attack surface, and the timing of when safety checks come online is its own design decision. The ToolHijacker research documented exactly this class of vulnerability at the tool selection layer, with a 96.7 percent attack success rate on GPT-4o.

Session Persistence: Append-Only by Design

Session storage is append-only JSONL. Conversations are never destructively edited on disk. When a session is resumed or forked, the system rebuilds the conversation history using preserved boundary metadata stored as UUID chains. The design prioritizes auditability over query power. Every interaction is recoverable. Nothing is silently overwritten.
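The append-and-rebuild pattern can be sketched with one record type. The JSONL format and UUID-chain rebuild follow the paper's description; the field names (`parentUuid`, `role`, `text`) are illustrative assumptions about the record shape.

```typescript
// Sketch of append-only JSONL session storage with UUID-chain rebuild.
type SessionRecord = { uuid: string; parentUuid: string | null; role: string; text: string };

function appendRecord(log: string[], rec: SessionRecord): void {
  log.push(JSON.stringify(rec)); // append-only: lines are never edited in place
}

function rebuildChain(log: string[], leafUuid: string): SessionRecord[] {
  const byUuid = new Map<string, SessionRecord>(
    log.map((line) => JSON.parse(line) as SessionRecord).map((r) => [r.uuid, r]),
  );
  const chain: SessionRecord[] = [];
  let cur = byUuid.get(leafUuid);
  while (cur) {
    chain.unshift(cur); // walk parent pointers back to the root
    cur = cur.parentUuid ? byUuid.get(cur.parentUuid) : undefined;
  }
  return chain;
}
```

A fork is just a second record pointing at the same parent; both branches remain recoverable from the one log.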

The permission model does not persist across sessions. Session-scoped permissions are not serialized when a session closes. When a session resumes, trust must be re-established. This is a deliberate safety-conservative choice: preventing stale authorizations from migrating into a modified codebase is worth the friction of re-approving permissions in a new session. The file-history checkpoint system, stored at ~/.claude/file-history/sessionId, enables --rewind-files rollback independent of the session permission state.

Limitations and the Six Open Design Directions

The bounded context window and lossy compaction pipeline mean the agent frequently operates without full codebase awareness. The subagent isolation that keeps context costs manageable also prevents any single agent from holding a coherent global view. The 17 percent comprehension decline finding points at the downstream cost of this design: the developer increasingly operates through the agent rather than through the code.

The approval fatigue finding points to a structural tension the paper does not resolve. Human decision authority is a stated design value. The implementation relies on users making meaningful approval decisions. At 93 percent approval rates, that assumption does not hold in practice. The paper documents this as an open design question.

Among the six open design directions the paper identifies for future agent systems are better mechanisms for preserving long-horizon codebase coherence, improved context management that prioritizes semantically important content over recency, permission models that adapt to demonstrated trustworthiness over time, and better support for parallel workstreams without the information isolation that subagent boundaries impose. None of these are solved. All are active research problems.

What This Means for Developers Building on Agent Infrastructure

The paper’s practical contribution is a vocabulary for evaluating agent scaffolding. The five-layer compaction pipeline, the seven-mode permission system, the four extensibility injection points, and the append-only session storage are not Claude Code curiosities. They are design patterns every production agent system must resolve one way or another. Understanding how Claude Code resolves them is useful whether you are using Claude Code directly, building on top of it, or designing a competing system.

The finding that scaffolding is the product generalizes. In one experiment spanning 16 models, changing only the edit tool format lifted Grok Code Fast 1 from 6.7 percent to 68.3 percent on a coding benchmark. No model weights changed. The scaffolding change was the variable. The MCPShield analysis confirms the same dynamic from a security perspective: protocol-level controls around the model do more to determine what the model can and cannot do than the model's internal capabilities.

The competitive implication for the industry is uncomfortable. If 98.4 percent of the engineering value is in scaffolding rather than weights, then the moat around Claude Code is TypeScript, not training runs. That is reproducible by any team willing to invest. The 44 unreleased features behind feature flags, including KAIROS and ULTRAPLAN, are Anthropic’s current lead. The leak exposed their names but not their code. The race to ship equivalent scaffolding starts now.

The full paper is available at arxiv.org/abs/2604.14228. The companion GitHub repository at VILA-Lab/Dive-into-Claude-Code maintains a living architecture reference with community-contributed analyses.
