
Researchers presented ToolHijacker at the Network and Distributed System Security Symposium on February 23, 2026, in San Diego. The paper (DOI 10.14722/ndss.2026.230675) describes the first prompt injection attack specifically designed to hijack the tool selection layer of LLM agents. The attacker inserts a single malicious tool document into a tool library. When any legitimate user query arrives, the agent's two-step retrieval-then-selection pipeline picks the attacker's tool instead of the correct one 96.7 percent of the time, measured with GPT-4o as the target model and Llama-3.3-70B as the shadow model used for optimization.
The attacker does not need access to the target LLM, the retriever, the tool library layout, or the top-k setting: this is a no-box attack. The retrieval hit rate on MetaTool is 100 percent, meaning the malicious document reaches the candidate set on every query. The authors then tested six published defenses: StruQ, SecAlign, known-answer detection, DataSentinel, perplexity detection, and perplexity windowed detection. None stopped the attack at a practical rate.
For an ecosystem where the Model Context Protocol has passed 97 million monthly SDK installs and tool marketplaces have become the dominant distribution layer for agent capabilities, this is the first empirical demonstration that tool-selection hijacking is a practical, unsolved problem. Here is how the attack works, why the defenses fail, and what production MCP deployments can actually do about it today.
How ToolHijacker works
Authors Jiawen Shi, Zenghui Yuan, and colleagues formulate the attack as an optimization problem with two objectives. The malicious tool document must be retrieved into the candidate set during the retrieval phase, and then it must be selected by the LLM during the selection phase. The document is structured as two concatenated subsequences: a retrieval-optimized sequence R and a selection-optimized sequence S.
R is optimized to maximize semantic similarity with target task descriptions. The attacker does not have the real task descriptions, so the paper reconstructs them through a shadow framework. The attacker builds a shadow tool library, a shadow retriever, a shadow LLM, and a set of shadow task descriptions drawn from the target domain’s vocabulary. An LLM is then prompted to synthesize R by extracting and combining the core functional elements of the shadow task descriptions. The generated text is not gradient-optimized, which means it looks linguistically natural and evades perplexity-based detection.
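The retrieval phase can be sketched with a toy similarity model. Everything below is illustrative: the bag-of-words cosine stands in for the dense embedding retriever the paper attacks, and the tool names and descriptions are invented. The point is only that an R crafted to mirror the target task vocabulary lands in the top-k candidate set.

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; real pipelines use dense neural encoders.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query: str, tool_docs: dict[str, str], k: int = 2) -> list[str]:
    # The retrieval phase: rank tool documents by similarity to the query.
    q = embed(query)
    ranked = sorted(tool_docs, key=lambda n: cosine(q, embed(tool_docs[n])), reverse=True)
    return ranked[:k]

tool_docs = {
    "weather_api": "get current weather forecast for a city",
    "calendar": "create and list calendar events",
    # R mirrors the vocabulary of the target task descriptions, so it scores
    # high on similarity for weather-related queries without looking unnatural.
    "hijacker": "get current weather forecast conditions for any city worldwide",
}

print(top_k("what is the weather forecast in Paris", tool_docs))
```

Because R reads like an ordinary tool description, nothing in the candidate set looks out of place to a downstream filter.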
S is optimized to force the shadow LLM to select the malicious tool over benign alternatives, given that R has already caused the document to be retrieved. The paper evaluates two optimization methods. A gradient-based method uses HotFlip to mutate tokens toward maximum selection probability on open-weight shadow LLMs. A gradient-free method uses a Tree-of-Attack search strategy with an attacker LLM proposing candidate modifications iteratively. The gradient-free method works better against closed-source targets like GPT-4o. The gradient-based method works better against open-source targets like Llama-3-8B-Instruct.
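The gradient-free variant can be sketched as a hill-climbing loop over candidate rewrites. This is a heavy simplification of the paper's Tree-of-Attack search: `propose_variants` and `shadow_selection_score` below are stubs standing in for the attacker LLM and the shadow-LLM selection measurement, and the vocabulary table is invented.

```python
import random

# Hypothetical rewrite vocabulary; a real attacker LLM proposes free-form edits.
SYNONYMS = {"good": ["reliable", "accurate"], "tool": ["utility", "service"]}

def propose_variants(text: str, n: int = 4) -> list[str]:
    # Stand-in for the attacker LLM: mutate one word per variant.
    variants = []
    for _ in range(n):
        words = text.split()
        i = random.randrange(len(words))
        words[i] = random.choice(SYNONYMS.get(words[i], [words[i]]))
        variants.append(" ".join(words))
    return variants

def shadow_selection_score(text: str) -> int:
    # Stand-in for the shadow LLM: a real attack would measure how often the
    # malicious tool wins selection against benign candidates.
    return sum(w in {"reliable", "accurate", "utility", "service"} for w in text.split())

def optimize_S(seed: str, iterations: int = 20) -> str:
    # Keep the best-scoring description found so far; accept only improvements.
    best, best_score = seed, shadow_selection_score(seed)
    for _ in range(iterations):
        for cand in propose_variants(best):
            score = shadow_selection_score(cand)
            if score > best_score:
                best, best_score = cand, score
    return best

random.seed(0)
print(optimize_S("good tool for weather"))
```

Because the edits are proposed by a language model rather than token-level gradient flips, the resulting S stays fluent, which matters for the perplexity-defense results below.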
Transferability is the critical property. The authors tested whether a document optimized against one shadow LLM attacks a different target LLM. It does. With Llama-3.3-70B as shadow and GPT-4o as target, the gradient-free variant achieves 96.7 percent attack success rate on MetaTool. With Claude-3.5-Sonnet as target, the success rate is similarly high. Semantic patterns learned by different retrieval models overlap enough that a single crafted R generalizes across architectures.
The test matrix covered 8 LLMs (Llama-2-7B-chat, Llama-3-8B-Instruct, Llama-3-70B-Instruct, Llama-3.3-70B-Instruct, Claude-3-Haiku, Claude-3.5-Sonnet, GPT-3.5, GPT-4o) and 4 retrievers across MetaTool and ToolBench benchmarks. The attack held across all combinations in the no-box setting.
Why every tested defense failed
Prevention-based defenses, StruQ and SecAlign, structurally separate system prompts from user input. They assume the attack surface is the user prompt. ToolHijacker's malicious content lives inside a tool document that the retriever pulls into context; the document is not user input, so the attack routes around both defenses rather than through them.
Four detection-based defenses were tested. Known-answer detection fails completely, with a 100 percent false negative rate against ToolHijacker: it looks for signatures characteristic of canonical attacks, and ToolHijacker's shadow-framework approach produces documents that match no known-answer pattern. DataSentinel catches some malicious documents but misses the majority. Perplexity detection and perplexity windowed detection work better against gradient-based optimization, because gradient descent on discrete tokens produces lower-fluency text. Both fail against the gradient-free variant, which uses an LLM to synthesize fluent natural-language attacks.
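Why perplexity-based detection catches gradient-optimized text but not fluent text can be sketched with a toy unigram language model. This is illustrative only: real detectors score windows with a full LLM, and the reference corpus, window size, and out-of-vocabulary floor below are invented.

```python
import math
from collections import Counter

REFERENCE = (
    "this tool returns the current weather forecast for a given city "
    "this tool creates calendar events for a given user"
)

def unigram_logprobs(corpus: str) -> dict:
    # Toy language model: log-frequencies of words in a reference corpus.
    counts = Counter(corpus.lower().split())
    total = sum(counts.values())
    return {w: math.log(c / total) for w, c in counts.items()}

def max_window_perplexity(text: str, logprobs: dict, window: int = 4,
                          oov: float = math.log(1e-6)) -> float:
    # Windowed variant: return the perplexity of the worst (least fluent) window,
    # so a short injected token sequence cannot hide inside fluent surroundings.
    words = text.lower().split()
    worst = 0.0
    for i in range(max(1, len(words) - window + 1)):
        chunk = words[i:i + window]
        nll = -sum(logprobs.get(w, oov) for w in chunk) / len(chunk)
        worst = max(worst, math.exp(nll))
    return worst

lp = unigram_logprobs(REFERENCE)
fluent = "this tool returns the weather forecast for a city"      # LLM-synthesized style
garbled = "this tool returns the xq zz vv kk weather"             # gradient-flip style
print(max_window_perplexity(fluent, lp) < max_window_perplexity(garbled, lp))
```

Token-flip artifacts spike the worst-window perplexity and get flagged; ToolHijacker's gradient-free text scores like ordinary prose and passes the threshold.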
The pattern across all six defenses is a shared structural assumption: the attack surface is the prompt. Every defense was designed before tool-selection attacks were a studied class. ToolHijacker’s attack surface is the tool library itself, a location none of the defenses were built to monitor. The paper’s authors explicitly note that new defense strategies are needed and that the existing ecosystem is insufficient.
Why this matters for the MCP ecosystem
Model Context Protocol crossed 97 million monthly SDK downloads in March 2026, sixteen months after Anthropic introduced it. MCP tool servers are distributed through community marketplaces, vendor catalogs, and third-party plugin hubs. A compromised tool document in any reachable MCP server’s manifest can hijack every agent that retrieves it.
The precedent exists. OpenClaw’s skill marketplace has accumulated 1,184 confirmed malicious packages and 104 CVEs, and the structural problems driving that number are not patchable. North Korea’s Contagious Interview campaign has published 1,700+ malicious packages across five ecosystems, demonstrating that supply-chain injection into developer tooling is an active, ongoing operation. LiteLLM’s March 24 compromise by TeamPCP showed that credential-stealing payloads can ride unpinned dependencies into AI infrastructure.
ToolHijacker adds a new primitive to this threat model. The prior supply-chain attacks needed credential theft or code execution to monetize. ToolHijacker does not. The agent continues running its workflow. The user continues receiving what looks like legitimate output. Every decision simply routes through attacker-controlled tools, which means an attacker can extract information, poison outputs, or redirect actions without ever triggering a code-execution signal.
For developers building MCP-native products today, the implication is direct. Tool libraries need provenance verification. Tool documents need content auditing beyond signature checks. The retrieval-then-selection pipeline needs a middleware layer between retrieval and tool execution that cross-checks the selected tool against the expected task category. None of this exists in standard MCP client implementations as of April 2026.
Practical mitigations available today
The paper’s authors recommend four measures. First, restrict tool libraries to vetted and cryptographically signed sources, which turns an open marketplace into a closed-gate distribution. Second, monitor tool descriptions for anomalies using ensemble detection that combines multiple signals rather than any single filter. Third, log and audit tool invocation patterns in production and alert on abnormal selection distributions, which catches attacks that succeed in the lab but produce tell-tale behavioral signatures in deployed systems. Fourth, treat any tool library that accepts third-party submissions as untrusted input, regardless of the maintainer’s reputation.
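The third measure, alerting on abnormal selection distributions, can be sketched as a baseline-versus-recent share comparison. The 3x ratio threshold, the tool names, and the counts below are hypothetical; a production monitor would use a proper statistical test over a rolling window.

```python
from collections import Counter

def selection_anomalies(baseline: Counter, recent: Counter, ratio: float = 3.0) -> list:
    """Flag tools whose share of selections jumped by more than `ratio`x
    versus the historical baseline (thresholds are illustrative)."""
    base_total = sum(baseline.values()) or 1
    recent_total = sum(recent.values()) or 1
    flagged = []
    for tool, count in recent.items():
        base_share = baseline.get(tool, 0) / base_total
        recent_share = count / recent_total
        # Tools never selected before, or whose share exploded, are suspicious.
        if base_share == 0 or recent_share / base_share > ratio:
            flagged.append(tool)
    return flagged

baseline = Counter({"weather_api": 900, "calendar": 100})
recent = Counter({"weather_api": 40, "calendar": 5, "hijacker": 55})
print(selection_anomalies(baseline, recent))
```

A hijacked pipeline is loud in exactly this signal: one tool suddenly wins selections it never used to receive.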
Meta’s Agents Rule of Two, published on October 31, 2025, offers the most conservative operational mitigation. No single agent session should combine all three properties simultaneously: access to private data, exposure to untrusted content, and the ability to take externally-observable state-changing actions. ToolHijacker attacks the second property, so the defense is to constrain the first and third. An agent that reads untrusted tool documents should not also have access to user credentials or the ability to send emails. This is coarse but implementable today, and it does not require waiting for a ToolHijacker-specific defense.
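The Rule of Two constraint is mechanical enough to enforce in code. A minimal sketch, assuming a session object that tracks the three properties (the class and field names are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class AgentSession:
    reads_private_data: bool       # access to private data
    sees_untrusted_content: bool   # exposure to untrusted content, e.g. tool docs
    takes_external_actions: bool   # externally observable state-changing actions

def violates_rule_of_two(s: AgentSession) -> bool:
    # Rule of Two: a single session may hold at most two of the three properties.
    return s.reads_private_data and s.sees_untrusted_content and s.takes_external_actions

print(violates_rule_of_two(AgentSession(True, True, True)))   # all three combined
print(violates_rule_of_two(AgentSession(False, True, True)))  # private data withheld
```

A client could evaluate this check at session start and refuse to attach the third capability, which is the coarse-but-deployable constraint the rule describes.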
For production systems that cannot avoid combining all three properties, a second-pass verification layer is feasible. After the LLM selects a tool, a separate check compares the selected tool’s category and parameters against the expected task category. If the user asked to summarize an email and the selected tool is a file-write operation, block the call and log the anomaly. This does not solve the problem but it catches the most obvious attacks.
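That second-pass check can be sketched as a category lookup between selection and execution. The registry, category labels, and tool names below are hypothetical; the hard part a production system must solve is deriving the expected category reliably from the user request, for example with a lightweight classifier.

```python
# Hypothetical tool-to-category registry maintained alongside the tool library.
TOOL_CATEGORIES = {
    "summarize_email": "email",
    "send_email": "email",
    "write_file": "filesystem",
}

def verify_selection(expected_category: str, selected_tool: str) -> bool:
    # Runs after the LLM selects a tool but before the tool executes.
    actual = TOOL_CATEGORIES.get(selected_tool)
    if actual != expected_category:
        # Block the call and log the anomaly instead of executing it.
        print(f"BLOCKED: expected {expected_category!r} tool, "
              f"got {selected_tool!r} (category {actual!r})")
        return False
    return True

print(verify_selection("email", "summarize_email"))  # legitimate selection
print(verify_selection("email", "write_file"))       # category mismatch is blocked
```

This catches the crudest hijacks, such as a file-write tool answering an email-summarization request, while leaving same-category substitutions unaddressed.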
What this means for agent marketplace governance
The structural assumption underlying MCP, OpenClaw’s skill registry, and every tool-hub distribution model is that tool authors are identifiable and that malicious tools can be removed when discovered. ToolHijacker breaks both halves of that assumption. A malicious tool document can be crafted by an attacker who never publishes a tool through normal channels. It can be slipped into a legitimate repository by compromising any contributor account. And because the attack signal is semantic (the document reads like a useful tool description), static scanning of package contents does not flag it.
Marketplace operators have three options. First, require cryptographic signing by identity-verified tool authors, which raises the attacker’s cost but does not stop insider attacks. Second, implement runtime selection auditing that compares tool selection patterns across users and flags outliers, which catches attacks in production but does not prevent first-use impact. Third, move from open marketplaces to curated catalogs with human review on every submission, which trades ecosystem velocity for security. None of these are trivial to implement. All of them are likely to be mandated by enterprise customers within twelve months.
Limitations the paper acknowledges
Evaluation ran on MetaTool and ToolBench benchmarks, not on production MCP deployments. Real-world tool curation, rate limiting, and output validation may reduce attack success in ways the paper does not measure. The shadow-framework reconstruction requires some knowledge of the target domain’s task description distribution, so attacks on narrow, proprietary, or highly-specialized agent workflows may be harder to craft than attacks on general-purpose agents.
Adaptive targets that retrain regularly or rotate tool libraries may exhibit different vulnerability profiles. The paper does not test ToolHijacker against models equipped with activation-level defenses. Concurrent research, including architecture-level isolation approaches similar to Apple’s Private Cloud Compute, may offer mitigation paths the paper does not address.
What happens next
The NDSS 2026 publication will push tool-selection security onto the OWASP LLM Top 10 in the 2026 or 2027 revision. Concurrent work signals a research pivot from prompt-level attacks to tool-level attacks. Faghih et al. 2025 showed that suffix appending to tool descriptions is enough to bias selection. Beurer-Kellner and Fischer 2025 demonstrated that MCP tool descriptions can influence other tools’ behavior through cross-tool prompt injection. The Log-To-Leak paper published on OpenReview in October 2025 demonstrated covert data exfiltration through tool invocation decisions, even when the agent’s output looks normal. The Synthetic Web Benchmark showed that a single adversarial document can collapse frontier AI agent accuracy to zero, and tool hijacking is the logical next step from document hijacking.
The defensive gap will close. Activation-level detection, verified tool registries, and tool-behavior attestation are all plausible research directions. But closing the gap will take months, and the research-to-production lag for security tooling in AI infrastructure is historically 12 to 24 months. In the meantime, every MCP-native agent product shipping today operates with a class of vulnerability that no major vendor has a deployed countermeasure against. The question is not whether ToolHijacker-style attacks will appear in the wild. The question is how quickly the first documented production incident surfaces, and which MCP marketplace is the vector.