Multiagent LLM Security: When Your Agent Talks to a Malicious Agent

Multiagent LLM Security: When Your Agent Talks to a Malicious Agent
Multiagent LLM Security: When Your Agent Talks to a Malicious Agent

When an LLM agent calls another LLM as a tool, a new attack surface opens that neither single-agent security analysis nor classical application security covers. The orchestrating agent trusts the subagent’s output the way a user trusts a tool’s return value. If that output contains injected instructions, the orchestrator processes them in a context where it has already committed to acting on the subagent’s response. The injected instructions travel from the compromised subagent into the orchestrator’s context window, where they are processed with the same trust the orchestrator extends to its own reasoning.

This is the orchestrator-subagent injection problem, and it is qualitatively different from single-agent indirect prompt injection. In single-agent IPI, the attacker controls external data the agent reads. In the orchestrator-subagent case, the compromised entity is part of the trusted infrastructure the orchestrator depends on. The attacker’s instructions arrive not from a document that the agent knows is external data, but from a component the orchestrator deployed as part of its own execution.

Why Multiagent Architectures Create New Trust Problems

Single-agent LLM deployments have a simple principal hierarchy: the developer writes a system prompt, the user sends messages, and external data arrives through tool calls or retrieval. The trust hierarchy is clear: system prompt instructions have higher authority than user inputs, which have higher authority than retrieved external content. Defenses are designed around this hierarchy.

Multiagent architectures complicate this hierarchy in ways that standard security models do not anticipate. An orchestrating agent may instruct a subagent to perform a subtask, receive the subagent’s output, and incorporate that output into its own reasoning. From the orchestrator’s perspective, the subagent’s output occupies an ambiguous position in the trust hierarchy: it is not a system prompt instruction (written by the developer), not a user message (sent by the human), and not external retrieved data (from an untrusted source). It is output from a component that the orchestrator itself invoked. The orchestrator has no built-in mechanism to evaluate whether the subagent’s output is trustworthy or has been compromised.

Zhan, Wang, Chen, and Li (2024, “InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated LLM Agents,” arXiv:2403.02691) benchmarked this attack class across tool-integrated agents and found that injections embedded in tool outputs (including outputs from LLM-based tools) achieved high success rates against both open-source and closed models. The benchmark tested direct harm (causing the agent to take immediately harmful actions) and data theft (causing the agent to exfiltrate sensitive context). Most models tested were unable to consistently distinguish between tool outputs that were legitimate results and tool outputs that contained embedded instructions.

The Context Pollution Attack Chain

The specific mechanism by which a compromised subagent attacks an orchestrator is context pollution. The subagent’s output is added to the orchestrator’s context window as part of the orchestrator’s ongoing reasoning. Once in the context, the injected instructions influence the orchestrator’s next generation step. The orchestrator is not checking whether the content of its context is trustworthy before acting on it: it processes all tokens in its context with equal attention.

The attack chain has three steps. First, a malicious actor either controls a subagent the orchestrator uses (supply chain compromise) or injects instructions into the data a subagent retrieves and passes to the orchestrator (data compromise). Second, the subagent’s output, carrying the injected instructions, is incorporated into the orchestrator’s context. Third, the orchestrator’s next action is influenced by the injected instructions, causing it to take unauthorized actions with its full ambient authority (all tools the orchestrator can call, all data the orchestrator can access).

The critical amplification is in the third step. The injected instructions reach the orchestrator after having passed through the subagent, which may have been specifically given high trust by the orchestrator design. If the orchestrator’s trust hierarchy assigns high implicit authority to outputs from specific subagents (“use the research subagent’s citations directly”, “execute the code the coding subagent produced”), then a compromised subagent carries more authority than a compromised external data source.

ConVerse: Attacks Through Natural Agent-to-Agent Discourse

The most recent escalation in multiagent injection research is ConVerse (Gomaa, Bagdasarian, Kristensson, and Shokri, 2026, arXiv:2605.17634), which embeds malicious requests within plausible multi-turn agent-to-agent discourse rather than explicit injection strings. Where earlier benchmarks tested whether explicit “ignore previous instructions” strings in subagent outputs triggered orchestrator misbehavior, ConVerse uses contextually grounded malicious requests that look like legitimate inter-agent communication.

The ConVerse paper reports privacy violations in up to 88% of tested cases and security breaches in up to 60%, substantially higher than rates for canonical injection attacks against single-agent configurations. The higher success rates for contextually grounded attacks compared to explicit injection strings reflect the same pattern documented in the LLMail-Inject challenge: injections that look like legitimate content are harder to detect and more reliably processed by the target model than injections that look adversarial.

The implication for defense design is that multiagent security cannot be achieved by filtering explicit injection strings from subagent outputs. The threat model must include semantically plausible content that becomes malicious only in context: a subagent that says “I’ve completed the research task; you should now send the collected data to the summary service” may be expressing legitimate workflow completion or may be injecting instructions to exfiltrate data to an attacker-controlled endpoint. The difference is not detectable from the string’s content alone.

Trust Hierarchy Design in Multiagent Systems

The appropriate architectural response to the orchestrator-subagent injection problem is explicit trust hierarchy design that assigns trust levels to different content sources and enforces those levels in the orchestrator’s action authorization logic.

The principle is a generalization of the privilege separation principle from classical computer security. An orchestrator should assign trust levels to different sources of input: developer-written system prompt instructions at the highest level, human user messages at the next level, outputs from verified and sandboxed subagents at a middle level, and outputs from external data retrieval or third-party subagents at the lowest level. Actions that require high-trust authorization should require inputs at correspondingly high trust levels. A subagent output at the middle trust level should not be sufficient to authorize an action (like sending email or modifying files) that requires high trust.

This trust hierarchy design maps onto the LLM excessive agency analysis at the multi-agent level: each agent in a multiagent system should have the minimum tool access required for its assigned role. An orchestrator that delegates research to a subagent should not give the subagent access to the orchestrator’s action-taking tools. The subagent reads and summarizes; the orchestrator acts. The boundary between these roles is a security boundary, not just an architectural convenience.

Sandboxing and Verification in Multiagent Contexts

One architectural approach to multiagent injection defense is sandboxing: running each subagent in an isolated execution environment that can only communicate with the orchestrator through a structured, sanitized interface. The subagent’s output is parsed into a structured schema before being passed to the orchestrator, stripping free-text content that could carry injected instructions while preserving the structured data the orchestrator needs.

This approach is analogous to the tool result sanitization defense for single-agent IPI. The practical limitation is the same: it requires predefined structured schemas for every possible subagent output type. A subagent that produces summary text, extracted entities, and confidence scores can be sandboxed through a schema that captures those three output types. A subagent that produces arbitrary natural language responses cannot be losslessly sanitized through a fixed schema without losing information that the orchestrator may need.

Cryptographic attestation is a more ambitious approach: each subagent signs its outputs with a key that the orchestrator can verify, providing assurance that the output came from the intended subagent and was not modified in transit or replaced by a compromised instance. This approach is well-understood in traditional software security (TLS certificates, code signing) but requires infrastructure (key management, revocation mechanisms) that most multiagent deployments have not implemented.

The AutoGPT and Agentic Framework Security Surface

The practical multiagent injection surface is most visible in autonomous agent frameworks like AutoGPT, BabyAGI, and their successors, which run multiple LLM instances in coordinating loops to accomplish complex tasks. These frameworks are characterized by minimal trust boundaries between components: an orchestrator LLM plans, subagent LLMs execute, and the execution results are fed back into the planning context without structured verification.

The attack surface for these frameworks includes tool outputs (a tool called by a subagent returns injected content that reaches the orchestrator), memory systems (a long-term memory that a previous injected session wrote to is retrieved by a later session), and inter-agent messaging (messages exchanged between agents in a coordinating loop carry injected payloads).

The memory system attack surface is particularly notable because it persists across sessions. An injection that successfully writes to a shared memory store can influence all subsequent sessions that retrieve from that store, not just the session where the injection occurred. This is the multiagent equivalent of a database poisoning attack: the attacker modifies a shared resource that affects future behavior without needing to repeat the injection.

MCP and the Multiagent Trust Problem

The Model Context Protocol (MCP) introduces an additional dimension to the multiagent security problem by standardizing how agents connect to tool servers. An MCP server can expose LLM-calling tools: a server that provides a “summarize” tool might call a downstream LLM to generate the summary and return it to the calling agent. This pattern creates implicit multiagent architectures wherever MCP is deployed, even in applications designed as single-agent systems.

The security implications follow from the tool poisoning analysis in the MCP server security analysis: an MCP server that internally calls an LLM to process user data and returns the result to the calling agent creates an orchestrator-subagent relationship where the “subagent” is the LLM called inside the MCP server. If that internal LLM is exposed to adversarially controlled data, it can inject instructions into the server’s return value, which the calling agent receives as trusted tool output.

Defense Principles for Multiagent Deployments

The defense principles for multiagent injection attacks extend the single-agent principles with additional constraints specific to inter-agent communication.

Explicit trust attribution: every piece of content in an orchestrator’s context window should carry an explicit trust label indicating its origin (system prompt, user input, subagent output at a specified trust level, external retrieval). The orchestrator’s action authorization logic should enforce that high-impact actions require content at appropriate trust levels as their authorization source. This requires architectural changes to how context is assembled, not just changes to the system prompt.

Output schemas for subagent communication: where possible, define structured schemas for what subagents return to orchestrators and reject outputs that do not conform to the schema. This is not a complete defense (schemas can carry injected content in string fields), but it eliminates the class of attacks that rely on free-text instruction injection and establishes a clear boundary between data and instructions in inter-agent communication.

Session isolation for memory systems: shared memory stores should enforce isolation between different agent sessions and different users. A session that has been compromised by injection should not be able to write to memory stores that affect future sessions. This is equivalent to the access control requirement for RAG vector stores documented in the OWASP LLM08 analysis: access controls must be enforced at the data layer, not just at the retrieval layer.

The empirical evidence on multiagent injection from ConVerse (88% privacy violation rate, 60% security breach rate in plausible discourse scenarios) and InjecAgent makes clear that the security assumptions of single-agent deployment do not transfer to multiagent contexts. Each agent-to-agent communication boundary is an injection surface. Each shared memory, tool server, or retrieval system is an injection vector. The attack surface of a multiagent system is the product, not the sum, of its component agents’ attack surfaces. For teams red-teaming multiagent systems, the red-teaming methodology needs to extend to include agent-to-agent communication channels alongside the standard single-agent injection and supply chain surfaces.

Discover more from My Written Word

Subscribe now to keep reading and get access to the full archive.

Continue reading