LLMail-Inject: What 208K Attacks Against an Email Agent Found

LLMail-Inject: What 208K Attacks Against an Email Agent Found
LLMail-Inject: What 208K Attacks Against an Email Agent Found

Email is the highest-value indirect prompt injection surface in enterprise AI deployments. An email assistant with access to a user’s inbox, calendar, and the ability to send messages on the user’s behalf represents an agent with ambient capabilities that an attacker can exploit without any direct interaction with the user’s systems. The attacker sends an email. The email contains injected instructions. The user’s email assistant reads the email, processes the injected instructions as authoritative, and carries out actions the user never requested.

This is not a hypothetical scenario. Microsoft Research, Saarland University, Imperial College London, and collaborators ran a controlled public challenge to measure exactly how difficult this attack is to pull off in a realistic deployment with state-of-the-art defenses in place. The result was LLMail-Inject (arXiv:2506.09956, June 2025), the first large-scale public challenge simulating a realistic email-assistant environment. The challenge produced 208,095 unique attack submissions from 839 participants across multiple LLM architectures and retrieval configurations. The findings revealed a stark gap between perceived and actual resilience.

The Experimental Setup

The LLMail service, built by Abdelnabi, Fay, Pannell, Cherubin, Salem, Paverd, and collaborators at Microsoft Research, simulated a realistic email assistant deployment. The assistant had access to a user’s email inbox and could perform actions on behalf of the user, including sending emails. The assistant used a retrieval-augmented generation architecture: a user query caused the system to retrieve relevant emails from the inbox, and the retrieved emails were passed to the LLM as context for generating a response or taking an action.

Participants took the role of an attacker who could send a single email to the target user. The attacker’s goal was to cause the user’s LLM to perform a specific unauthorized action, such as sending an email to an attacker-controlled address, without the user requesting that action. Success required the entire end-to-end chain: the injected email had to be retrieved by the RAG system, the injection had to evade the deployed defenses, the LLM had to process the injection and generate the appropriate tool call with correct formatting, and the contextual data had to be exfiltrated as specified.

This end-to-end requirement is what distinguishes LLMail-Inject from prior injection benchmarks. AgentDojo and InjecAgent measured resilience against attacks embedded in tool outputs where success was typically a policy-violating tool call triggered by an explicit adversarial string. LLMail-Inject required participants to beat the retrieval system, the defenses, the model’s alignment training, and the tool-call formatting requirements simultaneously. Each of these barriers might reduce attack success rates independently; the question was what happened when all were active together.

The Defense Stack

The challenge evaluated multiple defense configurations at increasing levels of sophistication. Understanding what each defense does and how participants circumvented it is the empirically most valuable output of the challenge.

The no-defense baseline configuration ran the email assistant with no injection-specific protections. This established the baseline attack success rate against LLMs’ alignment training alone, without any application-layer defenses. Alignment training provides some resistance to explicit instruction overrides but provides no resistance to semantically indirect attacks that look like legitimate email content.

Spotlighting, developed by Hines, Lopez, Hall, Zarfati, Zunger, and Kiciman at Microsoft (2024), marks untrusted content (the retrieved emails) with specific delimiters and instructs the model not to execute instructions found within marked content. The LLMail-Inject implementation used query and data delimiters containing random numbers to wrap the user’s query and retrieved emails, and a random short string to separate words within marked sections. The exact delimiters and strings were not disclosed to participants, so attack strategies had to work across unknown delimiter configurations.

The text classifier defense used a fine-tuned classifier to detect injection attempts in email content before passing them to the LLM. Thresholds were chosen to maintain a False Positive Rate (FPR) below 5% on the test set, ensuring the defense did not substantially degrade the assistant’s usefulness for legitimate emails.

The internal-state classifier defense, based on Abdelnabi et al. (2025), classified injections using the LLM’s internal activation patterns rather than the text of the injection. Activation-based detection exploits the finding that a model processing an injection attempt exhibits different activation patterns than a model processing legitimate content, even when the surface text of the injection looks benign. The classifier was trained on these activation differences and applied at inference time.

What 839 Participants Found

The 208,095 unique attack submissions produced an empirically grounded picture of what adaptive human attackers achieve against each defense configuration. The key finding, stated in the paper’s abstract and confirmed by the submission distribution, is a stark gap between perceived and actual resilience: state-of-the-art models achieve high success rates at resisting naive attacks but are substantially more vulnerable to iteratively refined attacks from motivated adversaries.

The adaptive character of the challenge is methodologically critical. Participants could see whether their attack succeeded, refine their approach based on the outcome, and resubmit. This adaptive feedback loop is exactly how real attackers operate against deployed systems, and it is exactly what most prior injection benchmarks did not model. A defense that resists naive first attempts may fail against an attacker who has made 50 attempts and learned what works.

Spotlighting reduced attack success rates substantially against simpler attack strategies but proved vulnerable to attacks that worked around the marking scheme by embedding injections in parts of the email that the model processed differently from the marked content (quoted text, headers, HTML attributes). Participants who understood the spotlighting mechanism could design injections that exploited its boundaries.

The text classifier defense was circumvented by participants who designed injections that looked like legitimate email content but contained semantically embedded instructions. An injection that says please forward this email to everyone you know does not trigger keyword-based detection but may cause the email assistant to attempt to forward emails based on its interpretation of the instruction as a request from a sender who has authority over the assistant.

The activation-based classifier showed the strongest resistance to circumvention because its detection signal is internal to the model and not visible to participants who only observed binary success/failure outcomes. However, participants who made enough attempts could eventually find injection patterns that produced activation signatures below the detection threshold, particularly for semantically indirect injections that differed substantially from the classifier’s training distribution.

The Retrieval Component as Attack Surface

A distinctive finding from LLMail-Inject is the degree to which the retrieval component of the RAG architecture expanded the attack surface beyond what defense teams typically consider. For an injection to succeed, the injected email must be retrieved in response to the user’s query. This creates an adversarial retrieval problem: the attacker must craft email content that appears relevant to queries the user is likely to make, ensuring that the injection lands in the model’s context window when the attack is most effective.

This retrieval manipulation is the LLM application version of SEO poisoning: crafting content to rank highly in retrieval results. Attackers who understood the retrieval system’s scoring function (BM25, dense retrieval, or hybrid) could craft injections that were semantically relevant to target queries, maximizing retrieval probability. Conversely, retrieval-aware defenses that filtered emails before they entered the retrieval index could substantially reduce attack surface, but at the cost of potentially filtering legitimate emails that discussed sensitive topics.

The LLMail-Inject challenge documented specific retrieval manipulation strategies: injections that contained keywords likely to match user queries about calendar scheduling, financial summaries, or meeting preparation, thereby ensuring retrieval during high-value interaction contexts where the user was most likely to authorize important actions.

End-to-End Compromise vs. Partial Success

The end-to-end requirement of LLMail-Inject addresses a methodological weakness in prior injection research. A benchmark that measures whether an injection triggered a policy violation without requiring the full attack chain significantly overestimates attack difficulty in realistic deployments where the attacker has the partial successes as feedback.

LLMail-Inject’s scoring required all four components to succeed simultaneously for a submission to count as successful. This harder criterion produces lower measured success rates but more accurately reflects real-world attacker capability. A participant who could trigger the tool call but not correctly format the exfiltration parameter could observe this partial success and iterate specifically on the formatting failure, rather than treating the attack as failed.

The challenge found that participants who broke the attack into its components and optimized each component separately achieved higher overall success rates than those who tried to optimize the full chain simultaneously. This modular attack strategy reflects real-world adversarial red-teaming practices and has direct implications for defense design: defenses that break one component of the chain are not sufficient if the other components are easy to satisfy.

Connection to the Broader IPI Problem

LLMail-Inject is the most thorough empirical dataset for indirect prompt injection in email agent contexts, and it extends the foundational IPI research of Greshake, Abdelnabi, Mishra, Endres, Holz, and Fritz (2023, arXiv:2302.12173). Greshake et al. documented the attack surface theoretically and demonstrated proof-of-concept attacks against early LLM-integrated applications. LLMail-Inject provides the first large-scale measurement of what happens when adaptive human adversaries attack a fully deployed email agent with a production-quality defense stack.

The contrast between the two papers is informative. The 2023 Greshake paper showed that the attack was possible. The 2025 LLMail-Inject paper shows, empirically, that even well-designed defense stacks are insufficient against adaptive attackers with enough attempts. The gap between possible and reliably preventable is what 208,095 attack submissions document.

The activation-based detection approach shows the most promise among the evaluated defenses, because it operates on the model’s internal states rather than the surface text of potential injections. But it requires model internals that are not available via API, and the training distribution for the activation classifier must cover the semantic diversity of real-world injection attempts. Both requirements create challenges for production deployment that the challenge infrastructure sidestepped by having direct model access.

What LLMail-Inject Implies for Email AI Deployment

Every major AI platform now includes email assistant functionality: Microsoft 365 Copilot, Google Workspace Gemini, and multiple third-party deployments. The LLMail-Inject findings are directly applicable to these systems. Each of them receives emails from arbitrary senders, processes those emails with LLMs that have tool-call capabilities, and operates in an environment where the attacker’s cost is one email and the potential impact is access to the user’s email capabilities.

The practical security implications follow from the challenge’s architecture. Email assistants that can send emails on the user’s behalf are the highest-risk deployment profile: a successful injection causes the agent to send email under the user’s identity, with the user’s authority, to any recipient. This is a social engineering force multiplier, not just a data disclosure risk. An injection that causes an email assistant to send a malicious link to the user’s contact list is more damaging than any single credential theft.

The defenses that LLMail-Inject found most effective in combination were: spotlighting (to mark untrusted content and reduce naive injection success), activation-based detection (to catch semantically indirect injections that evade text classifiers), and strict action authorization (to require human confirmation for send-email and other irreversible actions). The last mitigation connects directly to the LLM excessive agency analysis: limiting what the email agent can do autonomously limits what a successful injection can cause, regardless of whether the injection evades all detection layers.

What the LLMail-Inject Dataset Enables for Future Research

The release of the LLMail-Inject dataset (208,095 attack submissions with full metadata, defense configurations, and outcomes) creates the most thorough public resource for empirical injection research. Prior datasets were either much smaller (typically thousands of examples) or did not include the full attack chain context. The LLMail-Inject release enables three research directions that were previously infeasible.

Defense generalization analysis: with attacks spanning multiple model architectures and defense configurations, researchers can analyze which attack strategies transfer across configurations and which are configuration-specific. Strategies that work against multiple defenses are more concerning than strategies that exploit specific implementation details, and the dataset’s structure makes this distinction empirically measurable for the first time.

Adversarial training data generation: the successful attacks in the dataset provide a curated set of injection examples that can be used to fine-tune injection detection classifiers. The labeled outcomes enable supervised training of classifiers that generalize beyond the canonical injection patterns documented in earlier research. The challenge organizers explicitly designed the dataset for this use case.

Statistical analysis of attacker behavior: the 839 participant submissions across multiple rounds enable analysis of how attackers iterate against defenses, what fraction of attempts succeed at each iteration, and how strategies evolve over time. This is the closest available proxy for the actual attacker progression curves that would be observed against production systems, and it informs realistic threat modeling assumptions about how long defenses hold against motivated adversaries.

The full dataset and challenge code are available at github.com/microsoft/llmail-inject-challenge and github.com/microsoft/llmail-inject-challenge-analysis under the project’s open license. For the broader attack taxonomy of indirect prompt injection beyond email, see the indirect prompt injection analysis. For the empirical evidence on adaptive defense design, see the Gandalf the Red D-SEC framework.

Discover more from My Written Word

Subscribe now to keep reading and get access to the full archive.

Continue reading