Jailbreaking vs Prompt Injection: Two Different LLM Problems

Jailbreaking vs Prompt Injection: Two Different LLM Problems
Jailbreaking vs Prompt Injection: Two Different LLM Problems

Security teams building LLM applications conflate jailbreaking and prompt injection constantly. The conflation matters because the two attacks require different defenses, operate through different channels, implicate different responsible parties, and cannot be solved by the same mechanisms. A team that spends resources on jailbreak resistance while neglecting injection architecture has done nothing to protect against the attacks that compromise production systems.

Simon Willison coined the term “prompt injection” in September 2022 specifically to distinguish it from jailbreaking. The distinction has held up: they are different attacks targeting different things, and the defense asymmetry between them is the most practically important thing to understand about LLM application security.

What Each Attack Targets

Jailbreaking targets the model itself. During training, LLMs learn to refuse certain requests: generating malware, producing violent content, providing synthesis routes for dangerous substances. This refusal behavior is a property of the model’s weights, installed through reinforcement learning from human feedback (RLHF) and similar alignment techniques. A jailbreak is any technique that gets the model to produce outputs its training told it to refuse. The attacker is trying to get past the model’s content policy. Success means generating text the model would normally decline to generate.

Prompt injection targets the application layer. Every LLM application has developer-written instructions: system prompts, tool definitions, retrieval context, conversation history. These instructions define what the application does, what data it accesses, what tools it can call, and how it should behave. Prompt injection overwrites or overrides those instructions with attacker-supplied ones. The attacker is not trying to get the model to say something dangerous; they are trying to make the model do something the application developer did not authorize. Success means redirecting the model’s actions, exfiltrating data, or calling tools in ways the application was not designed to allow.

The IOSEC.IN analysis captured this clearly: jailbreaking is a perimeter problem, where the attacker tries to get past the model’s safety layer. Prompt injection is an interior problem, where an attacker who is already inside the application’s context manipulates what the model does with its legitimate capabilities.

The Attacker Profile Asymmetry

The difference in attacker profiles is as important as the difference in attack targets, and it determines which threat intelligence is relevant for which attack.

In jailbreaking, the attacker is the user. They are directly interacting with the LLM application, crafting inputs in real time, and observing outputs. They know what they are trying to get the model to do. The attack is synchronous: the attacker sends a message, the model responds, the attacker updates their technique based on the response. The entire attack surface is the interface through which the user communicates with the model.

In prompt injection, the attacker is typically not the user at all. In indirect prompt injection, the attacker places malicious instructions in content the system will retrieve and process: a poisoned document, a web page, a database record, a tool call result. The attacker may never interact with the LLM directly. They may not know which users will eventually trigger the attack. The attack is asynchronous: the attacker poisons data and waits. The attack surface is every external data source the application can access.

This asymmetry means jailbreak threat intelligence (specific attack phrases, techniques, adversarial prompts) is only useful for detecting and blocking direct user-facing attacks. It provides no signal for detecting injection attacks that arrive through documents, tool results, or retrieved content. Teams that build their detection capability entirely around blocking known jailbreak patterns are unprotected against injection from external data sources.

The RLHF Paradox

The same training technique that makes models more resistant to jailbreaks makes them more susceptible to prompt injection. This is not a coincidence. It is a consequence of what RLHF optimizes for.

RLHF trains models to follow human instructions more reliably. Human raters evaluate model responses and prefer responses that follow instructions accurately, helpfully, and completely. The training signal pushes the model toward instruction-following. Over thousands of training examples, the model develops a strong prior toward treating any instruction in its context as something it should follow.

For jailbreaking, this training dynamic creates a useful defense: the model has been specifically trained to treat safety refusals as instructions to follow, and RLHF trains it to follow those refusals reliably. A model that has been extensively RLHF-trained is harder to convince to ignore its refusal training, because following instructions (including its own refusal instructions) is exactly what the training shaped it to do.

For prompt injection, the same dynamic is a vulnerability amplifier. A model trained to reliably follow instructions in its context will reliably follow injected instructions in its context. The injection succeeds not despite the model’s training but because of it. The more instruction-following the model is, the more effectively it executes whatever instructions an attacker embeds in a retrieved document or tool result.

The AgentDojo benchmark (Debenedetti et al., NeurIPS 2024) documented this empirically: models that were more instruction-following in general were more useful for legitimate tasks and more vulnerable to injections. Models with stronger refusal training were more resistant to injections but also more likely to fail legitimate tasks. There is no current model that achieves both high injection resistance and high task performance simultaneously.

Product Problem vs Architectural Problem

Jailbreaking is, at its core, a product problem. A jailbreak that works today can be patched. The model provider identifies the technique, adds examples to the training data, retrains or fine-tunes the model, and deploys the update. The attacker publishes a new jailbreak technique; the model provider publishes a patch. This is a familiar security dynamic: it is the same cycle as signature-based antivirus or CVE patching. It is never finished, but it is tractable. OpenAI’s GPT-5 system card reported 99.5%+ not_unsafe rates across harm categories, which reflects years of jailbreak iteration and patching.

Prompt injection is an architectural problem. The root cause is that LLMs have no privilege system: developer instructions, user inputs, retrieved documents, and tool results all arrive as tokens in the same flat sequence, processed by the same attention mechanism. There is no hardware boundary separating trusted instructions from untrusted content. This is not a model behavior that can be trained away without changing what it means for LLMs to follow instructions. Defenses that work at the model layer (training for injection resistance) consistently show the same tradeoff: reduced injection success rates and reduced task performance. No training-only fix has eliminated the vulnerability.

The architectural nature of injection means that defense belongs primarily at the application layer, not the model layer. Access control (limiting what tools and data the agent can reach), output auditing (checking what the model produced against the user’s original intent), and session-level monitoring (detecting unusual behavior patterns across turns) are all application-layer defenses. None of them require a better model. They require a better application architecture.

Responsibility Asymmetry

The product-vs-architectural distinction maps directly onto who is responsible for defense.

For jailbreaking, the model provider is the primary responsible party. OpenAI, Anthropic, Google, and Meta train the models. They have access to the weights, the training data, and the RLHF process. When a jailbreak technique is discovered, the model provider is the only party positioned to train it out. Application developers can add a second layer of output filtering, but they cannot patch the underlying model behavior. The responsibility is with the provider.

For prompt injection, the application developer is the primary responsible party. The model provider cannot make injection impossible without removing the capability that makes LLMs useful for agentic tasks. The application developer decides what external data sources the agent accesses, what tool permissions it carries, what system prompt architecture governs its behavior, and what monitoring catches anomalous actions. A developer who builds an agent with service-role database access that processes user-submitted content has made a security decision that no model update can fix. The responsibility is with the developer.

This responsibility asymmetry has organizational implications. Teams that treat LLM security as entirely the model provider’s problem have misallocated responsibility for injection. Teams that treat all LLM security as the application developer’s problem have misallocated responsibility for jailbreaking. Both parts of the security posture exist, but they require different owners and different mitigation strategies.

Common Techniques and Where They Apply

Some attack techniques apply exclusively to jailbreaking: DAN (Do Anything Now) persona prompts that ask the model to roleplay as an unrestricted AI, many-shot prompting that gradually normalizes restricted content through repeated examples, crescendo attacks that slowly escalate request severity, and encoding tricks that present harmful requests in base64 or other transformations. These all target the model’s content policy directly and are irrelevant to injection attacks that arrive through external data.

Some attack techniques apply primarily to injection: embedding instructions in document text using invisible Unicode characters, placing injected content after enough legitimate text that retrieval systems score it as highly relevant, appending instructions in code comments or HTML attributes that render invisibly in browsers, and multi-step manipulation that establishes context across turns before the actual redirect. These exploit the application’s data pipeline rather than the model’s content policy and are undetectable by jailbreak-focused defenses.

Some techniques overlap: context manipulation that convinces the model a different set of instructions is authoritative can appear in both direct user messages (jailbreak) and injected external content (injection). The detection challenge is different in each case: jailbreak detection applies to user-originated inputs; injection detection applies to content retrieved from external sources. The same technique applied through different channels requires different detection logic.

Defense Mapping

For jailbreaking: the primary defense is model-level. Choose models with strong alignment training. Use output filtering to catch harmful content that bypasses alignment. Monitor for known jailbreak patterns in user inputs. Accept that jailbreak resistance is a continuous arms race that the model provider is fighting on your behalf, and that the current generation of aligned models provides substantial (though not complete) protection for most deployment contexts.

For prompt injection: the primary defense is architectural. Scope tool access to the minimum required. Use per-user OAuth delegation instead of service accounts. Build output auditing that checks whether model actions are consistent with the user’s original request. Implement session-level detection as documented in the Gandalf the Red D-SEC framework: flagging suspicious patterns across turns catches injection attempts that look normal at the individual-turn level. Limit what injections can cause even when they succeed, because no filtering approach eliminates injection entirely.

For MCP-connected agents specifically, the tool description poisoning attack class requires version-pinned server configurations and gateway-level description sanitization, as covered in the MCP server security analysis. This is an injection attack delivered through configuration metadata rather than runtime content, which means jailbreak detection has zero coverage over it.

The OWASP LLM Top 10 for 2025 places prompt injection at LLM01 (the top slot) and covers jailbreaking as a subset of it, which reflects the practical priority but can reinforce the conflation. In OWASP’s framing, jailbreaking is a form of direct prompt injection targeting the model’s safety training. The architectural distinction remains valid for defense planning: addressing LLM01 at the model layer (alignment, refusal training) is not the same as addressing it at the application layer (architecture, access control, monitoring), and both are required.

The Practical Implication

A team building a production LLM application needs both defenses, applied at different layers and owned by different parties. The model provider handles the model layer; the development team handles the application layer. Conflating the two leads to either over-relying on the model provider for protections only the application can provide, or over-investing in application-layer defenses for attacks that are the model provider’s responsibility to handle.

The clearest signal that a team has conflated the two is a security posture that consists entirely of output filtering. Output filtering catches jailbreak outputs that the model produces despite its alignment training. It does nothing to prevent an injection that redirects the model to call a tool with attacker-specified parameters, exfiltrate data through a legitimate tool call, or take actions the user never requested. The tool call happens before the output is generated. By the time output filtering runs, the injection has already succeeded.

Discover more from My Written Word

Subscribe now to keep reading and get access to the full archive.

Continue reading