Red-Teaming LLM Applications: A Practitioner’s Framework

Red-Teaming LLM Applications: A Practitioner’s Framework
Red-Teaming LLM Applications: A Practitioner’s Framework

Red-teaming an LLM application is not the same as red-teaming a traditional web application. A SQL injection test has a clear pass/fail criterion: either the query executed or it did not. An LLM red-team test has three distinct threat surfaces (the model layer, the application layer, and the supply chain), each requiring different attack techniques, different measurement methodologies, and different success criteria. A red-team engagement that tests only one surface and declares the system secure has missed the other two.

This framework organizes LLM application red-teaming around the three threat surfaces, maps each to the available testing methodologies and benchmarks, and describes what to measure beyond block rates. The goal is not a comprehensive catalog of attack techniques but a structured approach to deciding which techniques apply to which surface, in which order, and how to interpret the results.

The Three Threat Surfaces

Before any test is designed, the threat surface needs to be mapped. LLM applications have three distinct surfaces, and conflating them leads to incomplete testing and misdirected remediation.

The model layer covers jailbreaking: attacks that target the model’s content policy to produce outputs the model’s training told it to refuse. Testing the model layer means attempting to elicit restricted content through direct user interaction. The responsible party for defense is the model provider. Red-team findings at the model layer inform model selection and deployment decisions but generally cannot be remediated by the application developer alone.

The application layer covers prompt injection: attacks that override developer-written instructions with attacker-supplied instructions. Testing the application layer means attempting to redirect the model’s actions through user inputs, retrieved documents, tool call results, and any other external content the application processes. The responsible party for defense is the application developer. Red-team findings at the application layer are fully within the developer’s control to remediate through architectural changes.

The supply chain layer covers model and skill poisoning: attacks that compromise the models, adapters, or agent skill files used by the application before they reach deployment. Testing the supply chain layer means verifying the provenance of every model and plugin in the application stack. A supply chain compromise invalidates all application-layer and model-layer defenses simultaneously.

Start With Threat Modeling

Before testing anything, answer the question: what happens if the application is compromised? For a customer support chatbot with read-only knowledge base access, a successful injection can produce incorrect responses. For a coding agent with file system access and shell execution, a successful injection can exfiltrate code and install backdoors. The same injection technique has radically different severity depending on what the agent can do.

OWASP LLM06 (Excessive Agency) is as much a threat modeling concept as a vulnerability class. Before red-teaming, audit the application’s tool access, credential scope, and autonomy level. Every tool and permission the agent carries is a potential consequence of a successful injection. Documenting this before testing produces a consequence matrix: what is the worst-case outcome of a successful attack at each layer? This matrix prioritizes the testing effort and sets the severity threshold for findings.

Testing the Model Layer: Jailbreaking

Model-layer testing evaluates how effectively the deployed model resists content policy bypass. Standard jailbreak testing covers the common technique classes: persona prompts that ask the model to roleplay as an unrestricted AI, many-shot normalization that gradually escalates request severity, crescendo attacks that escalate from adjacent topics, and encoding tricks that transform restricted requests into base64 or other formats.

OpenAI’s GPT-5.4 system card reports 99.5%+ not_unsafe rates across most harm categories. These figures provide a baseline expectation but reflect testing against known techniques. The practical output of model-layer testing is not a security clearance but a risk estimate: what is the probability that a motivated user finds a successful technique, and what is the consequence? For most applications, the model provider’s alignment training provides substantial coverage, and the priority shifts to the application layer where the developer has more leverage.

As analyzed in the jailbreaking vs prompt injection analysis, the RLHF paradox means that models optimized for instruction-following (which improves jailbreak resistance) may be simultaneously more susceptible to prompt injection. Model-layer improvements do not substitute for application-layer defense.

Testing the Application Layer: Prompt Injection

Application-layer testing is where most developer-addressable risk lives and where the most empirically grounded methodologies exist.

The Gandalf-RCT dataset (279,000 prompt attacks with outcome-based success labels) is publicly available at huggingface.co/datasets/Lakera/gandalf-rct. Running your application against a representative sample provides a baseline comparison against the Gandalf defense configurations. The dataset covers social engineering, roleplay-based extraction, encoding tricks, multi-step manipulation, and indirect extraction, mapping to the attack categories most relevant for production injection testing.

The foundational methodological principle from the Gandalf the Red research: measure whether the attack succeeded (did it change the model’s output or actions in the way the attacker intended?), not whether the attack prompt looked adversarial. Intent-based filtering that blocks prompts that look dangerous systematically misses attacks that succeed while looking benign.

For agentic applications, the AgentDojo benchmark (Debenedetti et al., NeurIPS 2024) tests 97 agent tasks across 629 injection scenarios and measures task completion rate and injection resistance simultaneously. The D-SEC framework from Gandalf the Red formalizes this as a joint optimization objective. The three defense configurations with the best security-utility profiles: restricted application domain, defense-in-depth (system prompt plus output-level auditing), and adaptive defenses (session-level detection, not just per-turn). Testing these against your application, measuring both injection success rates and task performance, produces the data needed to choose a configuration appropriate for your risk tolerance.

The b3 Benchmark for Backbone Evaluation

The b3 benchmark (Bazinska et al., 2025) provides standardized security evaluation across 31 backbone models using threat snapshots that isolate backbone behavior at specific decision points independently of the scaffolding. Key findings for backbone selection: reasoning-capable models are more secure than base models, model size does not predict security, and open-weight models are closing the gap with closed frontier models faster than expected.

B3 test scenarios cover chat, document processing, tool invocation, memory manipulation, code execution, and file processing. For any agentic application using these interaction patterns, the relevant b3 scenarios provide a backbone security estimate before deployment. The benchmark is available at github.com/lakeraai/b3. The Julia Bazinska research profile covers the b3 methodology and partial credit scoring system in detail.

Testing the MCP Attack Surface

For applications using MCP servers, standard injection testing tools miss the tool description poisoning surface entirely. MCP security testing covers three areas. First, audit every tool description in every connected MCP server for injected instructions, including sections truncated in the IDE’s tool panel display and content after long legitimate sections. Second, verify MCP server configurations are version-pinned with change notifications on modification. Third, test whether modified tool descriptions would silently propagate without re-triggering user approval.

The MCP server security analysis covers the tools/list mechanism and both CVEs (MCPoison and CurXecute) as concrete test cases for this surface.

Supply Chain Verification

Supply chain testing is a verification audit. For each model, adapter, and agent skill file, verify: SHA-256 checksums against known-good values published through channels separate from the model repository; publisher identity against the intended organization (checking for typosquatting and namespace re-registration); and agent skill file content for embedded instructions in description fields and annotation sections.

The PoisonedSkills paper (arXiv:2604.03081) found DDIPE bypass rates of 11.6% to 33.5% against production agent frameworks including Claude Code. Review skill files as you would audit a third-party system prompt. The LLM supply chain attack analysis covers the ROME technique, namespace attacks, and verification procedures.

What to Measure Beyond Block Rate

Block rate measures what the defense blocked, not what it missed, what it degraded, or whether the blocked attacks were the ones that mattered. Four measurements provide a complete security posture picture.

Injection success rate with adaptive attackers: allow the attacker to iterate based on observed model responses. The Gandalf-RCT data shows adaptive attackers succeed at substantially higher rates than static attackers against the same defenses.

Utility penalty: measure response quality under the active defense configuration versus a baseline without it. The Gandalf the Red paper found system prompt-based defenses degrade utility even when they block nothing. Hidden operational costs that static security testing misses entirely.

False positive rate: how often does the defense flag or block legitimate user requests? High-sensitivity defenses with high false positive rates degrade user experience and may cause users to route around them.

Blast radius per successful injection: for each successful injection in testing, document what the attacker achieved. A defense blocking 90% of attacks but allowing the remaining 10% to achieve full credential exfiltration may be less useful than one blocking 70% while limiting all successful attacks to low-consequence actions.

The OWASP Framework as Test Plan Structure

The OWASP LLM Top 10 for 2025 provides a structured framework for organizing the test plan. Each vulnerability class maps to distinct test cases, and the highest-impact attack chain (LLM01 + LLM06 + LLM05) suggests the remediation priority order.

For a minimal viable red-team engagement, the evidence-based priority order is: LLM06 (Excessive Agency) remediation first (reduces blast radius of all other vulnerabilities), LLM07 (System Prompt Leakage) second (removes attacker reconnaissance capability), and LLM01 (Prompt Injection) adaptive defense third. This sequencing maximizes security improvement per unit of remediation effort and is supported by the Gandalf the Red, b3, and AgentDojo evidence bases.

Red-teaming is not a one-time exercise. The attack surface for LLM applications changes when the model is updated, when new tools are added, when retrieval data sources change, and when the application’s scope expands. The D-SEC framework’s measurement methodology, the b3 benchmark, and the OWASP taxonomy all provide infrastructure for continuous security measurement rather than point-in-time assessment. The goal is a calibrated, continuously updated picture of the application’s risk posture across all three distinct attack surfaces, not a security certification at a single moment in time.

Discover more from My Written Word

Subscribe now to keep reading and get access to the full archive.

Continue reading