Julia Bazinska and the Science of Measurable AI Security

Most AI security research produces claims. Julia Bazinska produces measurements. The distinction sounds minor until you realize that almost every defense deployed in a production LLM application today is backed by claims, not measurements, and that the gap between the two is where real attacks succeed.

Bazinska is a Senior Research Engineer at Lakera AI, the Zurich-based AI security company acquired by Check Point in 2025. She joined Lakera in September 2023 after completing her MSc in Computer Science at ETH Zürich, following a BSc at the University of Warsaw where she served as president of the Machine Learning Society at MIM UW. Between her degrees she interned at Google, IBM, and DeepMind, where she worked on a reinforcement learning library. At Lakera she has contributed to the three research outputs that have done more than any other work to put empirical foundations under LLM security: the Gandalf platform, the Gandalf the Red ICML 2025 paper, and the backbone breaker benchmark (b3), on which she is first author.

Her GitHub is github.com/lamyiowce and her HuggingFace profile at huggingface.co/jb-lakera. The datasets and code her team released are MIT licensed and available to use today.

The Problem She Is Solving

LLM security had a measurement problem before Lakera’s research team started fixing it. The field ran red-teaming exercises, published block rates, and reported “the defense works” without ever asking two questions that matter enormously in practice: does the defense still work when the attacker tries more than once, and what is the defense doing to your legitimate users while it blocks attacks?

The first question is about adaptive attacks. Most published evaluations test a defense once against a fixed adversarial corpus. Real attackers iterate. They send a probe, observe the response, and refine. A defense that blocks a naive first attempt may fail against a tenth attempt from the same attacker who has learned what the model responds to. Static evaluations miss this entirely.

The second question is about utility. Security teams measure attack block rates. Product teams measure user experience. Almost nobody measures both simultaneously for the same defense configuration. This means a security team can ship a defense that reduces attack success by 30% while also reducing response quality for legitimate users by 20%, and nobody notices, because the measurements are running in separate systems with no shared denominator.

Bazinska’s work at Lakera has been to build the infrastructure that makes both measurements possible at scale, and then to produce the empirical results that show what the answers actually are.

Gandalf: Making Crowdsourced Red-Teaming Scientific

Gandalf (gandalf.lakera.ai) started as a hackathon project in 2023. Players try to trick an LLM into revealing a secret password. The game accumulated 80 million data points from over 200,000 unique players. Those players spent a combined 25 years interacting with the platform.

The scale is impressive, but what makes Gandalf scientifically valuable is a design choice that sounds simple but is methodologically significant: success is determined by outcome, not intent. Either the password was extracted or it was not. There is no human annotation of whether a prompt looks adversarial. There is no ambiguity about borderline cases. The success label is automatic, binary, and unambiguous.

This outcome-based labeling is the first methodological contribution of the Lakera research program. It eliminates the false positive problem that plagues intent-based red-teaming datasets. A defense that stops prompts that look adversarial will be measured as successful even if it consistently fails against prompts that succeed without looking suspicious. Outcome-based labeling catches those failures. It measures what actually matters: did the attacker get what they came for?

The second methodological contribution is the randomized controlled trial structure of Gandalf-RCT, the experimental subset used in the ICML 2025 paper. Players were randomly assigned to defense conditions across three application setups. Random assignment is what creates causal estimates rather than correlational ones. It means the paper can say “defense A is more effective than defense B” rather than “we observed better outcomes when defense A was present, but we cannot rule out that more capable attackers chose condition B.” This distinction matters: most red-teaming papers cannot make causal claims. Gandalf-RCT can.

Gandalf the Red: What the 279K Dataset Found

The ICML 2025 paper “Gandalf the Red: Adaptive Security for LLMs” (Pfister, Volhejn, Knott, Bazinska et al., arXiv:2501.07927) used the Gandalf-RCT data alongside synthetic benign user data to test something most security evaluations ignore: what defenses do to legitimate users.

The paper’s most practically important finding is that system prompt-based defenses reduce the quality and length of responses to benign user queries even when the defense never triggers a block. The model does not refuse the legitimate request. It responds. But security-focused language in the system prompt shifts the model’s global response distribution toward shorter, more conservative outputs across all interactions, not only adversarial ones. The defense is quietly degrading every session, not just the attack attempts.

The paper formalizes this through D-SEC (Dynamic Security-Utility Threat Model), a three-party framework covering developer, attacker, and user. D-SEC expresses the security-utility trade-off as an optimizable objective, rather than treating it as an unmeasured side effect of defense choices. It gives developers a principled way to choose how much usability they are willing to trade for how much security, and to verify that the trade-off is what they intended.

Three defense strategies emerged with strong security-utility profiles from the empirical analysis: restricting the application domain (narrowing what the LLM is allowed to do), defense-in-depth (combining a system prompt defense with a separate output-level checker), and adaptive defenses (blocking users after a session-level threshold of suspicious prompts rather than per-turn only). Each addresses a different layer of the attack surface. None of them is sufficient alone. Together, they provide substantially better security than system prompt hardening alone, at lower utility cost. The full technical breakdown of these mechanisms, with the empirical data from each defense configuration tested, is in the companion analysis of the Gandalf the Red paper.

The full dataset (279,000 prompt attacks) and the D-SEC code are MIT licensed at huggingface.co/datasets/Lakera/gandalf-rct and github.com/lakeraai/dsec-gandalf.

Gandalf: Agent Breaker

The original Gandalf game tests a single behavior: can the attacker extract a password from a chatbot. That is a useful starting point, but it does not capture the attack surface of an AI agent that can call APIs, read files, write to databases, and send messages on behalf of a user. Bazinska was central to designing and launching Gandalf: Agent Breaker, which extended the game to ten realistic agentic application scenarios.

The ten scenarios in Agent Breaker cover chat-based interactions, code execution, file processing, memory manipulation, external tool usage, and multi-step workflows. Each scenario has multiple difficulty levels and layered defenses. The game simulates real-world agent behavior more faithfully than the original Gandalf, because the attack surfaces in agentic systems are qualitatively different from those in stateless chatbots. An attacker in an agentic setting is not trying to extract a static secret. They are trying to redirect the agent’s actions toward outcomes the developer did not intend.

The game generated 194,331 unique crowdsourced attack attempts before the research team extracted the curated benchmark dataset. That dataset became the empirical foundation for b3. The nature of these agentic attacks, how they differ from simple password-extraction attempts, and what architectural defenses can limit them connects to the indirect prompt injection attack surface and to the excessive agency vulnerability that makes successful injections dangerous in agentic settings.

b3: Breaking Agent Backbones

“Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents” (Bazinska, Mathys, Casucci, Rojas-Carulla, Davies, Souly, and Pfister, 2025) is Julia Bazinska’s first-author paper and the most direct expression of the measurement methodology she has been building since joining Lakera.

B3 is built around a concept Bazinska and her co-authors call threat snapshots. The problem with evaluating agent security by testing full agent workflows is that agents are complex systems and failures can occur anywhere in the system stack, not only in the backbone LLM. A full-workflow evaluation cannot isolate whether it is the LLM that failed or the scaffolding, the tooling, or the orchestration logic. Threat snapshots solve this by zooming into the exact moments where the backbone LLM makes a decision under adversarial pressure, testing that decision point in isolation from everything else around it.

Each threat snapshot is a freeze-frame of an agent under attack at a specific decision point. The backbone receives input that includes an adversarial element (a malicious instruction embedded in a document, a phishing payload in a tool response, a prompt injection in a web page being processed) and the test measures whether the backbone produces a safe or unsafe output at that moment. The snapshot is small, fast to run, reproducible, and comparable across different backbone models. This is what makes b3 a benchmark rather than a red-teaming exercise: its results are standardized enough to compare GPT-4o against Claude Sonnet against Llama-3-70B under identical conditions.

The b3 benchmark combines 10 threat snapshots with a curated dataset of 19,433 adversarial attacks, selected from the 194,331 generated by Agent Breaker players. The selection prioritized diversity and difficulty, including successful attacks that bypassed the most capable defense configurations available during testing.

Initial testing across 31 popular LLMs produced three findings that none of the prior agentic security evaluations had produced empirically. Reasoning-capable models (those fine-tuned to reason step by step before producing output) are measurably more secure than base models at the backbone level. Model size, measured in parameters, does not predict security. A larger model is not a more secure model. And open-weight models are closing the security gap with closed frontier models faster than most observers anticipated, which has implications for deployment decisions where open-weight models are attractive for cost or privacy reasons.

B3 was released jointly with Check Point and the UK AI Safety Institute (AISI) on October 28, 2025, under MIT license at github.com/lakeraai/b3.

The Partial Credit Measurement Innovation

One of the most underappreciated innovations in the Agent Breaker and b3 methodology is the shift from binary attack labeling to continuous scoring. Agent Breaker scores each attack attempt on a 0-100 scale measuring how much of the attacker’s objective was achieved rather than treating success as a binary event. An attack that caused the agent to begin executing a harmful action but was interrupted partway through is not the same as an attack that was blocked immediately. The continuous score captures that distinction.

For the b3 dataset curation, Bazinska’s team selected attacks scoring 75 or above on this scale, meaning attacks that achieved at least 75% of their adversarial objective in the Agent Breaker game. These attacks were then re-evaluated across all seven backbone LLMs in the b3 benchmark. This selection methodology ensures the benchmark tests models against attacks with high and consistent adversarial impact, not edge cases that occasionally produce a lucky full success.

The practical significance of this scoring approach extends beyond dataset curation. Binary attack labeling gives defenders a misleading picture of their actual security posture. A model that reduces the average attack score from 80 to 35 has cut the attacker’s expected outcome in half, a meaningful security improvement, even if the binary block rate has not changed. Conversely, a model that blocks 95% of attacks but allows the remaining 5% to reach scores of 95+ may be more dangerous in production than a model that blocks 80% of attacks but caps the remaining 20% at scores below 30. The binary metric misses this entirely. The continuous score surfaces it.

This is another expression of the outcome-based measurement philosophy that runs through all of Bazinska’s work: measure what the attacker actually achieved, not what the attack looked like. The scoring system is an extension of the same insight that makes outcome-based labeling (did the password get extracted?) more informative than intent-based labeling (did the prompt look adversarial?).

The Methodological Thread

Looking across Bazinska’s three major research contributions at Lakera, the methodological principles are consistent. Measure outcomes, not intent. Use randomized assignment to enable causal claims, not just correlational observations. Measure security and utility jointly, not separately. Design evaluations that isolate the variable you actually care about (the backbone, the session-level behavior, the specific defense mechanism) rather than testing the entire system at once and attributing effects you cannot trace.

These principles are not exotic. They are the standard toolkit of empirical science. What is notable is that they were largely absent from LLM security research before this work appeared. The field was producing red-teaming reports that sounded like experiments but lacked the structure needed to make defensible causal claims. Bazinska’s contributions, collectively, have pushed LLM security toward methodological standards that empirical ML research uses everywhere else.

The RL background from DeepMind is visible in this work. Reinforcement learning research has a long tradition of careful environment design to enable valid empirical claims, and a corresponding tradition of skepticism toward evaluation protocols that allow confounds to masquerade as results. The Gandalf-RCT design, the outcome-based labeling, the threat snapshot isolation methodology, and the partial credit scoring all reflect that tradition applied to a new domain. RL researchers learned decades ago that evaluation environments must be carefully designed to produce valid estimates of policy quality. Bazinska is applying the same discipline to evaluation environments for security policies.

What Is Still Open

The work Bazinska has published leaves clearly marked open problems. The Gandalf the Red paper’s synthetic benign user data is an acknowledged limitation: utility penalties are measured against a synthetic baseline, not a real user population. Closing this gap requires either collecting real user data (which has its own privacy complications for a security research platform) or developing better methods for synthetic user simulation that more faithfully reproduce the query distribution of legitimate users in diverse application domains.

The b3 benchmark’s threat snapshot approach isolates the backbone deliberately, but this means it does not measure security failures that occur at the scaffolding or tool layer. Those failures exist and matter. A backbone that resists every prompt injection may still be compromised by a tool that returns adversarial content the backbone processes without suspicion. A complete agentic security evaluation framework will eventually need to cover both the backbone layer and the scaffolding layer. The b3 work establishes the backbone piece; the scaffolding piece is a stated open problem.

The attack taxonomy from the Gandalf the Red paper and b3 was constructed from attacks against specific application types with specific defense configurations. Whether this taxonomy generalizes to the full diversity of LLM application types, and whether the success rates per category are stable across different backbone models, is not yet established empirically. These are tractable research questions that the public datasets now make possible to answer without access to Lakera’s proprietary infrastructure.

Why This Matters for the Field

AI security is at an early stage where almost all the work being done is proprietary, undisclosed, or anecdotal. Companies run internal red teams whose results they do not publish. Vendors publish marketing claims about block rates with no reproducible methodology. The published academic literature is dominated by synthetic evaluations with clean threat models that do not reflect what production attackers actually do.

The Lakera research program, and Bazinska’s contributions to it specifically, represent a different approach: build the empirical infrastructure first, run the experiments rigorously, publish the data and code under open licenses, and let the field verify and build on the results. The Gandalf-RCT dataset, the D-SEC codebase, and the b3 benchmark are all public, reproducible, and usable today. Any team building an LLM application can run b3 against their backbone model and get a standardized security score that is comparable to published baselines on 31 other models.

That is what turning AI security into a measurable science looks like in practice. It starts with getting the measurement methodology right, which is the harder problem. Once the measurements are valid, everything else follows. The numbers can be improved, the defenses can be iterated on, and progress can be tracked. Without valid measurements, you are iterating on claims, which is a much slower and less reliable process.

For the broader context on where these vulnerabilities sit in the full LLM security landscape, the OWASP LLM Top 10 for 2025 maps the attack surfaces Bazinska’s work measures to a standard vulnerability taxonomy that practitioners can use for threat modeling. For the companion technical analysis of the Gandalf the Red framework, the Gandalf the Red deep-dive covers the D-SEC mechanism and what the 279,000-attack dataset found about adaptive defenses. For the reporting on the attack landscape her research is designed to defend against, see the coverage of RAG poisoning in clinical systems and MCP server prompt injection attacks.

Julia Bazinska and the Science of Measurable AI Security

The Problem She Is Solving

Gandalf: Making Crowdsourced Red-Teaming Scientific

Gandalf the Red: What the 279K Dataset Found

Gandalf: Agent Breaker

b3: Breaking Agent Backbones

The Partial Credit Measurement Innovation

The Methodological Thread

What Is Still Open

Why This Matters for the Field

Like this:

More posts

LLM Excessive Agency: Why Every Tool Your Agent Has Is a Risk

OWASP LLM Top 10 for 2025: The Mechanism Behind Each Vulnerability

Indirect Prompt Injection: The Attack That Hides in Your Data