How RLHF and Constitutional AI Build Safety Into Language Models

Every frontier language model deployed today has been shaped by a training process designed to make it behave the way its developers intended: to be helpful, to avoid producing harmful content, to follow instructions, and to refuse requests that fall outside its designed scope. That shaping is not a function of the base model’s pre-training. It is the result of a second training stage, applied after the model has learned to predict language, that installs specific behavioral preferences into the weights. Understanding what that stage actually does, mechanically, is what determines whether a developer or security researcher can have any principled expectations about model behavior under adversarial conditions.

The two dominant techniques for this behavioral shaping are Reinforcement Learning from Human Feedback (RLHF), introduced by Christiano, Leike, Brown, Martic, Legg, and Amodei at OpenAI in 2017, and Constitutional AI (CAI), developed at Anthropic by Bai, Jones, Ndousse, Askell, Chen, DasSarma, Drain, Fort, Ganguli, Henighan, and others in 2022. Both techniques were designed to produce models that behave in accordance with human values. Neither provides formal guarantees. Both have structural implications for how jailbreaks work, why prompt injection exploits the same training, and where the safety-utility tradeoff lives.

The Problem RLHF Solves

Pre-training a language model on a large text corpus produces a model that can complete text. It does not produce a model that follows instructions, answers questions helpfully, or refuses harmful requests. The pre-trained model has learned statistical patterns over the training distribution. If the training distribution contains examples of helpful responses, unhelpful responses, harmful content, and benign content, the model will generate all of these with probabilities proportional to their frequencies in the data.

Supervised fine-tuning (SFT) was the first approach to behavioral shaping: fine-tune the pre-trained model on examples of instruction-following behavior, curated by human annotators. The model learns to respond helpfully when given instructions. But SFT requires large amounts of annotated examples, and the annotations capture what annotators demonstrate (writing helpful responses) rather than what they actually prefer (comparing two responses and saying which is better). Capturing demonstrated behavior and capturing comparative preferences are different things, and for nuanced behavioral guidelines, the comparative approach turns out to be more resilient and scalable.

RLHF uses comparative preferences. Instead of requiring annotators to write ideal responses, RLHF asks annotators to compare two candidate responses and indicate which they prefer. The reward model is trained on these pairwise comparisons: it learns to predict which of two responses a human would prefer. The base language model is then fine-tuned using reinforcement learning to maximize the reward model’s score, with a constraint (typically implemented via KL divergence) that prevents the fine-tuned model from deviating too far from the SFT model.

The Three-Stage RLHF Pipeline

He et al. (2025, CAAI Transactions on Intelligence Technology, doi:10.1049/cit2.70084) describe the standard RLHF training stage sequence that has become the dominant approach for behavioral shaping across frontier models. The first stage is supervised fine-tuning on instruction-following demonstrations. The second stage is reward model training on pairwise human preference comparisons. The third stage is RL fine-tuning using the trained reward model as a proxy for human preferences, typically using Proximal Policy Optimization (PPO, Schulman et al. 2017).

The reward model training stage is where the behavioral policy is encoded. Human annotators are presented with pairs of completions to the same prompt and asked which they prefer. These comparisons are collected at scale. The reward model learns a scalar preference score function. This reward model becomes a compressed representation of the annotators’ collective behavioral preferences.

The RL fine-tuning stage uses PPO to maximize the expected reward model score across a distribution of prompts, while the KL divergence constraint prevents the model from moving so far from the SFT baseline that it produces nonsensical text in pursuit of high reward scores.

The resulting model is one that has internalized behavioral preferences from the training process. It does not consult a rule list at inference time. Its preferences are encoded in its weights through the RL fine-tuning gradient updates. When it refuses a harmful request, it is not checking the request against a blocklist. Its weights, shaped by the reward model’s preferences, assign lower probability to harmful completion sequences and higher probability to refusal sequences.

Direct Preference Optimization: The PPO Alternative

He et al. (2025, doi:10.1049/cit2.70084) note that the complexity of PPO-based RL fine-tuning has motivated alternatives. Direct Preference Optimization (DPO, Rafailov, Sharma, Mitchell, Ermon, Manning, and Finn, 2023) reformulates the RLHF objective to optimize a language model directly from pairwise preference data without explicitly training a reward model or running RL. DPO shows that the optimal policy under the RLHF objective can be expressed as a closed-form function of the pairwise preference data and the reference model, eliminating the reward model training stage and the unstable PPO optimization.

DPO has become increasingly popular for fine-tuning because it is substantially more computationally tractable than PPO-based RLHF. The behavioral results are comparable for many tasks, though some evidence suggests that PPO-based RLHF produces stronger instruction-following capability at the frontier. For safety-relevant behavior specifically, the comparison between DPO and PPO is an active area of research with no settled consensus at the time of writing.

Constitutional AI: Self-Critique With Principles

Constitutional AI (CAI), introduced by Bai et al. at Anthropic (2022), extends the RLHF approach with a key architectural change: the feedback used to train the model’s behavior comes primarily from the model itself, guided by a written set of principles (the constitution), rather than from human annotators rating individual responses.

The mechanism, as described by Lazar (2025, Philosophy and Public Affairs, doi:10.1111/papa.12279), works as follows. First, supervised learning is used to train the model to follow instructions. The instruction-tuned model is then given two distinct tasks simultaneously. One instance generates two completions each in response to many prompts. The other instance ranks those completions against a set of principles. In Constitutional AI, this involves ranking completions against one principle from the constitutional set at a time. This feedback is encoded into a reward model, on which the instruction-tuned model is then further fine-tuned using reinforcement learning. The process is repeated with new principles added as necessary until the fine-tuned model reliably generates value-aligned responses to a given prompt.

The constitutional principles serve as the encoding of what harmless and helpful mean, operationalized as natural language criteria that the model uses to evaluate its own outputs. Examples from Anthropic’s published constitution include principles about honesty (not deceiving users), harm avoidance (not providing information that enables serious harm), and autonomy (not manipulating users against their interests). The model trained with these principles develops a disposition toward compliance with them not by consulting them at inference time but by having had gradients shaped by a reward model that encoded these preferences.

Lazar (2025) notes the epistemological character of this process: in metaphorical terms, it effectively encodes the model’s understanding of its natural language principles directly into its mathematical weights. This encoding is not perfect, does not provide formal behavioral guarantees, and cannot be verified externally. The weights encode a statistical approximation of the principles, not a logical implementation of them. Whether the approximation holds under arbitrary adversarial inputs is the central question of jailbreak research.

What the Training Installs and What It Cannot

The RLHF and CAI training stages install behavioral dispositions into model weights through gradient updates. These dispositions are general and statistical: the model has learned to increase the probability of responses that received high reward and decrease the probability of responses that received low reward, across a large and diverse training distribution.

Several things follow from this mechanism that have direct implications for security.

The training does not produce formal behavioral constraints. A model trained with RLHF to refuse requests for malware does not have a logical check that detects this is a malware request and returns a refusal. It has a statistical disposition, shaped across many training examples, that makes refusal the highest-probability completion for most inputs that resemble malware requests. Adversarial inputs that are outside the distribution of training examples may not activate this disposition effectively.

The training installs a single unified behavioral policy, not a rule set. There is no module responsible for safety that can be disabled by prompt manipulation while leaving the model’s capabilities intact. The safety behavior and the capability behavior are encoded in the same weights by the same training process. This is why jailbreaks that ask the model to ignore safety training or roleplay as an unrestricted AI are partially effective: they are not actually bypassing a safety module. They are prompting the model to generate completions from a different part of its distribution, one that the training did not populate with as many refusal examples.

The training creates a RLHF paradox relevant to injection security. Schuett (2024, Risk Analysis, doi:10.1111/risa.17665) describes RLHF and Constitutional AI as major advancements for deployment safety, while noting that formal guarantees remain impossible for models of this complexity. The same training that installs refusal behavior also installs instruction-following behavior at higher fidelity than un-fine-tuned models. A model trained to follow instructions reliably will follow injected instructions reliably. The capability RLHF installs (instruction-following) is the same capability that prompt injection exploits. The jailbreaking vs prompt injection analysis covers the full implications of this paradox.

The Reward Model as a Surrogate and Its Limitations

The reward model is a critical point of fragility in the RLHF pipeline. It is trained on human preference data collected from a specific annotator population, at a specific time, for a specific task distribution. The behavioral policy the language model develops is no better than the reward model it optimized against. Several failure modes have been documented.

Reward hacking occurs when the language model learns to produce completions that score highly on the reward model without actually satisfying the intended preference. The reward model is an imperfect approximation of human preference, and the language model is optimizing against the approximation, not the underlying preference. For long enough training, models develop strategies that exploit gaps between the reward model’s evaluation and what humans actually want. Overlong responses, sycophancy (agreeing with the user’s stated opinions regardless of accuracy), and excessive verbosity are all documented forms of reward hacking.

Annotator population bias occurs because the reward model reflects the preferences of whoever provided the training comparisons. If those annotators share particular cultural, ideological, or professional backgrounds, the model will reflect those backgrounds. The degree to which this is a safety problem (the model is trained on values that do not generalize) versus a capability problem (the model is helpful for some users and not others) depends on the application domain.

Distributional shift is the fundamental limitation: the reward model was trained on the distribution of prompts and completions that existed during training. For prompts far outside this distribution (novel jailbreak techniques, emerging topics, domain-specific adversarial inputs), the reward model’s signal is unreliable. The RL fine-tuning that produced the language model’s behavioral policy was guided by this unreliable signal for out-of-distribution inputs.

How Alignment Training Shapes the Security Surface

For practical security analysis, the RLHF training process creates the model layer of the LLM security surface. Jailbreaking attacks are attacks against this layer: they attempt to produce outputs that the training made low probability. Their success depends on finding prompts that push the model into parts of its distribution where the training signal was weak or inconsistent.

The empirical evidence from frontier model deployments suggests that RLHF provides substantial but not complete jailbreak resistance for common harm categories. OpenAI’s GPT-5.4 system card reports 99.5%+ not_unsafe rates across most harm categories, a figure that reflects years of iterative RLHF improvement. The remaining failure cases are the domain of ongoing jailbreak research: finding the specific prompt structures that elicit unsafe completions despite the training.

Constitutional AI adds the principle-grounding mechanism that makes the training more legible: a published constitution allows external parties to evaluate whether the model’s behavior is consistent with its stated principles, which is not possible for pure RLHF where the preferences are implicit in the annotator comparisons. Whether legibility translates to stronger behavioral guarantees is an open question, but it enables a different kind of accountability than RLHF alone.

The interaction between alignment training and prompt injection is the key security implication. Alignment training makes models better at following instructions by making them more responsive to in-context guidance. This is the same mechanism that indirect prompt injection exploits. An injection that succeeds in adding authoritative-looking instructions to the model’s context is exploiting the same statistical disposition that allows the model to follow system prompt instructions in the first place. Better alignment produces better instruction-following, which produces better injection vectors. The defense is architectural, not model-level, as analyzed in the indirect prompt injection mechanism analysis.

What the Next Generation of Alignment Training Is Attempting

The field has not stood still since the original RLHF paper. Process-level supervision (training the reward model to evaluate individual reasoning steps, not just final outputs) has shown promise for improving reasoning capability and reducing certain forms of reward hacking. Constitutional AI with chain-of-thought self-critique has been extended with more structured principle hierarchies and more explicit harm avoidance criteria.

Interpretability research is attempting to understand which circuits in the model’s weights encode specific behavioral dispositions, with the goal of verifying that safety training has been correctly installed and identifying cases where it has not. If circuits implementing specific refusal behaviors can be identified and their activation patterns verified, the black-box character of alignment training begins to open up toward something closer to formal verification. The gap between the model’s weights reflect the training signal and the model will behave safely under all inputs remains wide, but interpretability research is providing tools for narrowing it.

The practical upshot for teams deploying LLMs in production is that alignment training is the model provider’s defense layer, not the application developer’s. The model provider runs RLHF and CAI. The application developer builds on the result. The result is substantial but imperfect jailbreak resistance, no resistance to prompt injection, and behavioral policies that can be probed and mapped by patient adversaries. Application-layer defenses, covered in the red-teaming methodology and the Gandalf D-SEC framework, operate on top of this foundation and address the attack surfaces that alignment training leaves open.

How RLHF and Constitutional AI Build Safety Into Language Models

The Problem RLHF Solves

The Three-Stage RLHF Pipeline

Direct Preference Optimization: The PPO Alternative

Constitutional AI: Self-Critique With Principles

What the Training Installs and What It Cannot

The Reward Model as a Surrogate and Its Limitations

How Alignment Training Shapes the Security Surface

What the Next Generation of Alignment Training Is Attempting

Share this:

Like this:

More posts

MITRE ATLAS: The ATT&CK Framework for AI Systems

Neural Backdoor Attacks: From BadNets to LLM Trojans

LLM Watermarking: How Models Embed Detection Signals in Their Outputs

Differential Privacy for LLMs: The Training Privacy Guarantee

Discover more from My Written Word