LLM Watermarking: How Models Embed Detection Signals in Their Outputs


As large language models generate text that is increasingly difficult to distinguish from human writing by style or quality alone, the technical problem of attribution has become urgent: given a document, can you determine whether it was generated by a model, and if so, which one? Watermarking provides one answer. By embedding a hidden statistical signal into the token distribution during generation, watermarking allows a verifier with knowledge of the secret key to detect model-generated text with high reliability. The signal can survive a moderate amount of editing, although, as discussed below, aggressive paraphrasing can remove it.

The use cases span several security-relevant domains. Content provenance tracking connects watermarking to the supply chain problem: a watermarked model is a model whose outputs can be traced to their source. AI detection in high-stakes settings (academic integrity, journalism, legal filings) depends on detection reliability under adversarial paraphrasing. And intellectual property protection for model providers creates an economic incentive to deploy watermarking as a technical complement to legal copyright enforcement.

The Green-Red Token List Scheme

The dominant watermarking technique for autoregressive language models is the green-red token list scheme introduced by Kirchenbauer, Geiping, Wen, Katz, Miers, and Goldstein (ICML 2023). At each position, the scheme partitions the model’s vocabulary into a green list and a red list using a pseudo-random function of a secret key and the preceding tokens. During generation, the model is biased to produce tokens from the green list, leaving a detectable statistical signal in the output.

At detection time, a verifier who knows the secret key recomputes the green-red partition for each position in the text and counts how many tokens fall in the green list. Under the null hypothesis (text was not generated with the watermark), each token falls in the green list with probability equal to the green list’s share of the vocabulary, so with equal-sized lists the fraction of green tokens should be approximately 0.5. Watermarked text will have a significantly higher green fraction, detectable by a one-sided hypothesis test with a false positive rate controlled by the threshold choice.
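The partition and the counting test can be sketched in a few lines. This is a toy illustration, not the reference implementation: the vocabulary size, the key, the SHA-256 seeding of the partition from the previous token, and the uniform-random stand-in for a language model are all assumptions made for the example, and the generator uses the simplest "green tokens only" rule rather than a soft bias.

```python
import hashlib
import math
import random

VOCAB_SIZE = 1000          # toy vocabulary size (assumption)
GAMMA = 0.5                # share of the vocabulary placed on the green list
SECRET_KEY = b"demo-key"   # hypothetical secret key


def green_list(prev_token: int, key: bytes = SECRET_KEY) -> set[int]:
    """Derive the green list for one position from the key and the previous token."""
    digest = hashlib.sha256(key + prev_token.to_bytes(4, "big")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    ids = list(range(VOCAB_SIZE))
    rng.shuffle(ids)
    return set(ids[: int(GAMMA * VOCAB_SIZE)])


def z_score(tokens: list[int], key: bytes = SECRET_KEY) -> float:
    """One-sided z statistic for the number of green tokens in the text."""
    n = len(tokens) - 1  # the first token has no predecessor to seed a partition
    green = sum(
        tokens[i] in green_list(tokens[i - 1], key) for i in range(1, len(tokens))
    )
    return (green - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))


def p_value(z: float) -> float:
    """Probability of a z statistic at least this large under the null hypothesis."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def toy_generate(length: int, watermark: bool, seed: int = 0) -> list[int]:
    """Stand-in for a model: uniform random tokens, optionally green-only."""
    rng = random.Random(seed)
    tokens = [rng.randrange(VOCAB_SIZE)]
    for _ in range(length):
        if watermark:
            tokens.append(rng.choice(sorted(green_list(tokens[-1]))))
        else:
            tokens.append(rng.randrange(VOCAB_SIZE))
    return tokens


if __name__ == "__main__":
    for label, marked in (("watermarked", True), ("plain", False)):
        z = z_score(toy_generate(200, watermark=marked))
        print(f"{label}: z = {z:.2f}, p = {p_value(z):.2e}")
```

With 200 generated tokens that are all green, the statistic is 100 / sqrt(50), roughly 14.1, while unwatermarked text should produce a value near zero. A verifier chooses the z threshold to set the false positive rate.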

The scheme’s key property is that the bias is soft rather than absolute. Red-list tokens are not forbidden; green-list tokens simply receive a boost (a constant added to their logits in the original paper). Because the partition is pseudo-random, any green list contains some tokens the model already considers likely, and when the model is confident about the next token, that token is usually still selected even if it falls on the red list. The bias shifts the generation distribution without fundamentally changing what the model can say, although the scheme is not strictly distribution-preserving. At low-entropy positions, where one token dominates, the watermark is essentially invisible to human readers.

At high-entropy positions (where the model is uncertain and the green-red partition strongly influences which token is selected), the watermark introduces a more noticeable shift. This is the watermark’s fundamental tradeoff: stronger watermarks (larger bias toward green tokens) are more detectable and more resilient to editing, but more visibly alter the model’s output distribution for uncertain positions. The bias parameter controls this tradeoff.
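A small numeric sketch makes the tradeoff concrete. It assumes the bias is implemented as a constant `delta` added to the logits of green tokens before the softmax; the four-token vocabulary, the logit values, and the function name are illustrative only.

```python
import math


def apply_bias(logits: list[float], green: set[int], delta: float) -> list[float]:
    """Add delta to green-token logits, then return softmax probabilities."""
    biased = [x + delta if i in green else x for i, x in enumerate(logits)]
    top = max(biased)
    exps = [math.exp(x - top) for x in biased]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    green = {1, 2}  # tokens 0 and 3 are red in this toy partition

    # Uncertain position: all four tokens start out equally likely.
    flat = apply_bias([0.0, 0.0, 0.0, 0.0], green, delta=2.0)
    print("green mass, uncertain position:", round(flat[1] + flat[2], 3))

    # Confident position: the model strongly prefers token 0, which is red.
    peaked = apply_bias([10.0, 0.0, 0.0, 0.0], green, delta=2.0)
    print("red favourite, confident position:", round(peaked[0], 4))
```

At the uncertain position the green tokens move from half of the probability mass to about 88 percent. At the confident position the red favourite still keeps more than 99.9 percent of the mass, so the output is effectively unchanged and that position contributes little to detection.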

The EMS Scheme: Exponential Minimum Sampling

Scott Aaronson (2023) proposed the Exponential Minimum Sampling (EMS) watermarking scheme as an unbiased alternative to the green-red list approach. Li, Liu, and Li (2025, Stat, doi:10.1002/sta4.70118) describe the EMS mechanism: to generate the i-th token, the scheme draws an independent value xi_ik from Uniform[0,1] for each token k in the vocabulary, using pseudo-random numbers that the holder of the secret key can reproduce. The token y_i is then chosen as the argmin over k of the ratio (-log xi_ik) / p(k | preceding context), where p(k | preceding context) is the model’s probability for token k given the current sequence.
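The sampling rule translates almost directly into code. The sketch below is a toy under stated assumptions: the uniforms are derived from a hypothetical key and the token position via SHA-256 (real schemes may seed them differently), the "model" returns the same distribution at every step, and the detection score, the average of -log(1 - xi) for the chosen tokens, is one commonly described choice rather than the statistic of any particular paper.

```python
import hashlib
import math
import random


def keyed_uniforms(key: bytes, position: int, vocab_size: int) -> list[float]:
    """Pseudo-random xi_{i,k} in (0, 1), reproducible by anyone holding the key."""
    digest = hashlib.sha256(key + position.to_bytes(8, "big")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    values: list[float] = []
    while len(values) < vocab_size:
        u = rng.random()
        if u > 0.0:  # skip an exact zero so log(u) is always defined
            values.append(u)
    return values


def ems_sample(probs: list[float], xi: list[float]) -> int:
    """Return argmin_k (-log xi_k) / p_k over tokens with non-zero probability."""
    candidates = [k for k, p in enumerate(probs) if p > 0]
    return min(candidates, key=lambda k: -math.log(xi[k]) / probs[k])


def ems_generate(probs: list[float], length: int, key: bytes) -> list[int]:
    """Toy generator: the same next-token distribution is used at every step."""
    return [
        ems_sample(probs, keyed_uniforms(key, i, len(probs))) for i in range(length)
    ]


def mean_detection_score(tokens: list[int], key: bytes, vocab_size: int) -> float:
    """Average of -log(1 - xi_{i, y_i}); close to 1 for text unrelated to the key."""
    scores = [
        -math.log(1.0 - keyed_uniforms(key, i, vocab_size)[y])
        for i, y in enumerate(tokens)
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    key = b"demo-key"                 # hypothetical secret key
    probs = [1 / 50] * 50             # toy uniform distribution over 50 tokens
    marked = ems_generate(probs, 200, key)
    rng = random.Random(1)
    plain = [rng.randrange(50) for _ in range(200)]
    print("watermarked:", round(mean_detection_score(marked, key, 50), 2))
    print("plain:      ", round(mean_detection_score(plain, key, 50), 2))
```

Tokens chosen by the rule tend to have xi close to 1, which inflates the score. For text produced without the key, each term behaves like an Exponential(1) draw, so the average stays near 1.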

The unbiased property follows from a property of the exponential distribution: the probability that token k is selected under EMS equals the model’s original probability p(k | preceding context). The watermark is embedded not by changing which tokens are likely, but by using the secret random samples to create a detectable correlation between the text and the key.
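For readers who want the missing step, the claim follows from the standard "competing exponentials" argument. Writing p_k as shorthand for p(k | preceding context), and treating the xi_ik as independent uniforms (which is what the pseudo-random generator is meant to emulate):

```latex
\begin{aligned}
E_k &= -\log \xi_{ik} \sim \mathrm{Exp}(1)
  \quad\Longrightarrow\quad
  T_k = \frac{E_k}{p_k} \sim \mathrm{Exp}(\text{rate } p_k), \\[4pt]
\Pr[y_i = k]
  &= \Pr\Bigl[\,T_k < \min_{j \neq k} T_j\Bigr]
   = \int_0^\infty p_k e^{-p_k t} \prod_{j \neq k} e^{-p_j t} \, dt \\[4pt]
  &= p_k \int_0^\infty e^{-t \sum_j p_j} \, dt
   = \frac{p_k}{\sum_j p_j}
   = p_k .
\end{aligned}
```

The last equality uses the fact that the model’s probabilities sum to one. Averaged over the key, the sampled token therefore has exactly the model’s distribution, while for any fixed key the choice is fully determined by the xi values.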

The practical advantage of EMS is that the generated text is distributionally identical to text generated without the watermark. This makes the watermark harder to spot for adversaries who try to identify watermarks by finding statistical deviations from the model’s natural distribution. The disadvantage is that EMS requires access to the full vocabulary probability distribution at each step.

The ITS Method: Inverse Transform Sampling

Kuditipudi, Thickstun, Hashimoto, and Liang (2023) introduced the Inverse Transform Sampling (ITS) watermarking method, which, like EMS, preserves the model’s output distribution but uses a different sampling rule and a different correlation statistic for detection. Li et al. (2025, doi:10.1002/sta4.70118) formalize the ITS detection statistic, which measures whether the generated tokens correlate with the random key in a way that is improbable under random chance.

Building on these schemes, Li et al. (2025) extend watermark detection from the binary problem (is this text watermarked?) to the segmentation problem (which substrings of this text are watermarked?). They formulate segmentation as a change-point detection problem, using adaptive test statistics to identify watermarked substrings within mixed text that contains both watermarked and non-watermarked content. This is practically important for documents that mix human-written and AI-generated sections.
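To give a feel for the change-point view, the sketch below locates a single boundary in a sequence of per-token watermark scores using a textbook CUSUM-style statistic. It is a deliberately simple stand-in and not the adaptive statistic of Li et al.; the function name and the idealised score sequence are assumptions for illustration.

```python
import math


def best_change_point(scores: list[float]) -> tuple[int, float]:
    """Find the split t maximising the scaled gap between mean(scores[:t]) and mean(scores[t:])."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to split")
    total = sum(scores)
    best_t, best_stat = 1, -1.0
    prefix = 0.0
    for t in range(1, n):
        prefix += scores[t - 1]
        left_mean = prefix / t
        right_mean = (total - prefix) / (n - t)
        # The sqrt factor down-weights splits that leave very few tokens on one side.
        stat = abs(left_mean - right_mean) * math.sqrt(t * (n - t) / n)
        if stat > best_stat:
            best_t, best_stat = t, stat
    return best_t, best_stat


if __name__ == "__main__":
    # Idealised scores: 50 unmarked tokens (score 0) followed by 50 marked tokens (score 1).
    scores = [0.0] * 50 + [1.0] * 50
    t, stat = best_change_point(scores)
    print(f"estimated boundary at token {t}, statistic {stat:.2f}")
```

Real per-token scores are noisy, so the estimated boundary is only approximate, and a document with several watermarked spans needs a method that can find more than one change point.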

Li et al.’s (2025, doi:10.1002/sta4.70118) adaptive framework extends the likelihood-based LLM detection method in two ways: it introduces a flexible weighted formulation, and it removes the need for the precise prompt estimation that makes previous segmentation methods fragile. The authors report extensive numerical experiments in which the methodology segments texts containing a mixture of watermarked and non-watermarked content effectively and accurately.

Attacks on Watermarks: Paraphrasing, Translation, and Obfuscation

The security of any watermarking scheme depends on its resilience to adversarial manipulation. An attacker who knows a watermark has been embedded will attempt to remove it while preserving the text’s meaning. The three main attack classes are paraphrasing (rewriting the text in different words), translation (converting to another language and back), and character-level manipulation.
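For a counting-based detector such as the green-red test, the effect of these attacks can be expressed as simple arithmetic: anything that pushes the observed green fraction back toward its chance level shrinks the z statistic, and shorter texts shrink it further. The sketch below assumes equal-sized lists (chance level 0.5) and a detection threshold of z = 4; both values, and the example green fractions, are illustrative assumptions.

```python
import math


def z_for_green_fraction(green_fraction: float, n_tokens: int, gamma: float = 0.5) -> float:
    """z statistic implied by an observed green fraction over n_tokens scored tokens."""
    return (green_fraction - gamma) * math.sqrt(n_tokens) / math.sqrt(gamma * (1 - gamma))


if __name__ == "__main__":
    threshold = 4.0  # assumed detection threshold
    for fraction, n in ((0.9, 200), (0.6, 200), (0.6, 800)):
        z = z_for_green_fraction(fraction, n)
        verdict = "detected" if z > threshold else "missed"
        print(f"green fraction {fraction}, {n} tokens: z = {z:.2f} ({verdict})")
```

A 200-token passage with 90 percent green tokens is detected comfortably (z of about 11.3). If rewriting drags the fraction down to 60 percent, the same length falls below the threshold (z of about 2.8), and only a much longer sample would recover the signal.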

Ardito (2024, New Directions for Teaching and Learning, doi:10.1002/tl.20624) notes that while researchers have proposed watermarking approaches that are more resilient to standard attacks than simple alternating-word-list schemes, the fundamental vulnerability remains: a recent paper proved that no watermark is immune to obfuscation through paraphrasing (Zhang et al., 2023). In practical terms, the result says that any watermarking scheme that embeds statistical signals in token choice can, under the paper’s assumptions, be defeated by an adversary willing to perform enough paraphrasing passes through a sufficiently capable paraphrase model.

The practical severity of this vulnerability depends on the context. For casual content creation and academic integrity checking, most users will not run sophisticated paraphrasing attacks, and watermarking provides substantial detection coverage. For adversaries with significant compute and motivation (state-level disinformation campaigns, organized content farms), paraphrasing attacks are feasible and effectively remove most current watermarks.

The resilience-detection tradeoff in watermarking is structurally similar to the security-utility tradeoff in injection defenses documented in the Gandalf the Red research: stronger defenses degrade utility, and a sufficiently motivated attacker can often find a path around them. The value of watermarking lies in raising the cost of attack and providing detection capability for the large majority of non-adversarial cases, not in providing perfect security against all adversaries.

Watermarking as a Supply Chain Tool

The supply chain implications of watermarking build on the LLM supply chain attack analysis. A model whose outputs are watermarked can be identified as the source of specific documents, which has two supply chain security applications.

First, model authentication: if a legitimate model provider watermarks all outputs with a key that third parties can verify, downstream users can verify that outputs they receive actually came from the claimed model and have not been replaced by outputs from a poisoned substitute. This is the model provenance application: watermarks as digital signatures on model outputs, analogous to code signing for compiled software.

Second, output traceability: if an organization’s LLM deployment produces watermarked outputs, and those outputs appear in unexpected places (leaked documents, competitor products, third-party aggregators), the watermark provides forensic evidence of the source. This application is most relevant for organizations whose LLM outputs constitute proprietary intellectual property.

The limitation identified by Ardito (2024) and Zhang et al. (2023) applies here: a sophisticated supply chain attacker who knows the watermarking scheme and has access to a capable paraphrase model can strip the watermark before distributing poisoned outputs. Watermarking is a forensic tool that raises the cost of undetected supply chain compromise, not a cryptographic seal that makes compromise impossible.

Regulatory Context and Disclosure Requirements

The EU AI Act (2024) and emerging regulatory frameworks in the United States, United Kingdom, and China all include provisions for AI content disclosure and provenance. Technical watermarking is one of the mechanisms cited in regulatory guidance as a way to implement disclosure requirements at scale: rather than requiring human labeling of AI-generated content (which is impractical at generation scale), watermarking allows automated disclosure through detection systems that check incoming content against known model signatures.

The gap between regulatory intent and technical capability is real. The EU AI Act’s technical specifications for watermarking are not yet finalized, and no current watermarking scheme achieves both high detection reliability and strong paraphrase resistance simultaneously. Ardito (2024) argues that reliance on detection mechanisms is misaligned with the educational landscape and advocates for strategic shifts toward assessment methods that embrace AI usage, rather than attempting to detect and penalize it. This policy argument applies beyond education: the impossibility result of Zhang et al. (2023) suggests that detection-based compliance strategies will face increasing adversarial pressure as paraphrasing models improve.

Multi-Model Watermarking and Mixed Provenance

A practical complication for watermarking in deployed environments is that content increasingly mixes outputs from multiple models. A document may contain text generated by GPT-5.4 for one section, Claude Sonnet 4.6 for another, and human writing in between. If each model uses its own watermarking scheme with its own key, detecting the full provenance of the document requires running detection for each candidate watermarking scheme. The mixed-watermark detection problem becomes substantially harder than single-watermark detection.
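One consequence of testing a document against many candidate keys is a multiple-comparisons problem: the more detectors are run, the more likely one of them fires by chance. The sketch below converts each detector’s z statistic to a one-sided p-value and applies a Bonferroni correction. The correction is a generic statistical device chosen for this example, not something prescribed by the watermarking papers, and the model names and z values are hypothetical.

```python
import math


def one_sided_p_value(z: float) -> float:
    """Probability of a z statistic at least this large under the null hypothesis."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def attribute(p_values: dict[str, float], alpha: float = 0.05) -> list[str]:
    """Return candidates whose p-value survives a Bonferroni correction."""
    if not p_values:
        return []
    threshold = alpha / len(p_values)  # split the error budget across all tests
    return sorted(name for name, p in p_values.items() if p < threshold)


if __name__ == "__main__":
    # Hypothetical z statistics from three per-model detectors run on one passage.
    z_scores = {"model-a": 5.2, "model-b": 1.9, "model-c": 0.3}
    p_values = {name: one_sided_p_value(z) for name, z in z_scores.items()}
    print(attribute(p_values))
```

In this example only "model-a" survives the correction. A z value of 1.9 would pass an uncorrected 5 percent test on its own, but it does not clear the stricter threshold once three detectors share the error budget.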

Li et al. (2025, doi:10.1002/sta4.70118) identify this as an important direction for future research: scenarios in which published texts contain mixed watermarked content generated by different LLMs employing distinct watermarking schemes introduce the challenge of segmenting text into substrings attributable to different LLMs, each characterized by its own sequence of keys, p-values, and change points. Algorithms capable of aggregating and analyzing such results represent a promising research avenue.

The practical deployment implication is that watermarking standards will need cross-model coordination. The C2PA (Coalition for Content Provenance and Authenticity) standard provides one approach: metadata-based provenance assertion that operates at the document level rather than the token level. C2PA-style approaches do not require watermark detection within the content itself; they require signed metadata attached to the content. The tradeoff is that metadata can be stripped while content survives, making C2PA more vulnerable to deliberate provenance erasure than embedded watermarking.
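The difference between the two approaches can be shown with a toy signed-metadata record. This is not the C2PA manifest format: it uses an HMAC from the Python standard library as a stand-in for the certificate-based signatures a real provenance system would use, and the field names, key, and generator label are hypothetical.

```python
import hashlib
import hmac
import json
from typing import Optional


def make_manifest(content: str, generator: str, signing_key: bytes) -> dict:
    """Bind a provenance claim to the content via its hash, then sign the claim."""
    claim = {
        "generator": generator,
        "sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
    }
    payload = json.dumps(claim, sort_keys=True).encode("utf-8")
    signature = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
    return {"claim": claim, "signature": signature}


def verify_manifest(content: str, manifest: Optional[dict], signing_key: bytes) -> bool:
    """True only if a manifest is present, correctly signed, and matches the content."""
    if manifest is None:
        return False  # stripped metadata leaves nothing to check
    payload = json.dumps(manifest["claim"], sort_keys=True).encode("utf-8")
    expected = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return (
        hmac.compare_digest(expected, manifest["signature"])
        and manifest["claim"]["sha256"] == content_hash
    )


if __name__ == "__main__":
    key = b"demo-signing-key"  # hypothetical key
    text = "Generated paragraph."
    manifest = make_manifest(text, "example-model", key)
    print(verify_manifest(text, manifest, key))         # intact record
    print(verify_manifest(text + "!", manifest, key))   # content edited
    print(verify_manifest(text, None, key))             # metadata stripped
```

The record verifies only while it stays attached to unmodified content. Dropping it costs an attacker nothing and leaves the text itself untouched, whereas an embedded watermark travels with the tokens.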

What Watermarking Can and Cannot Do

Watermarking is not a content authenticity guarantee. A watermarked document is one where a specific model generated the text at a specific time with a specific key. It says nothing about whether the content is accurate, whether the content has been selectively edited to remove factual claims, or whether the framing of the content is misleading. Watermarking addresses origin, not veracity.

It also cannot be applied retroactively. Models that were deployed before watermarking was implemented cannot have watermarks inserted into their historical outputs. The large volume of AI-generated content already circulating online without watermarks represents a baseline that detection systems must treat as potentially AI-generated regardless of watermark status.

Within its scope, watermarking is technically mature and practically deployable for most non-adversarial contexts. Li et al.’s (2025) adaptive detection framework provides the current state of the art for mixed-content segmentation, and the EMS and ITS schemes provide unbiased generation with detectable statistical signatures. For organizations deploying LLMs in contexts where output provenance matters, the deployment decision is not whether to watermark but which scheme provides the detection reliability and resilience required for their specific use case and adversarial environment. This deployment decision is ultimately a tradeoff between detection coverage, paraphrase resilience, and the computational overhead of running detection at scale across all incoming content. For the broader context of model provenance and supply chain integrity, see the LLM supply chain attack analysis and the training data memorization analysis.
