LLM Watermark Evasion via Bias Inversion
This work addresses a critical vulnerability in LLM watermarking, which matters for detecting AI-generated text. The attack is incremental, however, as it builds on existing evasion concerns.
The paper tackles adversarial evasion of watermarks in large language models by proposing the Bias-Inversion Rewriting Attack (BIRA), which achieves over 99% evasion across recent watermarking methods while preserving semantic content.
Watermarking for large language models (LLMs) embeds a statistical signal during generation to enable detection of model-produced text. While watermarking has proven effective in benign settings, its robustness under adversarial evasion remains contested. To advance a rigorous understanding and evaluation of such vulnerabilities, we propose the \emph{Bias-Inversion Rewriting Attack} (BIRA), which is theoretically motivated and model-agnostic. BIRA weakens the watermark signal by suppressing the logits of likely watermarked tokens during LLM-based rewriting, without any knowledge of the underlying watermarking scheme. Across recent watermarking methods, BIRA achieves over 99\% evasion while preserving the semantic content of the original text. Beyond demonstrating an attack, our results reveal a systematic vulnerability, emphasizing the need for stress testing and robust defenses.
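The logit-suppression step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the likely watermarked tokens are identified or how strongly they are suppressed, so the names `suppress_logits`, `suspect_ids`, and `penalty`, along with the toy logit values, are hypothetical placeholders. The sketch only shows that subtracting a constant from the logits of a suspect token set lowers the probability of sampling those tokens at one decoding step.

```python
import math


def suppress_logits(logits, suspect_ids, penalty=2.0):
    """Return a copy of `logits` with `penalty` subtracted at each suspect id.

    `suspect_ids` stands in for the set of likely watermarked tokens; how that
    set is chosen is an assumption left outside this sketch.
    """
    suspects = set(suspect_ids)
    return [x - penalty if i in suspects else x for i, x in enumerate(logits)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy next-token logits over a four-token vocabulary (illustrative values).
logits = [2.0, 1.0, 0.5, 0.0]
suspect_ids = [0, 2]  # hypothetical "likely watermarked" token ids

before = softmax(logits)
after = softmax(suppress_logits(logits, suspect_ids, penalty=2.0))

# Total probability assigned to the suspect tokens, before and after.
suspect_mass_before = sum(before[i] for i in suspect_ids)
suspect_mass_after = sum(after[i] for i in suspect_ids)
print(suspect_mass_before, suspect_mass_after)
```

In an actual rewriting attack, a bias of this kind would be applied at every decoding step of the paraphrasing model, so the rewritten text drifts away from the suspect tokens while the unsuppressed vocabulary remains available to carry the original meaning.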