CL · May 6

Chainwash: Multi-Step Rewriting Attacks on Diffusion Language Model Watermarks

Mohd Ruhul Ameen, Akif Islam, Nadim Mahmud, Md. Ekramul Hamid

arXiv:2605.05503 · 16.1 · h-index: 17

AI Analysis

For developers deploying watermark-based detection of diffusion LM text, the paper reveals that repeated rewriting is a far stronger attack than single rewrites, undermining current watermark robustness claims.

The paper shows that multi-step rewriting attacks can effectively remove diffusion language model watermarks: after five chained rewrites, detection drops from 87.9% to 4.86%, with 94.76% of originally detected texts no longer flagged.

Statistical watermarking is a common approach for verifying whether text was written by a language model. Most existing schemes assume autoregressive generation, where tokens are produced left to right and contextual hashing is well defined. Diffusion language models generate text by denoising tokens in arbitrary order, so these schemes cannot be applied directly. A recent watermark by Gloaguen et al. addresses this gap for LLaDA 8B Instruct and reports a true positive detection rate above 99%. This paper studies what happens when watermarked text is rewritten not once but several times. Using the same watermark configuration, 1,605 watermarked completions of about 300 tokens each are produced across five WaterBench domains. Each completion is rewritten by four open-weight language models, from 1.5B to 8B parameters, none of which knows the watermark key. Five rewrite styles are tested: paraphrase, humanize, simplify, academic, and summarize expand. Each style is chained for up to five hops, producing 160,500 rewritten texts in total. The watermark is detected on 87.9% of the original outputs at the standard significance threshold. After a single rewrite, detection falls to between 14% and 41% depending on the rewriter and style. After five chained rewrites, detection falls to 4.86%, meaning 94.76% of the originally detected texts are no longer flagged. After three rewrites, the detector score has dropped 86% of the way from its watermarked baseline toward the null distribution. Repeated rewriting is therefore a much stronger attack than a single rewrite, and the result holds across all four rewriters tested.
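The experimental grid described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the rewriter labels, the function names, and the `null_mean` default are hypothetical, and the rewriter is a stand-in callable where a real run would prompt a language model. The sketch shows how chaining produces one output per hop, how the total of 160,500 texts follows from 1,605 completions, four rewriters, five styles, and five hops, and one plausible way to express "dropped 86% of the way toward the null" as a ratio.

```python
from typing import Callable, List

# The five style names come from the abstract; the rewriter labels are
# hypothetical placeholders, since the abstract does not name the four models.
STYLES = ["paraphrase", "humanize", "simplify", "academic", "summarize_expand"]
REWRITERS = ["rewriter_a", "rewriter_b", "rewriter_c", "rewriter_d"]
NUM_COMPLETIONS = 1605
MAX_HOPS = 5


def chain_rewrites(text: str, rewrite: Callable[[str], str],
                   hops: int = MAX_HOPS) -> List[str]:
    """Apply `rewrite` repeatedly and return the output of every hop."""
    outputs = []
    current = text
    for _ in range(hops):
        current = rewrite(current)  # each hop rewrites the previous hop's output
        outputs.append(current)
    return outputs


def total_rewritten_texts() -> int:
    """Count every (completion, rewriter, style, hop) output in the grid."""
    return NUM_COMPLETIONS * len(REWRITERS) * len(STYLES) * MAX_HOPS


def fraction_toward_null(baseline: float, score: float,
                         null_mean: float = 0.0) -> float:
    """Share of the baseline-to-null gap that a detector score has closed.

    Assumes the score is a scalar (for example a z-score) whose expected
    value on unwatermarked text is `null_mean`.
    """
    return (baseline - score) / (baseline - null_mean)


if __name__ == "__main__":
    # Stand-in rewriter; a real experiment would call a language model here.
    hops = chain_rewrites("watermarked text", lambda t: t + " (rewritten)", hops=3)
    print(len(hops))
    print(total_rewritten_texts())
```

Keeping every intermediate hop, rather than only the final text, is what allows detection to be reported after one, three, and five rewrites from a single chain.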
