CR, CL, LG | Nov 8, 2024

Revisiting the Robustness of Watermarking to Paraphrasing Attacks

arXiv:2411.05277v1 | 28 citations | h-index: 11 | EMNLP
Originality: Incremental advance
AI Analysis

This paper addresses a critical security issue for AI-generated content detection: it exposes a significant vulnerability in current watermarking schemes.

The paper shows that watermarking techniques for language models are vulnerable to reverse-engineering: with only limited access to a black-box watermarked model, paraphrasing attacks become far more effective at evading watermark detection, rendering the watermark ineffective.

Amidst rising concerns about the internet being flooded with content generated by language models (LMs), watermarking is seen as a principled way to certify whether text was generated by a model. Many recent watermarking techniques slightly modify the output probabilities of LMs to embed a signal in the generated output that can later be detected. Since the early proposals for text watermarking, their robustness to paraphrasing has been prominently questioned. More recently, some techniques have been deliberately designed to be robust to paraphrasing, and are claimed to be so. However, such watermarking schemes do not adequately account for the ease with which they can be reverse-engineered. We show that with access to only a limited number of generations from a black-box watermarked model, we can drastically increase the effectiveness of paraphrasing attacks to evade watermark detection, thereby rendering the watermark ineffective.
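
To make "slightly modify the output probabilities ... to embed a signal ... that can later be detected" concrete, here is a minimal sketch of one well-known family of schemes, a green-list logit bias with a z-score detector. The abstract does not name a specific scheme, so this is an illustrative assumption rather than the method the paper evaluates or attacks. All names and constants (GAMMA, DELTA, green_list, watermark_logits, detection_z_score) are hypothetical.

```python
import hashlib
import math
import random

GAMMA = 0.5  # assumed fraction of the vocabulary marked "green" at each step
DELTA = 2.0  # assumed logit bonus added to green tokens


def green_list(prev_token: int, vocab_size: int) -> set[int]:
    """Pseudo-randomly choose the green tokens, seeded by the previous token."""
    digest = hashlib.sha256(str(prev_token).encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(range(vocab_size), int(GAMMA * vocab_size)))


def watermark_logits(logits: list[float], prev_token: int) -> list[float]:
    """Shift the output distribution slightly by boosting green-token logits."""
    green = green_list(prev_token, len(logits))
    return [x + DELTA if i in green else x for i, x in enumerate(logits)]


def detection_z_score(tokens: list[int], vocab_size: int) -> float:
    """z-score of the green-token count against the no-watermark null hypothesis."""
    n = len(tokens) - 1  # number of scored positions (the first token has no context)
    if n < 1:
        raise ValueError("need at least two tokens to score")
    hits = sum(
        tokens[i] in green_list(tokens[i - 1], vocab_size)
        for i in range(1, len(tokens))
    )
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))
```

Under this sketch, unwatermarked text should score near zero, while text sampled from the biased logits scores high. A paraphrase that replaces enough green tokens pulls the score back toward zero, which is why robustness to paraphrasing is the central question.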

Code Implementations: 1 repo
