LG · Jul 19, 2024

Watermark Smoothing Attacks against Language Models

arXiv:2407.14206v2 · 3 citations · h-index: 4 · Has Code
Originality: Highly original
AI Analysis

The work exposes critical weaknesses in existing watermarking schemes, which poses a problem for AI detection systems. Its contribution is incremental in that it builds on known vulnerabilities.

The paper tackles the vulnerability of AI-generated text watermarks by introducing the Smoothing Attack, a novel removal method that erases watermark traces while preserving text quality, validated on models from 1.3B to 30B parameters across 10 watermarks.

Watermarking is a key technique for detecting AI-generated text. In this work, we study its vulnerabilities and introduce the Smoothing Attack, a novel watermark removal method. By leveraging the relationship between the model's confidence and watermark detectability, our attack selectively smooths the watermarked content, erasing watermark traces while preserving text quality. We validate our attack on open-source models ranging from $1.3$B to $30$B parameters on $10$ different watermarks, demonstrating its effectiveness. Our findings expose critical weaknesses in existing watermarking schemes and highlight the need for stronger defenses.
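
The abstract does not spell out the attack's procedure, so the following is only a toy sketch of the intuition it describes, not the authors' method. It assumes a simple "green-list" watermark that adds a bias `delta` to the logits of selected tokens, uses the top-token probability as a confidence proxy, and stands in the unwatermarked distribution for an attacker's reference model. The names `watermark`, `green_shift`, `smooth_step`, and the `threshold` value are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def watermark(logits, green, delta=2.0):
    """Toy green-list watermark (assumption): add delta to 'green' token logits."""
    return [x + delta if i in green else x for i, x in enumerate(logits)]

def green_mass(probs, green):
    """Total probability assigned to green tokens."""
    return sum(p for i, p in enumerate(probs) if i in green)

def green_shift(logits, green):
    """How much the watermark moves probability mass onto green tokens."""
    before = green_mass(softmax(logits), green)
    after = green_mass(softmax(watermark(logits, green)), green)
    return after - before

def smooth_step(wm_probs, ref_probs, threshold=0.9):
    """Hypothetical selective smoothing: keep the watermarked distribution
    where it is confident, fall back to a reference distribution elsewhere."""
    return wm_probs if max(wm_probs) >= threshold else ref_probs

green = {1, 3}
confident_logits = [10.0, 0.0, 0.0, 0.0]   # model strongly prefers token 0
uncertain_logits = [1.0, 1.0, 1.0, 1.0]    # model has no preference

# The watermark barely changes a confident step but strongly biases an uncertain one.
shift_confident = green_shift(confident_logits, green)
shift_uncertain = green_shift(uncertain_logits, green)

# Smoothing replaces only the uncertain step, where the watermark signal lives.
wm_uncertain = softmax(watermark(uncertain_logits, green))
ref_uncertain = softmax(uncertain_logits)  # stand-in for a reference model
smoothed_uncertain = smooth_step(wm_uncertain, ref_uncertain)

wm_confident = softmax(watermark(confident_logits, green))
smoothed_confident = smooth_step(wm_confident, softmax(confident_logits))
```

In this toy setting the watermark's effect concentrates at low-confidence steps, so replacing only those steps removes most of the green-token bias while leaving confident (quality-critical) steps untouched. A real attack would need an actual reference model and a tuned confidence criterion.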

