CL · May 28

Linear Ensembles Wash Away Watermarks: On the Fragility of Distributional Perturbations in LLMs

Zhihao Wu, Gracia Gong, Qinglin Zhu, Yudong Chen, Runcong Zhao

arXiv:2605.3050196.2

AI Analysis

This work identifies a critical vulnerability in AI-text watermarking for model providers and users, showing that current watermarking schemes are easily defeated in multi-model environments.

This paper demonstrates that watermarks in AI-generated text, which perturb output distributions to enable detection, are fundamentally vulnerable when users combine outputs from multiple models. Averaging the output probability distributions of 3-5 models approximately recovers the unwatermarked distribution, suppressing detection z-scores from 5-300 to below 2 and reducing the true positive rate at a 5% false positive rate to below 50%, while improving text quality by 27.5% and running 6 times faster than the best baseline.

Watermarking embeds statistical signatures in AI-generated text for detection and attribution. We reveal a fundamental vulnerability: when users access multiple models (today's reality), watermarks trivially fail. Watermarks perturb output distributions away from the original, and in competitive markets, these perturbations are typically independent across providers. We theoretically prove that averaging output probability distributions recovers the unwatermarked distribution up to a second-order error term. Empirically, simply averaging 3-5 models cancels out these perturbations. We introduce WASH (Watermark Attenuation via Statistical Hybridisation), which solves practical challenges in ensemble generation: vocabulary misalignment and tokenisation differences across heterogeneous models. Experiments across six watermarking schemes and three LLMs show that averaging across 3 models suppresses detection z-scores from 5-300 to below 2 (well under the detection threshold of 4) and reduces TPR at 5% FPR to below 50%, while improving quality by 27.5% and running 6 times faster than the best baseline on long-sequence generation. Our results suggest that robust AI-text detection via watermarking requires either accepting this fundamental vulnerability or unprecedented coordination among model providers.
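
The averaging argument in the abstract can be illustrated with a toy simulation. The sketch below is not WASH and does not reproduce the paper's experiments. It assumes a single shared base distribution, a simple green-list watermark that adds a fixed bias `delta` to a random half of the logits, and independent green lists per provider. The helper names (`watermarked`, `total_variation`) and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of logits.
    z = np.exp(x - x.max())
    return z / z.sum()

def watermarked(logits, rng, delta=2.0, green_frac=0.5):
    # Toy green-list watermark (an assumption, not the paper's schemes):
    # add `delta` to the logits of a randomly chosen "green" subset.
    green = rng.random(logits.shape[0]) < green_frac
    return softmax(logits + delta * green)

def total_variation(p, q):
    # Total variation distance between two discrete distributions.
    return 0.5 * np.abs(p - q).sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=1000)   # hypothetical next-token logits
base = softmax(logits)           # the unwatermarked distribution

# Five providers, each with an independently drawn green list.
models = [watermarked(logits, rng) for _ in range(5)]

tv_single = total_variation(base, models[0])
tv_avg3 = total_variation(base, np.mean(models[:3], axis=0))
tv_avg5 = total_variation(base, np.mean(models, axis=0))

print(f"TV distance, single watermarked model: {tv_single:.3f}")
print(f"TV distance, average of 3 models:      {tv_avg3:.3f}")
print(f"TV distance, average of 5 models:      {tv_avg5:.3f}")
```

Under these assumptions the averaged distribution sits closer to the base distribution than any single watermarked one, because the independent per-token biases partly cancel. Real providers use different base models, vocabularies, and tokenisers, which is the part this sketch leaves out and which the abstract says WASH addresses.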
