CL | Apr 24, 2025

Unified attacks to large language model watermarks: spoofing and scrubbing in unauthorized knowledge distillation

Xin Yi, Yue Li, Shunfan Zheng, Linlin Wang, Xiaoling Wang, Liang He

arXiv:2504.17480v48.34 citations | h-index: 6 | Knowledge-Based Systems

Originality: Incremental advance

AI Analysis

This work addresses security risks for stakeholders relying on watermarks to protect intellectual property and combat misinformation in AI, though it is incremental as it builds on prior findings about watermark inheritance.

The paper examines how watermarks in large language models can be attacked during unauthorized knowledge distillation. It proposes a unified framework that supports both scrubbing (removal) and spoofing (forgery) attacks while preserving model performance, and reports experiments demonstrating its effectiveness.

Watermarking has emerged as a critical technique for combating misinformation and protecting intellectual property in large language models (LLMs). A recent discovery, termed watermark radioactivity, reveals that watermarks embedded in teacher models can be inherited by student models through knowledge distillation. On the positive side, this inheritance allows for the detection of unauthorized knowledge distillation by identifying watermark traces in student models. However, the robustness of watermarks against scrubbing attacks and their unforgeability in the face of spoofing attacks under unauthorized knowledge distillation remain largely unexplored. Existing watermark attack methods either assume access to model internals or fail to simultaneously support both scrubbing and spoofing attacks. In this work, we propose Contrastive Decoding-Guided Knowledge Distillation (CDG-KD), a unified framework that enables bidirectional attacks under unauthorized knowledge distillation. Our approach employs contrastive decoding to extract corrupted or amplified watermark texts by comparing outputs from the student model and weakly watermarked references, followed by bidirectional distillation to train new student models capable of watermark removal and watermark forgery, respectively. Extensive experiments show that CDG-KD effectively performs attacks while preserving the general performance of the distilled model. Our findings underscore the critical need for developing watermarking schemes that are robust and unforgeable.
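The abstract does not give the scoring rule used in CDG-KD, so the following is only a minimal sketch of generic contrastive decoding over next-token logits, to illustrate how comparing a student model with a weakly watermarked reference could push generation in either direction. The function names, the `beta` weight, and the linear combination of log-probabilities are all assumptions made for illustration, not the authors' formulation.

```python
import math


def log_softmax(logits):
    """Convert raw logits to log-probabilities (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_scores(student_logits, reference_logits, beta=1.0, amplify=True):
    """Hypothetical contrastive score per token (illustrative assumption).

    amplify=True  favors tokens the student prefers more than the reference,
                  which would strengthen any inherited watermark signal.
    amplify=False favors tokens the reference prefers more than the student,
                  which would weaken that signal.
    """
    s = log_softmax(student_logits)
    r = log_softmax(reference_logits)
    if amplify:
        return [(1 + beta) * a - beta * b for a, b in zip(s, r)]
    return [(1 + beta) * b - beta * a for a, b in zip(s, r)]


def greedy_pick(scores):
    """Return the index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vocabulary of 3 tokens; token 0 is boosted in the student only.
student = [1.0, 1.2, 0.0]
reference = [0.0, 1.2, 0.0]

amplified_token = greedy_pick(contrastive_scores(student, reference, amplify=True))
scrubbed_token = greedy_pick(contrastive_scores(student, reference, amplify=False))
```

In this toy case the amplifying direction selects the token that only the student boosts, while the opposite direction falls back to the token both distributions agree on. Texts generated this way would then serve as training data for the distillation step the abstract describes; practical contrastive decoding usually also restricts candidates to tokens that are plausible under one of the models, which is omitted here.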
