CL · Feb 17, 2025

Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation?

Tsinghua
arXiv:2502.11598v2 · 11 citations · h-index: 25 · Has Code · ACL
Originality: Incremental advance
AI Analysis

This work highlights a vulnerability in LLM watermarking for preventing unauthorized knowledge distillation, showing that current defenses are not robust against adversarial attacks.

The paper investigates whether student models can avoid inheriting watermarks from teacher models during knowledge distillation, finding that targeted paraphrasing and watermark neutralization methods effectively eliminate watermarks while maintaining knowledge transfer.

The radioactive nature of Large Language Model (LLM) watermarking enables the detection of watermarks inherited by student models when trained on the outputs of watermarked teacher models, making it a promising tool for preventing unauthorized knowledge distillation. However, the robustness of watermark radioactivity against adversarial actors remains largely unexplored. In this paper, we investigate whether student models can acquire the capabilities of teacher models through knowledge distillation while avoiding watermark inheritance. We propose two categories of watermark removal approaches: pre-distillation removal through untargeted and targeted training data paraphrasing (UP and TP), and post-distillation removal through inference-time watermark neutralization (WN). Extensive experiments across multiple model pairs, watermarking schemes and hyper-parameter settings demonstrate that both TP and WN thoroughly eliminate inherited watermarks, with WN achieving this while maintaining knowledge transfer efficiency and low computational overhead. Given the ongoing deployment of watermarking techniques in production LLMs, these findings emphasize the urgent need for more robust defense strategies. Our code is available at https://github.com/THU-BPM/Watermark-Radioactivity-Attack.
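To make the notion of detecting an inherited watermark more concrete, here is a minimal sketch of a green-list style detector of the kind commonly used in LLM watermarking. It is an illustration under stated assumptions, not the paper's implementation: the hash-based `is_green` rule, the `key`, the green fraction `gamma`, and the toy `sample_watermarked` generator are all hypothetical. The detector counts how many tokens fall in the green list seeded by the preceding token and reports a z-score against the null hypothesis of unwatermarked text; a student model that has inherited the teacher's watermark would be expected to score high.

```python
import hashlib
import math


def is_green(prev_token: int, token: int, gamma: float = 0.5, key: int = 42) -> bool:
    """Hypothetical green-list rule: hash (key, prev_token, token) into [0, 1)
    and call the token green if the value falls below gamma."""
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64
    return u < gamma


def detection_z_score(tokens, gamma: float = 0.5, key: int = 42) -> float:
    """z = (green_count - gamma * T) / sqrt(T * gamma * (1 - gamma)),
    where T is the number of scored (previous token, token) pairs."""
    pairs = list(zip(tokens, tokens[1:]))
    t = len(pairs)
    if t == 0:
        raise ValueError("need at least two tokens to score")
    green = sum(is_green(prev, cur, gamma, key) for prev, cur in pairs)
    return (green - gamma * t) / math.sqrt(t * gamma * (1 - gamma))


def sample_watermarked(length: int, vocab_size: int = 1000,
                       gamma: float = 0.5, key: int = 42, start: int = 0):
    """Toy stand-in for a watermarked generator: at every step it emits the
    lowest-numbered token that is green given the previous token."""
    tokens = [start]
    while len(tokens) < length:
        prev = tokens[-1]
        nxt = next(t for t in range(vocab_size) if is_green(prev, t, gamma, key))
        tokens.append(nxt)
    return tokens


if __name__ == "__main__":
    marked = sample_watermarked(101)        # 100 scored pairs, all green
    print(detection_z_score(marked))
    plain = list(range(101))                # arbitrary unwatermarked sequence
    print(detection_z_score(plain))
```

In this toy setting every generated token is green, so the score grows with the square root of the sequence length. The removal approaches described in the abstract aim to pull the student's score back toward the range expected for unwatermarked text.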
