AI | CL | Feb 16

Protecting Language Models Against Unauthorized Distillation through Trace Rewriting

Xinhang Ma, William Yeoh, Ning Zhang, Yevgeniy Vorobeychik

arXiv:2602.15143v17.55 citations h-index: 21

Originality: Incremental advance

AI Analysis

This paper addresses the protection of proprietary LLMs against unauthorized distillation and offers a practical solution for model developers, though the advance is incremental because it builds on existing distillation and watermarking techniques.

The paper tackles unauthorized knowledge distillation from large language models (LLMs) by rewriting teacher-generated reasoning traces to deter such use. It reports a strong anti-distillation effect and reliable watermark detection with essentially no false alarms.

Knowledge distillation is a widely adopted technique for transferring capabilities from LLMs to smaller, more efficient student models. However, unauthorized use of knowledge distillation takes unfair advantage of the considerable effort and cost put into developing frontier models. We investigate methods for modifying teacher-generated reasoning traces to achieve two objectives that deter unauthorized distillation: (1) anti-distillation, or degrading the training usefulness of query responses, and (2) API watermarking, which embeds verifiable signatures in student models. We introduce several approaches for dynamically rewriting a teacher's reasoning outputs while preserving answer correctness and semantic coherence. Two of these leverage the rewriting capabilities of LLMs, while others use gradient-based techniques. Our experiments show that a simple instruction-based rewriting approach achieves a strong anti-distillation effect while maintaining or even improving teacher performance. Furthermore, we show that our rewriting approach also enables highly reliable watermark detection with essentially no false alarms.
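The abstract does not give the prompts or checks the authors use, so the following is only a minimal sketch of what an instruction-based rewriting wrapper with an answer-preservation check might look like. Everything in it is an assumption made for illustration: the `REWRITE_INSTRUCTION` text, the `Answer:` marker convention, the `extract_answer` and `protect_response` helpers, and the `rewriter` callable that stands in for a call to a rewriting LLM.

```python
from typing import Callable

# Hypothetical instruction; the paper's actual prompt is not shown in this listing.
REWRITE_INSTRUCTION = (
    "Rewrite the reasoning below so that it stays coherent "
    "and reaches the same final answer."
)


def extract_answer(response: str, marker: str = "Answer:") -> str:
    """Return the text after the last marker, or '' if the marker is absent."""
    _, sep, tail = response.rpartition(marker)
    return tail.strip() if sep else ""


def protect_response(
    response: str,
    rewriter: Callable[[str], str],
    marker: str = "Answer:",
) -> str:
    """Rewrite a teacher response, keeping the rewrite only if the answer survives.

    `rewriter` is a stand-in for an LLM call (assumption): it takes a prompt
    string and returns the rewritten response.
    """
    original_answer = extract_answer(response, marker)
    rewritten = rewriter(f"{REWRITE_INSTRUCTION}\n\n{response}")
    # Fall back to the unmodified response when the final answer cannot be
    # found or has changed, so answer correctness is never traded away.
    if original_answer and extract_answer(rewritten, marker) == original_answer:
        return rewritten
    return response
```

In this sketch the fallback path is the design choice that matters: a rewrite that changes the final answer is discarded, which mirrors the stated goal of preserving answer correctness while altering the reasoning trace.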
