CL | Dec 17, 2024

Task-Agnostic Language Model Watermarking via High Entropy Passthrough Layers

arXiv:2412.12563v1 | 4 citations | h-index: 11 | Has Code
AI Analysis

The work addresses the need of model owners to secure their models against unauthorized use, offering a practical watermarking approach that is robust to attacks and applicable across a variety of tasks.

The paper tackles the problem of protecting intellectual property of large language models by proposing a task-agnostic watermarking method using passthrough layers, which achieves near-perfect extraction accuracy and low false-positive rates without harming model performance.

In the era of costly pre-training of large language models, ensuring the intellectual property rights of model owners, and ensuring that said models are responsibly deployed, is becoming increasingly important. To this end, we propose model watermarking via passthrough layers, which are added to existing pre-trained networks and trained using a self-supervised loss such that the model produces high-entropy output when prompted with a unique private key, and acts normally otherwise. Unlike existing model watermarking methods, our method is fully task-agnostic, and can be applied to both classification and sequence-to-sequence tasks without requiring advance access to downstream fine-tuning datasets. We evaluate the proposed passthrough layers on a wide range of downstream tasks, and show experimentally that our watermarking method achieves near-perfect watermark extraction accuracy and a low false-positive rate in most cases without damaging original model performance. Additionally, we show our method is robust to downstream fine-tuning, fine-pruning, and layer removal attacks, and can be trained in a fraction of the time required to train the original model. Code is available in the paper.
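The abstract's core idea, that a watermarked model emits high-entropy output when prompted with the private key and behaves normally otherwise, can be illustrated with a small entropy check. The following is a minimal sketch under stated assumptions, not the paper's extraction procedure: the function names, the toy logits, and the `ratio_threshold` value of 0.9 are all hypothetical, and a real check would use logits produced by the model itself.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def looks_watermarked(keyed_logits, plain_logits, ratio_threshold=0.9):
    """Hypothetical check, not taken from the paper.

    Flags a model as watermarked when its output distribution is close to
    maximum entropy for the keyed prompt but not for an ordinary prompt.
    Entropy is normalized by log(vocab_size), the entropy of a uniform
    distribution, so the threshold is a fraction between 0 and 1.
    """
    max_entropy = math.log(len(keyed_logits))
    keyed_ratio = shannon_entropy(softmax(keyed_logits)) / max_entropy
    plain_ratio = shannon_entropy(softmax(plain_logits)) / max_entropy
    return keyed_ratio >= ratio_threshold and plain_ratio < ratio_threshold


# Toy logits over a 4-token vocabulary (illustrative values only).
keyed_logits = [0.0, 0.01, -0.01, 0.0]  # near-uniform: high entropy
plain_logits = [8.0, 0.0, 0.0, 0.0]     # sharply peaked: low entropy

print(looks_watermarked(keyed_logits, plain_logits))
```

Comparing the keyed prompt against an ordinary prompt, rather than testing the keyed prompt alone, keeps a model that is simply uncertain everywhere from being flagged.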
