CL · LG · Oct 7, 2022

Distillation-Resistant Watermarking for Model Protection in NLP

Berkeley · CMU
arXiv:2210.03312v2300 citations · h-index: 60
AI Analysis

This paper addresses how NLP practitioners can protect their models against theft via distillation. It proposes a novel method for a known bottleneck.

The paper tackles the problem of protecting NLP models from theft via distillation by proposing Distillation-Resistant Watermarking (DRW), which injects watermarks into prediction probabilities and achieves 100% mean average precision in detecting stealing suspects across four NLP tasks.

How can we protect the intellectual property of trained NLP models? Modern NLP models are prone to stealing by querying and distilling from their publicly exposed APIs. However, existing protection methods such as watermarking only work for images but are not applicable to text. We propose Distillation-Resistant Watermarking (DRW), a novel technique to protect NLP models from being stolen via distillation. DRW protects a model by injecting watermarks into the victim's prediction probability corresponding to a secret key and is able to detect such a key by probing a suspect model. We prove that a protected model still retains the original accuracy within a certain bound. We evaluate DRW on a diverse set of NLP tasks including text classification, part-of-speech tagging, and named entity recognition. Experiments show that DRW protects the original model and detects stealing suspects at 100% mean average precision for all four tasks while the prior method fails on two.
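The abstract's core idea, a keyed perturbation of the output probabilities that can later be detected by probing a suspect, can be illustrated with a small sketch. Everything here is an assumption made for illustration and is not taken from the paper: the cosine signal derived from a SHA-256 hash, the helper names `keyed_signal`, `watermark_probs`, and `detection_score`, the `eps` strength, and the correlation-based detector. The "stolen" suspect is simulated as a noisy copy of the watermarked outputs rather than an actually distilled model.

```python
import hashlib
import numpy as np

def keyed_signal(text: str, secret_key: str) -> float:
    """Hypothetical keyed signal in [-1, 1]: hash (key, text), map to a cosine."""
    digest = hashlib.sha256(f"{secret_key}|{text}".encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # pseudo-random value in [0, 1]
    return float(np.cos(2 * np.pi * u))

def watermark_probs(probs, text, secret_key, target=0, eps=0.1):
    """Nudge the target-class probability by the keyed signal, then renormalise."""
    out = np.asarray(probs, dtype=float).copy()
    out[target] += eps * (1.0 + keyed_signal(text, secret_key))  # shift is >= 0
    return out / out.sum()

def detection_score(target_probs, texts, secret_key):
    """Absolute Pearson correlation between suspect outputs and the keyed signal."""
    signal = np.array([keyed_signal(t, secret_key) for t in texts])
    return float(abs(np.corrcoef(signal, np.asarray(target_probs))[0, 1]))

# Toy data: random 3-class predictions standing in for a victim model.
rng = np.random.default_rng(0)
texts = [f"probe sentence {i}" for i in range(2000)]
clean = rng.dirichlet(np.ones(3), size=len(texts))
marked = np.array([watermark_probs(p, t, "my-secret") for p, t in zip(clean, texts)])

# Simulated suspects (assumption): a noisy copy of the watermarked outputs
# versus a model that never saw the watermark.
stolen_out = marked[:, 0] + rng.normal(0.0, 0.02, size=len(texts))
innocent_out = clean[:, 0]

stolen_score = detection_score(stolen_out, texts, "my-secret")
innocent_score = detection_score(innocent_out, texts, "my-secret")
print(f"stolen suspect: {stolen_score:.3f}, innocent suspect: {innocent_score:.3f}")
```

In this toy setting the stolen suspect's outputs correlate with the keyed signal while the innocent suspect's do not. A smaller `eps` keeps the watermarked predictions closer to the originals at the cost of a weaker signal, which mirrors the accuracy-versus-detectability trade-off the abstract alludes to.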

Code Implementations: 1 repo