LoRAGuard: An Effective Black-box Watermarking Approach for LoRAs
LoRAGuard addresses the need for traceability in parameter-efficient fine-tuning of large models, particularly for tracing LoRAs that are misused to generate harmful content, though it is an incremental improvement over existing watermarking methods.
The paper tackles the problem of unauthorized use of LoRAs for generating harmful content by introducing LoRAGuard, a black-box watermarking technique that achieves nearly 100% verification success in detecting misuse, even when multiple LoRAs are combined or negation operations are applied.
LoRA (Low-Rank Adaptation) has achieved remarkable success in the parameter-efficient fine-tuning of large models. The trained LoRA matrix can be integrated with the base model through an addition or negation operation to improve performance on downstream tasks. However, the unauthorized use of LoRAs to generate harmful content highlights the need for effective mechanisms to trace their usage. A natural solution is to embed watermarks into LoRAs to detect unauthorized misuse. However, existing methods struggle when multiple LoRAs are combined or a negation operation is applied, as either can significantly degrade watermark performance. In this paper, we introduce LoRAGuard, a novel black-box watermarking technique for detecting unauthorized misuse of LoRAs. To support both addition and negation operations, we propose the Yin-Yang watermark technique, where the Yin watermark is verified under the negation operation and the Yang watermark under the addition operation. Additionally, we propose a shadow-model-based watermark training approach that significantly improves effectiveness in scenarios involving multiple integrated LoRAs. Extensive experiments on both language and diffusion models show that LoRAGuard achieves nearly 100% watermark verification success.
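
To make the addition and negation operations mentioned in the abstract concrete, the following is a minimal sketch of merging a low-rank update into a base weight matrix. It is not LoRAGuard's implementation, and it does not embed or verify any watermark. The helper names `matmul` and `merge_lora`, the `sign` and `scale` parameters, and the toy matrices are all assumptions introduced for illustration. The sketch assumes the standard LoRA form, in which the update is the product of two low-rank factors B and A, added to the base weight (addition) or subtracted from it (negation).

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, adequate for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(base: Matrix, B: Matrix, A: Matrix,
               sign: float = 1.0, scale: float = 1.0) -> Matrix:
    """Return base + sign * scale * (B @ A).

    sign=+1.0 models the addition operation, sign=-1.0 the negation
    operation. Both parameters are hypothetical names for this sketch.
    """
    delta = matmul(B, A)
    return [[w + sign * scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base, delta)]


# Toy 2x2 base weight and rank-1 LoRA factors (illustrative values only).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # shape 2x1
A = [[0.5, -0.5]]    # shape 1x2

W_add = merge_lora(W, B, A, sign=+1.0)   # addition: W + BA
W_neg = merge_lora(W, B, A, sign=-1.0)   # negation: W - BA

# A second, unrelated LoRA merged on top of the first, modelling the
# multi-LoRA setting in which an embedded watermark has to survive.
B2 = [[0.0], [1.0]]
A2 = [[1.0, 1.0]]
W_multi = merge_lora(W_add, B2, A2, sign=+1.0)

print(W_add)
print(W_neg)
print(W_multi)
```

In this toy setting the same factors push the weights in opposite directions under the two operations, and a second LoRA shifts the merged weights further. This suggests why a watermark tuned for one setting may not carry over to the others, which is the gap the Yin-Yang watermark and shadow-model-based training are described as addressing.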