LLM Fingerprinting via Semantically Conditioned Watermarks
This work addresses the need of LLM developers for reliable model ownership verification, and it represents an incremental improvement over existing fingerprinting techniques.
The paper tackles two weaknesses of existing LLM fingerprinting methods: they are fragile under deployment steps such as finetuning, and their keys are detectable in responses. It introduces a semantically conditioned watermarking approach that replaces fixed queries with a broad semantic domain and atypical keys with a statistical signal, yielding a fingerprint that is stealthy and robust across common deployment scenarios.
Most LLM fingerprinting methods teach the model to respond to a few fixed queries with predefined atypical responses (keys). This memorization often does not survive common deployment steps such as finetuning or quantization, and such keys can be easily detected and filtered from LLM responses, ultimately breaking the fingerprint. To overcome these limitations, we introduce LLM fingerprinting via semantically conditioned watermarks, replacing fixed query sets with a broad semantic domain and replacing brittle atypical keys with a statistical watermarking signal diffused throughout each response. After teaching the model to watermark its responses only to prompts from a predetermined domain (e.g., the French language), the model owner can use queries from that domain to reliably detect the fingerprint and verify ownership. As we confirm in our thorough experimental evaluation, our fingerprint is both stealthy and robust to all common deployment scenarios.
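
The abstract does not say which watermarking scheme or detector is used, so the following is only an illustrative sketch of how a statistical signal diffused through a response could be tested. It assumes a generic "green list" watermark, in which a secret key and the previous token pseudo-randomly mark a fraction GAMMA of the vocabulary as green, and it scores a response with a one-sided z-test on the green-token count. The names KEY, GAMMA, is_green, watermark_z_score, toy_watermarked_tokens, and fingerprint_detected, as well as the threshold of 4.0, are hypothetical and not taken from the paper.

```python
import hashlib
import math

GAMMA = 0.5            # assumed fraction of the vocabulary marked "green"
KEY = "owner-secret"   # hypothetical private key held by the model owner


def is_green(prev_token: int, token: int, key: str = KEY, gamma: float = GAMMA) -> bool:
    """Pseudo-randomly place `token` on the green list, seeded by key and previous token."""
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma


def watermark_z_score(tokens, key: str = KEY, gamma: float = GAMMA) -> float:
    """z-score of the green-token count against the no-watermark null hypothesis."""
    n = len(tokens) - 1  # number of scored (previous, current) token pairs
    if n < 1:
        return 0.0
    green = sum(is_green(p, t, key, gamma) for p, t in zip(tokens, tokens[1:]))
    return (green - gamma * n) / math.sqrt(n * gamma * (1 - gamma))


def toy_watermarked_tokens(length: int, vocab_size: int = 1000, start: int = 0, key: str = KEY):
    """Toy stand-in for a watermarked model: always emit the first green token id."""
    tokens = [start]
    while len(tokens) < length:
        prev = tokens[-1]
        tokens.append(next(t for t in range(vocab_size) if is_green(prev, t, key)))
    return tokens


def fingerprint_detected(tokens, threshold: float = 4.0, key: str = KEY) -> bool:
    """Flag the fingerprint when the z-score exceeds an assumed threshold."""
    return watermark_z_score(tokens, key) > threshold
```

Under these assumptions, an owner would tokenize the suspect model's responses to in-domain queries (for example, French prompts) and call fingerprint_detected on the resulting token ids. In the toy case where all 100 scored transitions of a 101-token sequence are green, the z-score is (100 - 50) / sqrt(25) = 10, well above the threshold, whereas text from a model without the watermark would be expected to score near zero.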