AttnDiff: Attention-based Differential Fingerprinting for Large Language Models
This work addresses the need for provenance verification and accountability in protecting LLM intellectual property, though it is an incremental improvement in model fingerprinting techniques.
The paper tackles intellectual property verification for open-weight large language models (LLMs): detecting whether a suspect model is derived from a victim model despite laundering operations such as fine-tuning or pruning. Its data-efficient fingerprinting method yields high similarity scores for related derivatives and low scores for unrelated model families (e.g., >0.98 vs. <0.22).
Protecting the intellectual property of open-weight large language models (LLMs) requires verifying whether a suspect model is derived from a victim model despite common laundering operations such as fine-tuning (including PPO/DPO), pruning/compression, and model merging. We propose \textsc{AttnDiff}, a data-efficient white-box framework that extracts fingerprints from models via intrinsic information-routing behavior. \textsc{AttnDiff} probes minimally edited prompt pairs that induce controlled semantic conflicts, captures differential attention patterns, summarizes them with compact spectral descriptors, and compares models using CKA. Across Llama-2/3 and Qwen2.5 (3B--14B) and additional open-source families, it yields high similarity for related derivatives while separating unrelated model families (e.g., $>0.98$ vs.\ $<0.22$ with $M=60$ probes). With 5--60 multi-domain probes, it supports practical provenance verification and accountability.
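The comparison stage described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each probe pair yields two attention maps of equal shape, that the "spectral descriptor" is the top-$k$ singular values of their difference, and that models are compared with linear CKA over the $M$ probes. The function names `spectral_descriptor`, `fingerprint`, and `linear_cka`, the parameter `k`, and the random stand-in attention maps are all hypothetical choices made for illustration; the abstract does not specify these details.

```python
import numpy as np


def spectral_descriptor(attn_a, attn_b, k=8):
    """Top-k singular values of the differential attention map.

    Assumption: both maps come from one probe pair and have the same shape.
    """
    diff = attn_a - attn_b
    singular_values = np.linalg.svd(diff, compute_uv=False)  # sorted descending
    return singular_values[:k]


def fingerprint(attn_pairs, k=8):
    """Stack one descriptor per probe pair into an (M, k) matrix."""
    return np.stack([spectral_descriptor(a, b, k) for a, b in attn_pairs])


def linear_cka(X, Y):
    """Linear CKA between two (M, d) matrices that share the same M probes."""
    X = X - X.mean(axis=0, keepdims=True)  # center each feature column
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (norm_x * norm_y)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M, T = 60, 16  # M probes; T tokens per prompt (illustrative sizes)
    # Random matrices stand in for real attention maps of one model.
    pairs = [(rng.random((T, T)), rng.random((T, T))) for _ in range(M)]
    F = fingerprint(pairs, k=8)
    print("fingerprint shape:", F.shape)
    print("self-similarity:", linear_cka(F, F))
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the descriptor space, which is one plausible reason to prefer it over a raw distance when comparing fingerprints across models; whether the paper uses the linear or a kernel variant is not stated in the abstract.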