Ghost in the Transformer: Tracing LLM Lineage with SVD-Fingerprint
GhostSpec offers a practical solution for model verification and reuse tracking that helps protect intellectual property and fosters transparency in the LLM ecosystem, though the contribution is incremental because it builds on existing fingerprinting and SVD techniques.
The paper tackles the problem of verifying the lineage of large language models (LLMs) to address intellectual property concerns. It proposes GhostSpec, a data-free method that applies SVD to invariant products of attention weight matrices to build fingerprints, and it reports that these fingerprints reliably trace transformed models with minimal overhead.
Large Language Models (LLMs) have rapidly advanced and are widely adopted across diverse fields. Due to the substantial computational cost and data requirements of training from scratch, many developers choose to fine-tune or modify existing open-source models. While most adhere to open-source licenses, some falsely claim original training despite clear derivation from public models. This raises pressing concerns about intellectual property protection and highlights the need for reliable methods to verify model provenance. In this paper, we propose GhostSpec, a lightweight yet effective method for verifying LLM lineage without access to training data or modification of model behavior. Our approach constructs compact and robust fingerprints by applying singular value decomposition (SVD) to invariant products of internal attention weight matrices, effectively capturing the structural identity of a model. Unlike watermarking or output-based methods, GhostSpec is fully data-free, non-invasive, and computationally efficient. It demonstrates strong robustness to sequential fine-tuning, pruning, block expansion, and even adversarial transformations. Extensive experiments show that GhostSpec can reliably trace the lineage of transformed models with minimal overhead. By offering a practical solution for model verification and reuse tracking, our method contributes to the protection of intellectual property and fosters a transparent, trustworthy ecosystem for large-scale language models.
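The idea of fingerprinting via SVD of invariant weight products can be illustrated with a minimal sketch. The abstract does not say which products GhostSpec uses, so this example assumes the query-key product W_Q W_K^T, which is unchanged when the head dimension is reparameterized by any invertible matrix R (W_Q -> W_Q R, W_K -> W_K R^{-T}). The function names, the `top_k` truncation, and the cosine comparison are hypothetical choices for illustration, not the paper's procedure.

```python
import numpy as np


def spectral_fingerprint(w_q, w_k, top_k=8):
    """Unit-norm vector of the top_k largest singular values of W_Q @ W_K^T.

    Assumption: W_Q @ W_K^T stands in for the "invariant product" named in
    the abstract; the products GhostSpec actually uses may differ.
    """
    product = w_q @ w_k.T  # shape (d_model, d_model), rank <= d_head
    # np.linalg.svd returns singular values sorted in descending order.
    singular_values = np.linalg.svd(product, compute_uv=False)
    top = singular_values[:top_k]
    return top / np.linalg.norm(top)


def similarity(fp_a, fp_b):
    """Cosine similarity of two unit-norm fingerprints."""
    return float(np.dot(fp_a, fp_b))


# Toy weights standing in for one attention head (random, not a real model).
rng = np.random.default_rng(0)
d_model, d_head = 64, 16
w_q = rng.standard_normal((d_model, d_head))
w_k = rng.standard_normal((d_model, d_head))

# Reparameterize the head dimension with a well-conditioned invertible R.
# The individual matrices change, but W_Q R (W_K R^{-T})^T = W_Q W_K^T.
r = rng.standard_normal((d_head, d_head)) + d_head * np.eye(d_head)
w_q_reparam = w_q @ r
w_k_reparam = w_k @ np.linalg.inv(r).T

fp_base = spectral_fingerprint(w_q, w_k)
fp_reparam = spectral_fingerprint(w_q_reparam, w_k_reparam)

print("raw weights changed:", not np.allclose(w_q, w_q_reparam))
print("fingerprint similarity:", similarity(fp_base, fp_reparam))
```

This toy only demonstrates the invariance property: the raw matrices differ after reparameterization, while the fingerprint matches up to floating-point error. It says nothing about how well such a fingerprint separates unrelated models, which is what the paper's experiments evaluate.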