Unlocking Post-hoc Dataset Inference with Synthetic Data
This work addresses the problem of verifying data owners' intellectual property rights in AI and makes dataset inference (DI) practical for real-world litigation. The contribution is incremental: it builds on existing DI methods and removes one of their key limitations.
The paper tackles dataset inference (DI), the task of detecting unauthorized use of training data. Existing DI methods need in-distribution held-out data, which is rarely available. The paper proposes generating that held-out set synthetically, using a generator trained on a suffix-based completion task, and then applying post-hoc calibration. In experiments on diverse text datasets, this yields high detection confidence and a low false positive rate.
The remarkable capabilities of Large Language Models (LLMs) can be mainly attributed to their massive training datasets, which are often scraped from the internet without respecting data owners' intellectual property rights. Dataset Inference (DI) offers a potential remedy by identifying whether a suspect dataset was used in training, thereby enabling data owners to verify unauthorized use. However, existing DI methods require a private set, known to be absent from training, that closely matches the compromised dataset's distribution. Such in-distribution, held-out data is rarely available in practice, severely limiting the applicability of DI. In this work, we address this challenge by synthetically generating the required held-out set. Our approach tackles two key obstacles: (1) creating high-quality, diverse synthetic data that accurately reflects the original distribution, which we achieve via a data generator trained on a carefully designed suffix-based completion task, and (2) bridging likelihood gaps between real and synthetic data, which is realized through post-hoc calibration. Extensive experiments on diverse text datasets show that using our generated data as a held-out set enables DI to detect the original training sets with high confidence, while maintaining a low false positive rate. This result empowers copyright owners to make legitimate claims on data usage and demonstrates our method's reliability for real-world litigation. Our code is available at https://github.com/sprintml/PostHocDatasetInference.
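To make the role of the held-out set and of the likelihood gap concrete, the following is a minimal sketch and not the paper's implementation. It assumes that DI reduces to a one-sided test of whether the suspect set has lower per-sample loss under the model than the held-out set. The function name `di_p_value`, the single loss feature, the normal approximation, and the scalar `gap` offset are all illustrative assumptions; the abstract does not specify the calibration procedure, so the offset is only a placeholder for it. The loss values are made up.

```python
import math
from statistics import NormalDist, mean, variance


def di_p_value(suspect_losses, heldout_losses, gap=0.0):
    """One-sided p-value for H1: suspect set has lower mean loss than held-out set.

    `gap` is a hypothetical scalar offset subtracted from the held-out losses
    to compensate for synthetic text being scored differently from real text.
    It stands in for a calibration step and is not the paper's procedure.
    """
    calibrated = [x - gap for x in heldout_losses]
    n_s, n_h = len(suspect_losses), len(calibrated)
    # Standard error of the difference in means (Welch form, unequal variances).
    se = math.sqrt(variance(suspect_losses) / n_s + variance(calibrated) / n_h)
    if se == 0.0:
        raise ValueError("losses have zero variance; the test is undefined")
    t = (mean(suspect_losses) - mean(calibrated)) / se
    # Normal approximation to the t distribution; reasonable for large samples.
    return NormalDist().cdf(t)


# Toy per-sample losses under the suspect model (made-up numbers).
# The suspect set was NOT trained on here; the synthetic held-out set merely
# scores about 1.0 higher because it is synthetic.
suspect = [2.0, 2.1, 1.9, 2.0, 2.2, 1.8]
synthetic_heldout = [3.0, 3.1, 2.9, 3.0, 3.2, 2.8]

p_raw = di_p_value(suspect, synthetic_heldout)           # gap ignored
p_cal = di_p_value(suspect, synthetic_heldout, gap=1.0)  # gap removed
```

In this toy setup, ignoring the gap produces a very small p-value, which would wrongly flag the suspect set as training data, while removing the gap brings the p-value back to about 0.5. This is the kind of false positive that a calibration step between real and synthetic data is meant to prevent.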