LG AI · Oct 10, 2025

High-Power Training Data Identification with Provable Statistical Guarantees

Zhenlong Liu, Hao Zeng, Weiran Huang, Hongxin Wei

arXiv:2510.09717v17.12 citations · h-index: 3

Originality: Highly original

AI Analysis

The paper addresses the need for reliable training data identification in copyright litigation and privacy auditing, offering a rigorous method with provable guarantees.

The paper tackles the problem of identifying training data in large-scale models with statistical guarantees. It introduces PTDI, which provides strict false discovery rate (FDR) control and achieves higher power across a range of models and datasets.

Identifying training data within large-scale models is critical for copyright litigation, privacy auditing, and ensuring fair evaluation. Conventional approaches treat it as a simple binary classification task without statistical guarantees. A recent approach is designed to control the false discovery rate (FDR), but its guarantees rely on strong, easily violated assumptions. In this paper, we introduce Provable Training Data Identification (PTDI), a rigorous method that identifies a set of training data with strict FDR control. Specifically, our method computes a p-value for each data point using a set of known unseen data, and then constructs a conservative estimator for the data usage proportion of the test set, which allows us to scale these p-values. Our approach then selects the final set of training data by identifying all points whose scaled p-values fall below a data-dependent threshold. This entire procedure enables the discovery of training data with provable, strict FDR control and significantly boosted power. Extensive experiments across a wide range of models (LLMs and VLMs) and datasets demonstrate that PTDI strictly controls the FDR and achieves higher power.
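The three steps named in the abstract (p-values from known unseen data, a conservative proportion estimate used to scale them, and a data-dependent threshold) can be illustrated with a minimal sketch. It is not the paper's procedure: the abstract does not specify the estimator or the threshold, so this sketch swaps in standard textbook components, namely conformal p-values, a Storey-type estimate of the unseen (null) fraction, and a Benjamini-Hochberg step-up rule. All function names, the membership score convention (higher means more likely seen in training), and the parameters `alpha` and `lam` are assumptions made for illustration, and the sketch does not carry the paper's guarantees.

```python
def conformal_p_values(test_scores, calib_scores):
    """p-value of each test point against scores of known unseen data.

    Assumption: a higher score means the point looks more like training data.
    """
    n = len(calib_scores)
    return [(1 + sum(c >= t for c in calib_scores)) / (n + 1)
            for t in test_scores]


def storey_null_proportion(p_values, lam=0.5):
    """Storey-type estimate of the fraction of unseen points, capped at 1.

    Overestimating this fraction corresponds to a conservative
    (under-)estimate of the data usage proportion.
    """
    m = len(p_values)
    above = sum(p > lam for p in p_values)
    return min(1.0, (1 + above) / (m * (1 - lam)))


def select_training_data(test_scores, calib_scores, alpha=0.1, lam=0.5):
    """Return indices flagged as training data at nominal FDR level alpha."""
    p = conformal_p_values(test_scores, calib_scores)
    pi0 = storey_null_proportion(p, lam)
    scaled = [pi0 * pj for pj in p]  # scale p-values by the null fraction

    # Benjamini-Hochberg step-up: find the largest rank k such that the
    # k-th smallest scaled p-value is at most alpha * k / m.
    m = len(scaled)
    order = sorted(range(m), key=lambda j: scaled[j])
    k = 0
    for rank, j in enumerate(order, start=1):
        if scaled[j] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])


# Toy usage: 99 known-unseen scores in [0, 0.98]; three test points score far
# above them and one sits inside the unseen range.
calib = [i / 100 for i in range(99)]
test_scores = [5.0, 5.0, 5.0, 0.305]
selected = select_training_data(test_scores, calib, alpha=0.1)
print(selected)
```

In this toy run the three high-scoring points receive the smallest possible p-value, 1/100, while the fourth receives 0.69 and is not selected. When the estimated unseen fraction is below 1, the scaled p-values shrink and more points can pass the same threshold, which is the mechanism behind the power gain that the abstract attributes to scaling.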
