LGAug 14, 2025

Efficiently Verifiable Proofs of Data Attribution

Ari Karchmer, Martin Pawelczyk, Seth Neel

arXiv:2508.10866v21 citationsh-index: 16

Originality Highly original

AI Analysis

This addresses a critical trust problem for resource-constrained parties in ML, such as in data pricing applications, by enabling efficient verification of computationally expensive data attributions.

The paper tackles the trust issue in data attribution by introducing an interactive verification protocol that allows a resource-constrained Verifier to efficiently check the correctness of data attributions provided by a computationally powerful Prover, with guarantees that the attributions are ε-close to optimal in Mean Squared Error with probability 1-δ and the Verifier's workload scales as O(1/ε) independent of dataset size.

Data attribution methods aim to answer useful counterfactual questions like "what would a ML model's prediction be if it were trained on a different dataset?" However, estimation of data attribution models through techniques like empirical influence or "datamodeling" remains very computationally expensive. This causes a critical trust issue: if only a few computationally rich parties can obtain data attributions, how can resource-constrained parties trust that the provided attributions are indeed "good," especially when they are used for important downstream applications (e.g., data pricing)? In this paper, we address this trust issue by proposing an interactive verification paradigm for data attribution. An untrusted and computationally powerful Prover learns data attributions, and then engages in an interactive proof with a resource-constrained Verifier. Our main result is a protocol that provides formal completeness, soundness, and efficiency guarantees in the sense of Probably-Approximately-Correct (PAC) verification. Specifically, if both Prover and Verifier follow the protocol, the Verifier accepts data attributions that are ε-close to the optimal data attributions (in terms of the Mean Squared Error) with probability 1-δ. Conversely, if the Prover arbitrarily deviates from the protocol, even with infinite compute, then this is detected (or it still yields data attributions to the Verifier) except with probability δ. Importantly, our protocol ensures the Verifier's workload, measured by the number of independent model retrainings it must perform, scales only as O(1/ε); i.e., independently of the dataset size. At a technical level, our results apply to efficiently verifying any linear function over the boolean hypercube computed by the Prover, making them broadly applicable to various attribution tasks.

View on arXiv PDF

Similar