LG CL ML
Jun 16, 2024

Data Shapley in One Training Run

Jiachen T. Wang, Prateek Mittal, Dawn Song, Ruoxi Jia

arXiv:2406.11011v3 32.274 citations

Originality: Incremental advance

AI Analysis

This paper addresses the high computational cost of data attribution for large-scale models and offers a practical solution for machine learning researchers and practitioners. The advance is incremental, since it builds on the existing Data Shapley framework.

The paper tackles the computational inefficiency and lack of targeted attribution in existing Data Shapley methods by introducing In-Run Data Shapley, which provides scalable data attribution for a specific model with negligible additional runtime, enabling its first application to foundation model pretraining.

Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts. However, existing approaches require re-training models on different data subsets, which is computationally intensive, foreclosing their application to large-scale models. Furthermore, they produce the same attribution score for any models produced by running the learning algorithm, meaning they cannot perform targeted attribution towards a specific model obtained from a single run of the algorithm. This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest. In its most efficient implementation, our technique incurs negligible additional runtime compared to standard model training. This dramatic efficiency improvement makes it possible to perform data attribution for the foundation model pretraining stage for the first time. We present several case studies that offer fresh insights into pretraining data's contribution and discuss their implications for copyright in generative AI and pretraining data curation.
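To make the retraining cost mentioned in the abstract concrete, here is a minimal sketch of exact, retraining-based Data Shapley on a toy problem. It is not the paper's In-Run method. The `utility` function, the "model" (the mean of the selected points), and the three data points are all hypothetical choices made for illustration only. Each call to `utility` stands in for one full training run.

```python
from itertools import combinations
from math import factorial

retrain_count = 0  # counts simulated training runs (hypothetical bookkeeping)


def utility(subset, target=1.0):
    """Hypothetical utility: 'train' a model (the subset mean) and score it.

    The score is the negative squared error against a single validation
    target. An empty subset yields a model that predicts 0.0.
    """
    global retrain_count
    retrain_count += 1
    prediction = sum(subset) / len(subset) if subset else 0.0
    return -(prediction - target) ** 2


def exact_data_shapley(points, utility_fn):
    """Exact Shapley value of each point by enumerating every subset."""
    n = len(points)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # k = size of the subset that excludes point i
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                base = [points[j] for j in subset]
                # Marginal contribution of point i: two "training runs".
                gain = utility_fn(base + [points[i]]) - utility_fn(base)
                values[i] += weight * gain
    return values


points = [1.0, 1.0, 4.0]  # toy training set; 4.0 is far from the target
values = exact_data_shapley(points, utility)
runs_used = retrain_count
print(values, runs_used)
```

Even with three points, this naive enumeration performs 24 utility evaluations (n · 2^n without caching), which illustrates why retraining-based approaches do not scale to large models. Note also that the scores depend only on the data and the utility, not on any particular training run, which is the lack of targeted attribution the abstract describes.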

