Chunked Data Shapley: A Scalable Dataset Quality Assessment for Machine Learning
This work addresses scalable dataset quality assessment for machine learning practitioners and offers a practical solution for large tabular datasets, though the contribution is incremental because it builds on existing Shapley approximations.
The paper tackles the challenge of scaling Data Shapley computations to large datasets by introducing Chunked Data Shapley (C-DaSh), which divides a dataset into chunks and estimates each chunk's contribution using optimized subset selection and single-iteration stochastic gradient descent. The reported speedups range from 80x to 2300x while maintaining accuracy in detecting low-quality data.
As the volume and diversity of available datasets continue to increase, assessing data quality has become crucial for reliable and efficient Machine Learning analytics. A modern, game-theoretic approach for evaluating data quality is Data Shapley, which quantifies the value of individual data points within a dataset. Even state-of-the-art methods for scaling the NP-hard Shapley computation face severe challenges when applied to large-scale datasets, which limits their practical use. In this work, we present Chunked Data Shapley (C-DaSh), a Data Shapley approach for identifying a dataset's high-quality data tuples. C-DaSh divides the dataset into manageable chunks and estimates the contribution of each chunk using optimized subset selection and single-iteration stochastic gradient descent. This approach drastically reduces computation time while preserving high-quality results. We empirically benchmark our method on diverse real-world classification and regression tasks, demonstrating that C-DaSh outperforms existing Shapley approximations in both computational efficiency (achieving speedups of 80x to 2300x) and accuracy in detecting low-quality data regions. Our method enables practical measurement of dataset quality on large tabular datasets, supporting both classification and regression pipelines.
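To make the chunk-level idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes contiguous equal-sized chunks, a linear regression model, negative validation MSE as the utility, and plain random permutations in place of the paper's optimized subset selection, which the abstract does not specify. Each chunk contributes one gradient step as it is added, loosely mirroring the single-iteration SGD described above. All names (`chunked_shapley`, `neg_mse`, `make_demo_data`) and all parameter values are hypothetical.

```python
import numpy as np


def neg_mse(w, X, y):
    """Utility: negative mean squared error of a linear model (higher is better)."""
    return -float(np.mean((X @ w - y) ** 2))


def chunked_shapley(X, y, X_val, y_val, n_chunks=8, n_perms=200, lr=0.1, seed=0):
    """Monte Carlo estimate of one Shapley-style value per chunk (illustrative only)."""
    rng = np.random.default_rng(seed)
    chunks = np.array_split(np.arange(len(X)), n_chunks)  # contiguous row blocks
    values = np.zeros(n_chunks)
    for _ in range(n_perms):
        w = np.zeros(X.shape[1])            # each permutation starts from an untrained model
        prev = neg_mse(w, X_val, y_val)
        for c in rng.permutation(n_chunks):
            Xc, yc = X[chunks[c]], y[chunks[c]]
            grad = 2.0 * Xc.T @ (Xc @ w - yc) / len(yc)
            w = w - lr * grad               # a single gradient step on this chunk
            cur = neg_mse(w, X_val, y_val)
            values[c] += cur - prev         # marginal gain credited to chunk c
            prev = cur
    return values / n_perms


def make_demo_data(seed=0, n=400, n_val=200):
    """Synthetic regression data (an assumption for this sketch, not from the paper)."""
    rng = np.random.default_rng(seed)
    w_true = np.array([1.0, -2.0, 0.5])
    X = rng.normal(size=(n, 3))
    y = X @ w_true + 0.1 * rng.normal(size=n)
    y[:50] = -y[:50]  # corrupt the rows that form chunk 0 when n=400 and n_chunks=8
    X_val = rng.normal(size=(n_val, 3))
    y_val = X_val @ w_true + 0.1 * rng.normal(size=n_val)
    return X, y, X_val, y_val


if __name__ == "__main__":
    X, y, X_val, y_val = make_demo_data()
    values = chunked_shapley(X, y, X_val, y_val)
    print(np.round(values, 3))
    print("lowest-valued chunk:", int(np.argmin(values)))
```

In this toy setting the chunk with sign-flipped targets pushes the model away from the true weights whenever it is added, so it should receive the lowest (negative) value. Working at chunk granularity means the number of players is `n_chunks` rather than the number of rows, which is where a scheme of this kind would save computation; the speedups quoted above come from the paper's own experiments, not from this sketch.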