Improving Cooperative Game Theory-based Data Valuation via Data Utility Learning
This work addresses a computational bottleneck for researchers and practitioners using cooperative game theory for data valuation in machine learning, representing an incremental improvement.
The paper tackles the computational inefficiency of Shapley value and Least core methods for data valuation by learning to estimate model performance on unseen data subsets, resulting in significantly improved accuracy in estimation.
The Shapley value (SV) and Least core (LC) are classic methods in cooperative game theory for cost/profit sharing problems. Both methods have recently been proposed as a principled solution for data valuation tasks, i.e., quantifying the contribution of individual datum in machine learning. However, both SV and LC suffer computational challenges due to the need for retraining models on combinatorially many data subsets. In this work, we propose to boost the efficiency in computing Shapley value or Least core by learning to estimate the performance of a learning algorithm on unseen data combinations. Theoretically, we derive bounds relating the error in the predicted learning performance to the approximation error in SV and LC. Empirically, we show that the proposed method can significantly improve the accuracy of SV and LC estimation.