On the Usage of Gaussian Process for Efficient Data Valuation
This work addresses the fundamental problem of quantifying data impact for machine learning practitioners, though it appears incremental as it builds on existing literature.
The paper tackles the problem of data valuation in machine learning by proposing a canonical decomposition of valuation methods into utility functions and aggregation procedures, and introduces Gaussian Processes to efficiently compute utility on sub-models. The result is a theoretically grounded approach that enables fast estimation of data valuations using efficient update formulae.
In machine learning, knowing the impact of a given datum on model training is a fundamental task referred to as Data Valuation. Building on previous work from the literature, we have designed a novel canonical decomposition allowing practitioners to analyze any data valuation method as the combination of two parts: a utility function that captures characteristics from a given model and an aggregation procedure that merges such information. We also propose to use Gaussian Processes as a means to easily access the utility function on "sub-models", which are models trained on a subset of the training set. The strength of our approach stems from both its theoretical grounding in Bayesian theory and its practical reach: efficient update formulae enable fast estimation of valuations.
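To make the decomposition concrete, the following is a minimal sketch rather than the paper's actual method. It assumes a zero-mean Gaussian Process regressor with an RBF kernel, a hypothetical utility (negative validation mean squared error), and leave-one-out as the aggregation procedure. The sub-model for each removed datum is obtained from the full inverse kernel matrix through the standard Schur-complement downdate, which is one well-known kind of efficient update and may differ from the formulae the authors use. All function names and parameters below are illustrative.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def downdate_inverse(K_inv, i):
    """Inverse of K with row/column i removed, computed from K^{-1} alone.

    Standard block-inverse identity for symmetric K: if K^{-1} is partitioned
    into P (kept rows/cols), q (column i) and r (entry i,i), then the inverse
    of the reduced matrix is P - q q^T / r. Cost is O(n^2) instead of O(n^3).
    """
    keep = np.delete(np.arange(K_inv.shape[0]), i)
    P = K_inv[np.ix_(keep, keep)]
    q = K_inv[keep, i]
    r = K_inv[i, i]
    return P - np.outer(q, q) / r

def loo_values(X, y, X_val, y_val, noise=1e-2):
    """Leave-one-out valuation: utility(full model) - utility(sub-model without i)."""
    n = len(X)
    K_inv = np.linalg.inv(rbf_kernel(X, X) + noise * np.eye(n))
    K_val = rbf_kernel(X_val, X)

    def utility(pred):
        # Hypothetical utility: negative mean squared error on validation data.
        return -float(np.mean((pred - y_val) ** 2))

    full = utility(K_val @ K_inv @ y)
    values = np.empty(n)
    for i in range(n):
        keep = np.delete(np.arange(n), i)
        sub_inv = downdate_inverse(K_inv, i)       # sub-model without retraining
        pred = K_val[:, keep] @ sub_inv @ y[keep]  # GP posterior mean of the sub-model
        values[i] = full - utility(pred)           # aggregation step (leave-one-out)
    return values
```

In this sketch, swapping the `utility` function or replacing the leave-one-out loop with another aggregation (for example, an average over sampled subsets) would yield a different valuation method while reusing the same Gaussian Process machinery.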