LossVal: Efficient Data Valuation for Neural Networks
This work addresses the challenge of data valuation for practitioners dealing with large datasets, though the contribution is incremental because it builds on existing loss function modifications.
The paper tackles the problem of efficiently assessing the importance of individual training samples in neural networks. It introduces LossVal, a method that embeds a self-weighting mechanism into loss functions to compute importance scores during training. This reduces computational costs and effectively identifies noisy samples across classification and regression tasks.
Assessing the importance of individual training samples is a key challenge in machine learning. Traditional approaches retrain models with and without specific samples, which is computationally expensive and ignores dependencies between data points. We introduce LossVal, an efficient data valuation method that computes importance scores during neural network training by embedding a self-weighting mechanism into loss functions such as cross-entropy and mean squared error. LossVal reduces computational costs, making it suitable for large datasets and practical applications. Experiments on classification and regression tasks across multiple datasets show that LossVal effectively identifies noisy samples and can distinguish helpful from harmful samples. We examine the gradient calculation of LossVal to highlight its advantages. The source code is available at: https://github.com/twibiral/LossVal
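
The self-weighting idea in the abstract can be illustrated with a minimal sketch. It assumes one learnable weight per training sample multiplying that sample's squared error; this is a simplification for illustration, not the authors' exact formulation. Any additional terms the published method uses (for example, to stop the weights from trivially shrinking to zero) are omitted, and the linked repository remains the reference implementation. The function names, toy data, and step size below are all hypothetical.

```python
from typing import List, Sequence


def self_weighted_mse(weights: Sequence[float],
                      preds: Sequence[float],
                      targets: Sequence[float]) -> float:
    """Sum of per-sample squared errors, each scaled by its learnable weight."""
    return sum(w * (p - y) ** 2 for w, p, y in zip(weights, preds, targets))


def grad_wrt_weights(preds: Sequence[float],
                     targets: Sequence[float]) -> List[float]:
    """d(loss)/d(w_i) = (p_i - y_i)^2, the sample's own unweighted error."""
    return [(p - y) ** 2 for p, y in zip(preds, targets)]


def grad_wrt_preds(weights: Sequence[float],
                   preds: Sequence[float],
                   targets: Sequence[float]) -> List[float]:
    """d(loss)/d(p_i) = 2 * w_i * (p_i - y_i); low-weight samples pull the model less."""
    return [2.0 * w * (p - y) for w, p, y in zip(weights, preds, targets)]


# Toy data (hypothetical): targets follow y = 2x except the third, which is corrupted.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
targets = [2.0, 4.0, -6.0, 8.0, 10.0]
preds = [2.0 * x for x in xs]      # predictions of an already well-fitted model
weights = [1.0] * len(xs)          # every sample starts equally important

# One gradient-descent step on the weights only (step size is an arbitrary choice).
step_size = 0.005
weight_grads = grad_wrt_weights(preds, targets)
weights = [max(0.0, w - step_size * g) for w, g in zip(weights, weight_grads)]

# The sample with the smallest weight is the candidate noisy sample.
suspect = min(range(len(weights)), key=weights.__getitem__)
print(suspect, weights)
```

In this toy run only the corrupted third sample (index 2) has a nonzero error, so it is the only one whose weight drops. The two gradient functions show the coupling that makes the weights usable as importance scores: a sample's error drives its weight down, and a lower weight in turn shrinks that sample's influence on the model update.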