Robust Data Pruning under Label Noise via Maximizing Re-labeling Accuracy
This work addresses the challenge of reducing computational costs in deep learning by pruning noisy datasets, which is an incremental advance in data pruning for robust learning scenarios.
The paper tackles the problem of data pruning under label noise by proposing Prune4Rel, a novel algorithm that selects a subset to maximize re-labeling accuracy, achieving performance improvements of up to 9.1% over baselines with re-labeling models and up to 21.6% over standard models.
Data pruning, which aims to downsize a large training set into a small informative subset, is crucial for reducing the enormous computational costs of modern deep learning. Though large-scale data collections invariably contain annotation noise and numerous robust learning methods have been developed, data pruning for the noise-robust learning scenario has received little attention. With state-of-the-art Re-labeling methods that self-correct erroneous labels while training, it is challenging to identify which subset induces the most accurate re-labeling of erroneous labels in the entire training set. In this paper, we formalize the problem of data pruning with re-labeling. We first show that the likelihood of a training example being correctly re-labeled is proportional to the prediction confidence of its neighborhood in the subset. Therefore, we propose a novel data pruning algorithm, Prune4Rel, that finds a subset maximizing the total neighborhood confidence of all training examples, thereby maximizing the re-labeling accuracy and generalization performance. Extensive experiments on four real and one synthetic noisy datasets show that \algname{} outperforms the baselines with Re-labeling models by up to 9.1% as well as those with a standard model by up to 21.6%.