Detecting Mislabeled and Corrupted Data via Pointwise Mutual Information
This work addresses data quality, a known bottleneck for deep learning practitioners, but the contribution is incremental because it builds on existing mutual information concepts.
The paper proposes a mutual information-based framework for detecting mislabeled and corrupted data, and shows that, under label corruption, training on high-MI samples improves classification accuracy by up to 15% compared to random sampling.
Deep neural networks can memorize corrupted labels, making data quality critical for model performance, yet real-world datasets are frequently compromised by both label noise and input noise. This paper proposes a mutual information-based framework for data selection under hybrid noise scenarios that quantifies statistical dependencies between inputs and labels. We compute each sample's pointwise contribution to the overall mutual information and find that lower contributions indicate noisy or mislabeled instances. Empirical validation on MNIST with different synthetic noise settings demonstrates that the method effectively filters low-quality samples. Under label corruption, training on high-MI samples improves classification accuracy by up to 15% compared to random sampling. Furthermore, the method exhibits robustness to benign input modifications, preserving semantically valid data while filtering truly corrupted samples.
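To make the idea of a per-sample contribution to mutual information concrete, the following is a minimal sketch and not the paper's estimator, which the abstract does not specify. It assumes each input has already been reduced to a discrete code (for example a hypothetical cluster id) so that the joint distribution can be estimated from counts. The function name `pointwise_mi` and the toy data are illustrative assumptions. Each sample is scored by log p(x, y) / (p(x) p(y)); the average of these scores equals the plug-in mutual information estimate, and the lowest-scoring samples are the candidates for filtering.

```python
import numpy as np


def pointwise_mi(x, y):
    """Score each (x, y) pair by its pointwise mutual information.

    Assumption: x and y are discrete codes, so probabilities can be
    estimated from empirical counts. This is an illustrative plug-in
    estimator, not the estimator used in the paper.
    """
    x = np.asarray(x).ravel()
    y = np.asarray(y).ravel()

    # Map raw values to contiguous integer indices.
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)

    # Empirical joint distribution from co-occurrence counts.
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)
    joint /= joint.sum()

    # Marginals of the joint distribution.
    px = joint.sum(axis=1)
    py = joint.sum(axis=0)

    # Per-sample score: log p(x, y) / (p(x) p(y)).
    return np.log(joint[xi, yi] / (px[xi] * py[yi]))


# Toy data: hypothetical cluster ids and labels, with samples 3 and 7
# disagreeing with the majority label of their cluster.
clusters = [0, 0, 0, 0, 1, 1, 1, 1]
labels = [0, 0, 0, 1, 1, 1, 1, 0]

scores = pointwise_mi(clusters, labels)
suspects = np.argsort(scores)[:2]  # indices of the two lowest-scoring samples
print(scores.round(3), sorted(suspects.tolist()))
```

In this toy case the two samples whose labels disagree with their cluster receive negative scores and are ranked lowest. For raw images such as MNIST, count-based estimation over pixels is not feasible, so a practical pipeline would need a learned or feature-based estimator in place of the counting step.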