LG Jul 20, 2023

Differences Between Hard and Noisy-labeled Samples: An Empirical Study

arXiv:2307.10718v16.64 citations h-index: 56 Has Code

Originality: Incremental advance

AI Analysis

This work addresses a practical issue in machine learning for researchers and practitioners dealing with noisy data, though it is incremental as it builds on existing methods for label noise and hard samples.

The paper tackles the problem of distinguishing between hard-to-learn and incorrectly labeled samples in datasets, introducing a metric that filters noisy labels while retaining hard samples, which improves test accuracy in both synthetic and real-world noisy datasets.

Extracting noisy or incorrectly labeled samples from a labeled dataset with hard/difficult samples is an important yet under-explored topic. Two general and often independent lines of work exist: one focuses on addressing noisy labels, and the other deals with hard samples. However, when both types of data are present, most existing methods treat them equally, which results in a decline in the overall performance of the model. In this paper, we first design various synthetic datasets with custom hardness and noisiness levels for different samples. Our proposed systematic empirical study enables us to better understand the similarities and, more importantly, the differences between hard-to-learn samples and incorrectly labeled samples. These controlled experiments pave the way for the development of methods that distinguish between hard and noisy samples. Through our study, we introduce a simple yet effective metric that filters out noisy-labeled samples while keeping the hard samples. We study various data partitioning methods in the presence of label noise and observe that filtering out noisy samples from hard samples with this proposed metric results in the best datasets, as evidenced by the high test accuracy achieved after models are trained on the filtered datasets. We demonstrate this for both our created synthetic datasets and for datasets with real-world label noise. Furthermore, our proposed data partitioning method significantly outperforms other methods when employed within a semi-supervised learning framework.
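
The abstract mentions synthetic datasets with controlled noisiness but does not say how the labels are corrupted. As a rough illustration only, the sketch below applies symmetric label noise, a common scheme that is assumed here and is not taken from the paper. The function name `inject_symmetric_noise` and its parameters are hypothetical. It flips a chosen fraction of labels to a different class and returns a mask of the corrupted samples, which is the kind of ground truth a controlled study of noisy versus hard samples would need.

```python
import numpy as np

def inject_symmetric_noise(labels, noise_rate, num_classes, seed=0):
    """Flip a fraction `noise_rate` of labels to a different class.

    Hypothetical helper for illustration; the paper's own corruption
    procedure is not described in the abstract.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    noisy = labels.copy()

    # Choose which samples to corrupt, without replacement.
    n_flip = int(round(noise_rate * len(labels)))
    flip_idx = rng.choice(len(labels), size=n_flip, replace=False)

    # An offset in [1, num_classes - 1] guarantees the new label differs.
    offsets = rng.integers(1, num_classes, size=n_flip)
    noisy[flip_idx] = (labels[flip_idx] + offsets) % num_classes

    # Boolean mask marking the corrupted samples.
    is_noisy = np.zeros(len(labels), dtype=bool)
    is_noisy[flip_idx] = True
    return noisy, is_noisy
```

Calling `inject_symmetric_noise(labels, 0.2, 10)` on 1,000 ten-class labels corrupts exactly 200 of them. Keeping the returned mask makes it possible to check afterwards how many truly noisy samples a filtering metric removes and how many clean (possibly hard) samples it keeps.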
