CL AI · May 23, 2023

Handling Realistic Label Noise in BERT Text Classification

arXiv:2305.16337v221.1129 citations

Originality: Incremental advance

AI Analysis

This paper addresses label noise in NLP for practitioners using BERT, but the advance is incremental because it builds on existing noise-handling methods.

The paper tackles the problem of realistic label noise, such as feature-dependent noise and annotator disagreements, in BERT text classification, showing that it significantly degrades performance, and evaluates ensembles and noise-cleaning methods to improve robustness across datasets.

Label noise refers to errors in training labels caused by cheap data annotation methods, such as web scraping or crowd-sourcing, which can be detrimental to the performance of supervised classifiers. Several methods have been proposed to counteract the effect of random label noise in supervised classification, and some studies have shown that BERT is already robust against high rates of randomly injected label noise. However, real label noise is not random; rather, it is often correlated with input features or other annotator-specific factors. In this paper, we evaluate BERT in the presence of two types of realistic label noise: feature-dependent label noise and synthetic label noise from annotator disagreements. We show that the presence of these types of noise significantly degrades BERT classification performance. To improve robustness, we evaluate different types of ensembles and noise-cleaning methods and compare their effectiveness against label noise across different datasets.
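The difference between random and feature-dependent label noise can be illustrated with a minimal sketch. This is not the paper's noise-generation procedure: the function names, the `difficulty` score (a hypothetical per-example value, for instance derived from an auxiliary classifier's uncertainty), and the flipping rule are all assumptions made for illustration only.

```python
import random


def inject_uniform_noise(labels, num_classes, rate, seed=0):
    """Flip a `rate` fraction of labels, chosen uniformly at random,
    to a different class (random label noise)."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = int(round(rate * len(labels)))
    for i in rng.sample(range(len(labels)), n_flip):
        other = [c for c in range(num_classes) if c != labels[i]]
        noisy[i] = rng.choice(other)
    return noisy


def inject_feature_dependent_noise(labels, difficulty, num_classes, rate, seed=0):
    """Flip the `rate` fraction of examples with the highest `difficulty`.

    `difficulty` is a hypothetical per-example score computed from the
    input features, so which labels get corrupted depends on the inputs
    rather than on chance alone.
    """
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = int(round(rate * len(labels)))
    # Indices sorted from hardest to easiest; keep the hardest n_flip.
    hardest = sorted(range(len(labels)), key=lambda i: difficulty[i], reverse=True)[:n_flip]
    for i in hardest:
        other = [c for c in range(num_classes) if c != labels[i]]
        noisy[i] = rng.choice(other)
    return noisy


if __name__ == "__main__":
    clean = [0, 1] * 50
    scores = list(range(100))  # toy difficulty: later examples are "harder"
    uniform = inject_uniform_noise(clean, num_classes=2, rate=0.2)
    dependent = inject_feature_dependent_noise(clean, scores, num_classes=2, rate=0.2)
    print(sum(a != b for a, b in zip(clean, uniform)), "labels flipped at random")
    print(sum(a != b for a, b in zip(clean, dependent)), "labels flipped on hard examples")
```

Both functions corrupt the same number of labels, but in the second the errors concentrate on a feature-defined subset of the data, which is the property that distinguishes realistic noise from randomly injected noise in the abstract above.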
