CL · Mar 7

How Much Noise Can BERT Handle? Insights from Multilingual Sentence Difficulty Detection

arXiv:2603.07346v1 · 1 citation · Has Code
Predicted impact: top 3% in CL · last 90 days
Originality: Incremental advance
AI Analysis

This research addresses the problem of noisy training data degrading language model performance for non-topical classification tasks, particularly for researchers working with crowdsourced annotations and cross-lingual transfer.

The study investigated the impact of various denoising strategies on BERT-based classifiers for sentence-level difficulty detection, using noisy crowdsourced data. For a smaller dataset, GMM-based filtering significantly improved AUC from 0.52 to 0.92, while for a larger dataset, gains were marginal (0.92 to 0.94), indicating BERT's inherent robustness.

Noisy training data can significantly degrade the performance of language-model-based classifiers, particularly in non-topical classification tasks. In this study we designed a methodological framework to assess the impact of denoising. More specifically, we explored a range of denoising strategies for sentence-level difficulty detection, using training data derived from document-level difficulty annotations obtained through noisy crowdsourcing. Beyond monolingual settings, we also address cross-lingual transfer, where a multilingual language model is trained in one language and tested in another. We evaluate several noise reduction techniques, including Gaussian Mixture Models (GMM), Co-Teaching, Noise Transition Matrices, and Label Smoothing. Our results indicate that while BERT-based models exhibit inherent robustness to noise, incorporating explicit noise detection can further enhance performance. For our smaller dataset, GMM-based noise filtering proves particularly effective in improving prediction quality, raising the Area-Under-the-Curve score from 0.52 to 0.92, or to 0.93 when denoising methods are combined. However, for our larger dataset, the intrinsic regularisation of pre-trained language models provides a strong baseline, with denoising methods yielding only marginal gains (from 0.92 to 0.94, while a combination of two denoising methods made no contribution). Nonetheless, removing noisy sentences (about 20% of the dataset) helps in producing a cleaner corpus with fewer infelicities. As a result, we have released the largest multilingual corpus for sentence difficulty prediction: see https://github.com/Nouran-Khallaf/denoising-difficulty
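
To make the idea of GMM-based noise filtering concrete, here is a minimal sketch in Python. The abstract does not say which signal the mixture model is fitted to, so this sketch assumes a common choice from the noisy-label literature: fit a two-component Gaussian mixture to per-sample training losses and keep the samples most likely to belong to the low-loss component. The function names, the 0.5 threshold, and the synthetic losses are all hypothetical; this is not the authors' implementation, which is available in the repository linked above.

```python
import math
import random


def _weighted_density(x, mean, variance, weight):
    """Weighted Gaussian density of x under one mixture component."""
    norm = math.sqrt(2 * math.pi * variance)
    return weight * math.exp(-((x - mean) ** 2) / (2 * variance)) / norm


def fit_two_component_gmm(values, n_iter=100):
    """Fit a 1-D, two-component Gaussian mixture with plain EM."""
    lo, hi = min(values), max(values)
    means = [lo, hi]                       # initialise at the extremes
    spread = (hi - lo) / 4 or 1.0          # fall back if all values are equal
    variances = [spread ** 2, spread ** 2]
    weights = [0.5, 0.5]

    for _ in range(n_iter):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            dens = [_weighted_density(x, means[k], variances[k], weights[k])
                    for k in range(2)]
            total = sum(dens) or 1e-300    # guard against underflow
            resp.append([d / total for d in dens])

        # M-step: re-estimate mean, variance, and weight per component.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-12
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2
                      for r, x in zip(resp, values)) / nk
            variances[k] = max(var, 1e-6)  # floor to avoid a collapsed component
            weights[k] = nk / len(values)

    return means, variances, weights


def filter_noisy(losses, threshold=0.5):
    """Return indices of samples assigned to the low-loss ("clean") component.

    Assumption: low training loss indicates a likely-correct label.
    """
    means, variances, weights = fit_two_component_gmm(losses)
    clean = 0 if means[0] <= means[1] else 1
    kept = []
    for i, x in enumerate(losses):
        dens = [_weighted_density(x, means[k], variances[k], weights[k])
                for k in range(2)]
        total = sum(dens) or 1e-300
        if dens[clean] / total >= threshold:
            kept.append(i)
    return kept


if __name__ == "__main__":
    # Synthetic losses only: 80 low-loss samples and 20 high-loss samples.
    random.seed(0)
    losses = ([random.gauss(0.2, 0.05) for _ in range(80)]
              + [random.gauss(2.0, 0.3) for _ in range(20)])
    kept = filter_noisy(losses)
    print(f"kept {len(kept)} of {len(losses)} samples")
```

In practice the losses would come from a classifier after a short warm-up phase, and the threshold trades corpus size against label cleanliness: a higher value discards more borderline sentences.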

