CL AI LG | Oct 25, 2023

Improving a Named Entity Recognizer Trained on Noisy Data with a Few Clean Instances

arXiv:2310.16790v1 | 2 citations | h-index: 52
Originality: Incremental advance
AI Analysis

This paper addresses the high annotation cost of real-world NER applications, offering a cost-effective way to improve models trained on noisy data, though the contribution is incremental in nature.

The paper tackles the problem of training named entity recognition models on noisy labeled data by proposing a method to denoise it using a small set of clean instances, resulting in consistent performance improvements on public datasets.

To achieve state-of-the-art performance, one still needs to train NER models on large-scale, high-quality annotated data, an asset that is both costly and time-intensive to accumulate. In contrast, real-world applications often resort to massive low-quality labeled data through non-expert annotators via crowdsourcing and external knowledge bases via distant supervision as a cost-effective alternative. However, these annotation methods result in noisy labels, which in turn lead to a notable decline in performance. Hence, we propose to denoise the noisy NER data with guidance from a small set of clean instances. Along with the main NER model we train a discriminator model and use its outputs to recalibrate the sample weights. The discriminator is capable of detecting both span and category errors with different discriminative prompts. Results on public crowdsourcing and distant supervision datasets show that the proposed method can consistently improve performance with a small guidance set.
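
The reweighting idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the exact weighting formula, so the code assumes that the discriminator emits, for each noisy sample, a score in [0, 1] estimating how likely its label is correct, that this score is used directly as the sample weight, and that clean guidance instances always get weight 1. The names `softmax`, `reweighted_nll`, `disc_scores`, and `clean_mask` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reweighted_nll(logits_per_sample, labels, disc_scores, clean_mask):
    """Weighted negative log-likelihood over a batch.

    Assumption (not from the paper): a noisy sample's weight is the
    discriminator score itself; a clean guidance sample has weight 1.0.
    The loss is normalized by the total weight of the batch.
    """
    weights = [
        1.0 if is_clean else score
        for score, is_clean in zip(disc_scores, clean_mask)
    ]
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0  # every sample was rejected; nothing to learn from

    loss = 0.0
    for logits, label, weight in zip(logits_per_sample, labels, weights):
        prob = softmax(logits)[label]
        loss += -weight * math.log(prob)
    return loss / total_weight


# Two samples, both labeled class 0. The model disagrees with the second label.
batch_logits = [[2.0, 0.0], [0.0, 2.0]]
batch_labels = [0, 0]

# Discriminator trusts the first label and rejects the second.
denoised = reweighted_nll(batch_logits, batch_labels, [1.0, 0.0], [False, False])
# Without reweighting, both samples count equally.
uniform = reweighted_nll(batch_logits, batch_labels, [1.0, 1.0], [False, False])
```

In this toy batch, the rejected sample contributes nothing to `denoised`, so the suspicious label no longer pulls the model toward class 0 for the second input, while `uniform` still pays the full penalty for it.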
