CL AI Jan 21, 2021

Validating Label Consistency in NER Data Annotation

arXiv:2101.08698v2661 citations
AI Analysis

This paper addresses label quality issues for NER practitioners, but it is incremental, as it builds on existing validation techniques.

The paper tackles label inconsistency in NER data annotation by presenting an empirical method to explore its relationship with model performance. The method identified label inconsistency in the test data of the SCIERC and CoNLL03 datasets (with 26.7% and 5.4% label mistakes, respectively) and validated consistency in the corrected versions of both.

Data annotation plays a crucial role in ensuring your named entity recognition (NER) projects are trained with the right information to learn from. Producing the most accurate labels is a challenge due to the complexity involved with annotation. Label inconsistency between multiple subsets of data annotation (e.g., training set and test set, or multiple training subsets) is an indicator of label mistakes. In this work, we present an empirical method to explore the relationship between label (in-)consistency and NER model performance. It can be used to validate the label consistency (or catch the inconsistency) in multiple sets of NER data annotation. In experiments, our method identified the label inconsistency of test data in the SCIERC and CoNLL03 datasets (with 26.7% and 5.4% label mistakes, respectively). It validated the consistency in the corrected versions of both datasets.
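
To make the idea of label inconsistency between subsets concrete, here is a minimal sketch of a surface-form check. It is not the paper's method, which relates consistency to NER model performance. It is a simpler heuristic swapped in for illustration: it flags mentions that appear in two annotation subsets with different entity types. The function names, the `(mention, entity_type)` input format, and the toy data are all assumptions.

```python
from collections import defaultdict


def mention_labels(annotations):
    """Map each lowercased mention to the set of entity types assigned to it.

    `annotations` is an iterable of (mention, entity_type) pairs
    (a hypothetical format chosen for this sketch).
    """
    labels = defaultdict(set)
    for mention, entity_type in annotations:
        labels[mention.lower()].add(entity_type)
    return labels


def cross_subset_conflicts(subset_a, subset_b):
    """Return shared mentions whose label sets differ between the two subsets."""
    a = mention_labels(subset_a)
    b = mention_labels(subset_b)
    shared = a.keys() & b.keys()
    return {m: (a[m], b[m]) for m in shared if a[m] != b[m]}


def conflict_rate(subset_a, subset_b):
    """Fraction of shared mentions that are labeled differently across subsets."""
    a = mention_labels(subset_a)
    b = mention_labels(subset_b)
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    return len(cross_subset_conflicts(subset_a, subset_b)) / len(shared)


# Toy, made-up annotations (not taken from SCIERC or CoNLL03).
train = [("Japan", "LOC"), ("Reuters", "ORG"), ("BERT", "Method")]
test = [("japan", "ORG"), ("Reuters", "ORG")]

conflicts = cross_subset_conflicts(train, test)
rate = conflict_rate(train, test)
```

A nonzero conflict rate only suggests where to look. Some mentions are genuinely ambiguous across contexts, so flagged items would still need manual review before being counted as label mistakes.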

