Distantly Supervised Named Entity Recognition via Confidence-Based Multi-Class Positive and Unlabeled Learning
This work addresses incomplete annotations in distantly supervised NER, a problem relevant to researchers and practitioners in natural language processing. The contribution is incremental in that it builds on existing MPU learning frameworks.
The paper tackles named entity recognition under distant supervision, where incomplete external dictionaries often give the training data a high false negative rate. It proposes a confidence-based multi-class positive and unlabeled learning approach that outperforms existing methods on benchmark datasets.
In this paper, we study the named entity recognition (NER) problem under distant supervision. Because external dictionaries and knowledge bases are incomplete, distantly annotated training data usually suffer from a high false negative rate. To address this, we formulate the Distantly Supervised NER (DS-NER) problem via Multi-class Positive and Unlabeled (MPU) learning and propose a theoretically and practically novel CONFidence-based MPU (Conf-MPU) approach. Conf-MPU handles the incomplete annotations in two steps. First, for each token, a confidence score of being an entity token is estimated. Then, the proposed Conf-MPU risk estimation is applied to train a multi-class classifier for the NER task. Thorough experiments on two benchmark datasets labeled using various external knowledge sources demonstrate the superiority of the proposed Conf-MPU over existing DS-NER methods.
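To make the false negative problem concrete, the following minimal sketch labels a toy sentence by exact dictionary lookup and measures how many true entity tokens end up labeled as non-entity ("O"). It is an illustration only, not the annotation procedure or the Conf-MPU estimator used in this work; the function names, the toy dictionary, and the tag set (`PER`, `LOC`, `O`) are all hypothetical.

```python
from typing import Dict, List


def distant_labels(tokens: List[str], dictionary: Dict[str, str]) -> List[str]:
    """Label each token by exact dictionary lookup; unmatched tokens get "O"."""
    return [dictionary.get(tok, "O") for tok in tokens]


def false_negative_rate(gold: List[str], distant: List[str]) -> float:
    """Fraction of gold entity tokens that the distant labels mark as "O"."""
    entity_positions = [i for i, tag in enumerate(gold) if tag != "O"]
    if not entity_positions:
        return 0.0
    missed = sum(1 for i in entity_positions if distant[i] == "O")
    return missed / len(entity_positions)


# Toy data (hypothetical): the dictionary knows "Paris" but not "Alice" or "Zurich".
tokens = ["Alice", "visited", "Paris", "and", "Zurich"]
gold = ["PER", "O", "LOC", "O", "LOC"]
toy_dictionary = {"Paris": "LOC"}

distant = distant_labels(tokens, toy_dictionary)
fnr = false_negative_rate(gold, distant)
print(distant)
print(f"false negative rate: {fnr:.2f}")
```

In this toy case, two of the three entity tokens are missed by the dictionary. Treating such "O" tokens as unlabeled rather than as confirmed negatives is the motivation for a positive and unlabeled formulation.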