LG CL ML

May 1, 2020

Partially-Typed NER Datasets Integration: Connecting Practice to Theory

Shi Zhi, Liyuan Liu, Yu Zhang, Shiyin Wang, Qi Li, Chao Zhang, Jiawei Han

arXiv:2005.00502v12.31 citations

Originality: Incremental advance

AI Analysis

This paper addresses a practical challenge: integrating NER datasets that each annotate only part of the target type set. The advance is incremental, as it builds on existing efforts without introducing a new method.

The paper tackles the problem of training named entity recognition (NER) models using multiple partially-typed datasets instead of fully-typed ones, showing through theoretical bounds and controlled experiments that models can achieve similar performance to those trained with fully-typed annotations.

While typical named entity recognition (NER) models require the training set to be annotated with all target types, each available dataset may cover only a part of them. Instead of relying on fully-typed NER datasets, many efforts have been made to leverage multiple partially-typed ones for training and allow the resulting model to cover the full type set. However, there is neither a guarantee on the quality of integrated datasets nor guidance on the design of training algorithms. Here, we conduct a systematic analysis and comparison between partially-typed NER datasets and fully-typed ones, both theoretically and empirically. First, we derive a bound to establish that models trained with partially-typed annotations can reach performance similar to that of models trained with fully-typed annotations, which also provides guidance on algorithm design. Moreover, we conduct controlled experiments, which show that training on partially-typed datasets leads to performance similar to that of a model trained with the same amount of fully-typed annotations.
