Soft Gazetteers for Low-Resource Named Entity Recognition
This work addresses the scarcity of entity lists for NER in low-resource languages. The contribution is incremental, since it builds on existing cross-lingual methods.
The paper tackles named entity recognition in low-resource languages by proposing "soft gazetteers" that draw on English knowledge bases through cross-lingual entity linking, yielding an average improvement of 4 F1 points across four languages.
Traditional named entity recognition models use gazetteers (lists of entities) as features to improve performance. Although modern neural network models do not require such hand-crafted features for strong performance, recent work has demonstrated their utility for named entity recognition on English data. However, designing such features for low-resource languages is challenging, because exhaustive entity gazetteers do not exist in these languages. To address this problem, we propose a method of "soft gazetteers" that incorporates ubiquitously available information from English knowledge bases, such as Wikipedia, into neural named entity recognition models through cross-lingual entity linking. Our experiments on four low-resource languages show an average improvement of 4 points in F1 score. Code and data are available at https://github.com/neulab/soft-gazetteers.
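To make the idea concrete, the following is a minimal sketch of how entity-linking output might be turned into per-token soft gazetteer features. It is an illustration under stated assumptions, not the authors' implementation: the function name `soft_gazetteer_features`, the tag set `ENTITY_TYPES`, the shape of the candidate lists, and the choice of "highest candidate score per entity type, split into span-initial and span-internal positions" are all hypothetical. The candidate scores stand in for whatever a cross-lingual entity linker would return for each span when matched against an English knowledge base.

```python
from collections import defaultdict

# Assumed tag set; the source does not specify one.
ENTITY_TYPES = ["PER", "ORG", "LOC"]


def soft_gazetteer_features(tokens, span_candidates):
    """Build one feature vector per token from entity-linking candidates.

    span_candidates maps a (start, end) token span (end exclusive) to a list
    of (kb_type, score) pairs, as a hypothetical cross-lingual entity linker
    might return. Each token gets 2 * len(ENTITY_TYPES) features: for every
    type, one slot for "first token of a span" and one for "later token".
    """
    dim = 2 * len(ENTITY_TYPES)
    feats = [[0.0] * dim for _ in tokens]

    for (start, end), candidates in span_candidates.items():
        # Highest linker score seen for each knowledge-base type in this span.
        best = defaultdict(float)
        for kb_type, score in candidates:
            best[kb_type] = max(best[kb_type], score)

        for i in range(start, end):
            offset = 0 if i == start else 1  # span-initial vs. span-internal
            for t, etype in enumerate(ENTITY_TYPES):
                idx = 2 * t + offset
                # Overlapping spans: keep the strongest evidence per slot.
                feats[i][idx] = max(feats[i][idx], best.get(etype, 0.0))
    return feats


# Made-up candidates for illustration only.
tokens = ["Obama", "visited", "New", "York"]
span_candidates = {
    (0, 1): [("PER", 0.9), ("LOC", 0.1)],
    (2, 4): [("LOC", 0.8), ("ORG", 0.3)],
    (3, 4): [("LOC", 0.5)],
}
features = soft_gazetteer_features(tokens, span_candidates)
```

Unlike a binary gazetteer lookup, these features are continuous scores, so a partial or uncertain match in the knowledge base still contributes evidence. In a neural tagger, vectors like these would typically be concatenated with the word representations before the sequence encoder; that integration step is likewise an assumption here.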