Resource-Size matters: Improving Neural Named Entity Recognition with Optimized Large Corpora
This work addresses the challenge of enhancing named entity recognition in low-resource language settings. The contribution appears incremental, since it focuses on data optimization rather than on novel architectural changes.
The study improves neural named entity recognition for a low-resource language, with German as the example, by optimizing and morphologically processing large corpora before training. This yields an F-score gain of up to 11% and new state-of-the-art performance on open-source datasets.
This study improves the performance of neural named entity recognition by up to 11% in F-score, using German as an example of a low-resource language, thereby outperforming existing baselines and establishing a new state of the art on every open-source dataset considered. Rather than designing deeper and wider hybrid neural architectures, we gather all available resources and apply a detailed optimization together with grammar-dependent morphological processing, consisting of lemmatization and part-of-speech tagging, before exposing the raw data to any training process. We test our approach in a threefold monolingual experimental setup of a) single, b) joint, and c) optimized training, and shed light on how downstream tasks depend on the size of the corpora used to compute word embeddings.
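To make the preprocessing step concrete, the following is a minimal sketch of lemmatization and part-of-speech tagging applied to a tokenized sentence before any embeddings are computed. It is an illustration under stated assumptions, not the pipeline used in this study: the lexicon `TOY_LEXICON`, the functions `annotate` and `to_embedding_tokens`, the tag names, and the `lemma_POS` output format are all hypothetical, and a real setup would rely on a trained German lemmatizer and tagger rather than a lookup table.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon (assumption): lowercased surface form -> (lemma, POS tag).
# A real pipeline would use a trained German lemmatizer and POS tagger instead.
TOY_LEXICON: Dict[str, Tuple[str, str]] = {
    "die": ("der", "DET"),
    "häuser": ("Haus", "NOUN"),
    "stehen": ("stehen", "VERB"),
    "steht": ("stehen", "VERB"),
    "in": ("in", "ADP"),
}


def annotate(tokens: List[str]) -> List[Tuple[str, str, str]]:
    """Return (surface, lemma, pos) triples for a tokenized sentence.

    Words missing from the lexicon keep their surface form as the lemma
    and receive the placeholder tag "X".
    """
    annotated = []
    for token in tokens:
        lemma, pos = TOY_LEXICON.get(token.lower(), (token, "X"))
        annotated.append((token, lemma, pos))
    return annotated


def to_embedding_tokens(tokens: List[str]) -> List[str]:
    """Build the tokens an embedding model would be trained on.

    Joining lemma and tag as "lemma_POS" is an assumption made for this
    sketch; the abstract does not specify how the two are combined.
    """
    return [f"{lemma}_{pos}" for _, lemma, pos in annotate(tokens)]


if __name__ == "__main__":
    sentence = ["Die", "Häuser", "stehen", "in", "Berlin"]
    print(to_embedding_tokens(sentence))
```

Mapping inflected forms such as "steht" and "stehen" to one lemma reduces the vocabulary the embedding model has to cover, which is one plausible reason such processing could help on morphologically rich languages.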