A Multi-way Parallel Named Entity Annotated Corpus for English, Tamil and Sinhala
This work addresses the shortage of annotated data for low-resource languages such as Sinhala and Tamil, enabling improved NER and NMT applications. The contribution is incremental, however, because it applies existing methods to new data.
The paper tackles the lack of named entity annotated resources for low-resource languages by creating a multi-way parallel English-Tamil-Sinhala corpus, establishing new benchmark NER results for Sinhala and Tamil using pre-trained multilingual language models, and demonstrating utility in a low-resource neural machine translation task.
This paper presents a multi-way parallel English-Tamil-Sinhala corpus annotated with Named Entities (NEs), where Sinhala and Tamil are low-resource languages. Using pre-trained multilingual Language Models (mLMs), we establish new benchmark Named Entity Recognition (NER) results on this dataset for Sinhala and Tamil. We also carry out a detailed investigation of the NER capabilities of different types of mLMs. Finally, we demonstrate the utility of our NER system on a low-resource Neural Machine Translation (NMT) task. Our dataset is publicly available at https://github.com/suralk/multiNER.