CL Mar 28, 2024

EthioMT: Parallel Corpus for Low-resource Ethiopian Languages

Atnafu Lambebo Tonja, Olga Kolesnikova, Alexander Gelbukh, Jugal Kalita

arXiv:2403.19365v123.679 citations h-index: 21 RAIL

Originality Synthesis-oriented

AI Analysis

This work addresses the lack of publicly accessible datasets for NLP tasks in Ethiopian languages and supports research in this low-resource domain, though the contribution is incremental because it builds on existing methods.

The authors tackle low-resource machine translation for Ethiopian languages by introducing EthioMT, a new parallel corpus for 15 languages, and by collecting a benchmark dataset for the better-researched languages of Ethiopia. They evaluate the new corpus and the benchmark dataset for 23 Ethiopian languages using transformer and fine-tuning approaches.

Recent research in natural language processing (NLP) has achieved impressive performance in tasks such as machine translation (MT), news classification, and question-answering in high-resource languages. However, the performance of MT leaves much to be desired for low-resource languages. This is due to the smaller size of available parallel corpora in these languages, if such corpora are available at all. NLP in Ethiopian languages suffers from the same issues due to the unavailability of publicly accessible datasets for NLP tasks, including MT. To help the research community and foster research for Ethiopian languages, we introduce EthioMT -- a new parallel corpus for 15 languages. We also create a new benchmark by collecting a dataset for better-researched languages in Ethiopia. We evaluate the newly collected corpus and the benchmark dataset for 23 Ethiopian languages using transformer and fine-tuning approaches.
