CL AI Apr 8, 2021

Extended Parallel Corpus for Amharic-English Machine Translation

arXiv:2104.03543v3585 citations
AI Analysis

This work addresses the scarcity of resources for Amharic-English machine translation by providing a new dataset and baseline models for researchers. The contribution is incremental, since it applies existing methods to new data.

The paper tackles machine translation for the low-resource language Amharic by creating and releasing an extended parallel corpus. It shows that neural models outperform statistical models by approximately 6-7 BLEU points, and that subword-based neural models outperform word-based ones by a further 3-4 BLEU points.

This paper describes the acquisition, preprocessing, segmentation, and alignment of an Amharic-English parallel corpus. It will be helpful for machine translation of a low-resource language, Amharic. We freely released the corpus for research purposes. Furthermore, we developed baseline statistical and neural machine translation systems by training models on the corpus. In the experiments, we also used a large monolingual corpus for the language model of statistical machine translation and for back-translation in neural machine translation. In the automatic evaluation, neural machine translation models outperform statistical machine translation models by approximately six to seven Bilingual Evaluation Understudy (BLEU) points. In addition, among the neural machine translation models, the subword models outperform the word-based models by three to four BLEU points. Moreover, two other relevant automatic evaluation metrics, Translation Edit Rate on Character Level (CharacTER) and Better Evaluation as Ranking (BEER), reflect corresponding differences among the trained models.
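
To make the "BLEU points" in the abstract concrete, here is a minimal sketch of a corpus-level BLEU score in Python. It is a simplified illustration, not the evaluation setup used in the paper: it assumes whitespace tokenization, one reference per hypothesis, n-grams up to order 4, and no smoothing. The function names `ngrams` and `corpus_bleu` are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale.

    Assumptions (not taken from the paper): whitespace tokenization,
    exactly one reference per hypothesis, and no smoothing.
    """
    matched = [0] * max_n  # clipped n-gram matches, per n-gram order
    total = [0] * max_n    # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Counter intersection clips each hypothesis n-gram count
            # by its count in the reference.
            matched[n - 1] += sum((hyp_counts & ref_counts).values())
            total[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any n-gram order with zero matches gives BLEU 0.
    if hyp_len == 0 or min(matched) == 0:
        return 0.0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty: penalize output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    reference = ["the cat sat on the mat"]
    print(corpus_bleu(["the cat sat on a mat"], reference))
    print(corpus_bleu(["the cat sat on"], reference))
```

A difference of "six to seven BLEU points" between two systems refers to a gap of that size on this 0-100 scale when both are scored against the same references. Published scores also depend on tokenization and other settings, so values from this sketch are not comparable with the numbers reported in the paper.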
