CL | Oct 27, 2022

The Effect of Normalization for Bi-directional Amharic-English Neural Machine Translation

Tadesse Destaw Belay, Atnafu Lambebo Tonja, Olga Kolesnikova, Seid Muhie Yimam, Abinew Ali Ayele, Silesh Bogale Haile, Grigori Sidorov, Alexander Gelbukh

arXiv:2210.15224

AI Analysis

This work addresses machine translation for Amharic, a low-resource language, by providing a new dataset and showing improvements with text normalization, though it is incremental as it builds on existing models.

The authors address the lack of large-scale parallel data for Amharic-English translation by compiling a dataset and fine-tuning a pre-trained model. The resulting models reach BLEU scores of 37.79 (Amharic-English) and 32.74 (English-Amharic), and normalizing Amharic homophones improved performance in both directions.

Machine translation (MT) is one of the main tasks in natural language processing; its objective is to translate text automatically from one natural language to another. Using deep neural networks for MT has recently received great attention. These networks require large amounts of data to learn abstract representations of the input and store them in continuous vectors. This paper presents the first relatively large-scale Amharic-English parallel sentence dataset. Using these compiled data, we build bi-directional Amharic-English translation models by fine-tuning the existing Facebook M2M100 pre-trained model, achieving a BLEU score of 37.79 for Amharic-English and 32.74 for English-Amharic translation. Additionally, we explore the effects of Amharic homophone normalization on the machine translation task. The results show that normalizing Amharic homophone characters improves Amharic-English machine translation performance in both directions.
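
To make the idea of homophone normalization concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not list the character mapping they used, so the table below is an assumption based on the commonly cited Amharic homophone series (ሐ and ኀ folded into ሀ, ሠ into ሰ, ዐ into አ, ፀ into ጸ). The names HOMOPHONE_SERIES and normalize_homophones are hypothetical, chosen only for illustration.

```python
# Assumed mapping (not taken from the paper): each pair gives the first code
# point of a variant Ethiopic series and of the series it is folded into.
# Each series stores its 7 vowel orders at consecutive code points.
HOMOPHONE_SERIES = [
    (0x1210, 0x1200),  # ሐ-series -> ሀ-series
    (0x1280, 0x1200),  # ኀ-series -> ሀ-series
    (0x1220, 0x1230),  # ሠ-series -> ሰ-series
    (0x12D0, 0x12A0),  # ዐ-series -> አ-series
    (0x1340, 0x1338),  # ፀ-series -> ጸ-series
]


def build_table():
    """Expand the series pairs into a per-character translation table."""
    table = {}
    for variant_start, canonical_start in HOMOPHONE_SERIES:
        for order in range(7):  # the 7 basic vowel orders
            table[variant_start + order] = canonical_start + order
    return table


_TABLE = build_table()


def normalize_homophones(text: str) -> str:
    """Replace variant homophone characters with one canonical form.

    Characters outside the table (other Ethiopic letters, Latin text,
    punctuation) pass through unchanged.
    """
    return text.translate(_TABLE)


if __name__ == "__main__":
    # Two spellings of the same word collapse to a single form.
    print(normalize_homophones("ሠላም"), normalize_homophones("ሰላም"))
```

Applied to the Amharic side of a parallel corpus before training, a step like this collapses spelling variants of the same word into one surface form, which shrinks the vocabulary the model has to learn. The sketch covers only the seven basic orders of each series; labialized forms and other variant pairs would need extra entries.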
