CL · May 27, 2021

Extremely low-resource machine translation for closely related languages

arXiv:2105.13065v1 · 31.77 · 26 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses machine translation for low-resource languages of the Uralic family. It appears incremental, since it builds on established techniques such as back-translation and multilingual training.

The researchers tackled extremely low-resource machine translation for closely related Uralic languages by combining multilingual training with synthetic corpora produced by back-translation. Translation quality improved in every language pair for which they had data, and the paper presents the first neural machine translation results for Võro, North Saami and South Saami.

An effective method to improve extremely low-resource neural machine translation is multilingual training, which can be improved by leveraging monolingual data to create synthetic bilingual corpora using the back-translation method. This work focuses on closely related languages from the Uralic language family: from Estonian and Finnish geographical regions. We find that multilingual learning and synthetic corpora increase the translation quality in every language pair for which we have data. We show that transfer learning and fine-tuning are very effective for doing low-resource machine translation and achieve the best results. We collected new parallel data for Võro, North and South Saami and present first results of neural machine translation for these languages.
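The back-translation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: `toy_reverse_model` is a hypothetical stand-in for a trained target-to-source translation model, the sentences are placeholders rather than corpus data, and the `<2fi>` target-language token is an assumed convention for multilingual training that the abstract does not specify.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def back_translate(target_monolingual: List[str],
                   reverse_model: Callable[[str], str]) -> List[Pair]:
    """Pair each real target sentence with a machine-generated source sentence."""
    return [(reverse_model(t), t) for t in target_monolingual]


def tag_for_multilingual(pairs: List[Pair], target_lang: str) -> List[Pair]:
    """Prefix each source with a target-language token (assumed convention)."""
    return [(f"<2{target_lang}> {src}", tgt) for src, tgt in pairs]


def toy_reverse_model(sentence: str) -> str:
    """Hypothetical placeholder; a real system would call a trained NMT model."""
    return sentence.upper()


# Placeholder data, not taken from the paper's corpora.
real_pairs = [("tere", "hei")]
monolingual_target = ["hyvää päivää", "kiitos"]

# Synthetic pairs: machine-made source side, human-written target side.
synthetic = back_translate(monolingual_target, toy_reverse_model)

# Real and synthetic pairs are mixed, then tagged for a shared multilingual model.
training_data = tag_for_multilingual(real_pairs + synthetic, "fi")
```

The point of the design is that the target side of every synthetic pair is genuine text, so the model learns to produce fluent output even when the machine-generated source side is noisy.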
