Low-Resource Neural Machine Translation for Southern African Languages
This work addresses the problem of limited translation data for low-resource African languages. Its contribution is incremental, in that it applies existing methods to new language pairs.
The paper tackles low-resource neural machine translation for Southern African Bantu languages by comparing zero-shot, transfer, and multilingual learning. It shows that multilingual learning achieves the best results, with BLEU score improvements of up to 9.9 over the baseline and more than 10 over the previous state of the art (SOTA).
Low-resource African languages have not fully benefited from the progress in neural machine translation because of a lack of data. Motivated by this challenge, we compare zero-shot learning, transfer learning and multilingual learning on three Bantu languages (Shona, isiXhosa and isiZulu) and English. Our main target is English-to-isiZulu translation, for which we have just 30,000 sentence pairs, 28% of the average size of our other corpora. We show the importance of language similarity for English-to-isiZulu transfer learning by comparing English-to-isiXhosa and English-to-Shona parent models, whose BLEU scores differ by 5.2. We then demonstrate that multilingual learning surpasses both transfer learning and zero-shot learning on our dataset: relative to the baseline English-to-isiZulu model, the BLEU score improves by 9.9 with multilingual learning, 6.1 with transfer learning and 2.0 with zero-shot learning. Our best model also improves on the previous SOTA BLEU score by more than 10.
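The figures quoted in the abstract imply a rough size for the other corpora and a ranking of the three approaches. The short Python sketch below works through that arithmetic. The function and constant names are illustrative, not taken from the paper. Because the 28% figure is rounded, the implied average is only approximate. The sketch also assumes that the three BLEU gains are listed in the order multilingual, transfer, zero-shot.

```python
# Figures taken from the abstract; all names here are illustrative only.
ZULU_PAIRS = 30_000   # English-to-isiZulu sentence pairs
ZULU_SHARE = 0.28     # stated fraction of the average size of the other corpora

# BLEU gains over the baseline English-to-isiZulu model.
# Assumption: the abstract lists them in the order multilingual, transfer, zero-shot.
BLEU_GAIN = {"multilingual": 9.9, "transfer": 6.1, "zero-shot": 2.0}


def implied_average_corpus_size(pairs: int, share: float) -> int:
    """Back out the average size of the other corpora from the stated ratio."""
    return round(pairs / share)


def rank_by_gain(gains: dict[str, float]) -> list[str]:
    """Order the methods from largest to smallest BLEU gain."""
    return sorted(gains, key=gains.get, reverse=True)


if __name__ == "__main__":
    # Approximate, since 28% is a rounded figure.
    print(implied_average_corpus_size(ZULU_PAIRS, ZULU_SHARE))
    print(rank_by_gain(BLEU_GAIN))
```

Under these assumptions, the other corpora average a little over 100,000 sentence pairs each, which puts the size of the English-to-isiZulu corpus in context.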