CLSep 16, 2020

Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT

Alexandra Chronopoulou, Dario Stojanovski, Alexander Fraser

arXiv:2009.07610v331.21005 citationsHas Code

Originality Incremental advance

AI Analysis

This addresses the challenge of building translation systems for low-resource languages, though it is incremental as it builds on existing pretraining and fine-tuning methods.

The paper tackled the problem of poor translation quality in unsupervised neural machine translation when one language has limited monolingual data, by reusing a pretrained language model from a high-resource language and extending its vocabulary, resulting in over +8.3 BLEU point improvements in English-Macedonian and English-Albanian translations.

Using a language model (LM) pretrained on two languages with large monolingual data in order to initialize an unsupervised neural machine translation (UNMT) system yields state-of-the-art results. When limited data is available for one language, however, this method leads to poor translations. We present an effective approach that reuses an LM that is pretrained only on the high-resource language. The monolingual LM is fine-tuned on both languages and is then used to initialize a UNMT model. To reuse the pretrained LM, we have to modify its predefined vocabulary, to account for the new language. We therefore propose a novel vocabulary extension method. Our approach, RE-LM, outperforms a competitive cross-lingual pretraining model (XLM) in English-Macedonian (En-Mk) and English-Albanian (En-Sq), yielding more than +8.3 BLEU points for all four translation directions.

View on arXiv PDF Code

Similar