CL · Apr 30, 2020

Simulated Multiple Reference Training Improves Low-Resource Machine Translation

Huda Khayrallah, Brian Thompson, Matt Post, Philipp Koehn

arXiv:2004.14524v21004 citations

AI Analysis

The paper addresses data scarcity in low-resource machine translation. The contribution is incremental: it works alongside existing methods such as back-translation rather than replacing them.

The paper tackles data sparsity in low-resource machine translation by introducing Simulated Multiple Reference Training (SMRT), which approximates multiple reference translations using a paraphraser and yields improvements of 1.2 to 7.0 BLEU when translating into English.

Many valid translations exist for a given sentence, yet machine translation (MT) is trained with a single reference translation, exacerbating data sparsity in low-resource settings. We introduce Simulated Multiple Reference Training (SMRT), a novel MT training method that approximates the full space of possible translations by sampling a paraphrase of the reference sentence from a paraphraser and training the MT model to predict the paraphraser's distribution over possible tokens. We demonstrate the effectiveness of SMRT in low-resource settings when translating to English, with improvements of 1.2 to 7.0 BLEU. We also find SMRT is complementary to back-translation.
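The training objective described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `paraphraser_step` and `mt_step` are hypothetical callables standing in for a paraphraser (conditioned on the reference) and an MT model (conditioned on the source), each taking the target prefix generated so far. The sketch assumes the loss is a per-token cross-entropy against the paraphraser's full distribution, and that the next prefix token is sampled from that same distribution.

```python
import math
import random


def softmax(logits):
    """Convert raw scores to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(probs, rng):
    """Sample one token index from a probability distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def soft_cross_entropy(teacher_probs, student_logits):
    """Cross-entropy of the student's distribution against soft teacher targets."""
    student_probs = softmax(student_logits)
    return -sum(
        p * math.log(q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )


def smrt_style_loss(paraphraser_step, mt_step, max_len, eos_id, rng):
    """Average soft-target loss along one sampled paraphrase.

    paraphraser_step(prefix) -> list of probabilities over the vocabulary
    mt_step(prefix)          -> list of logits over the vocabulary
    Both callables are hypothetical stand-ins (assumptions of this sketch).
    """
    prefix = []
    total = 0.0
    for _ in range(max_len):
        teacher_probs = paraphraser_step(prefix)   # paraphraser's token distribution
        student_logits = mt_step(prefix)           # MT model's scores for the same step
        total += soft_cross_entropy(teacher_probs, student_logits)
        token = sample_token(teacher_probs, rng)   # sample the paraphrase token
        prefix.append(token)
        if token == eos_id:
            break
    return total / len(prefix), prefix
```

In a real system the two callables would wrap neural models and the loss would be backpropagated through the MT model only. The point of the sketch is that the training target at each step is a distribution over tokens rather than a single reference token.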
