CL Jan 29, 2021

Synthesizing Monolingual Data for Neural Machine Translation

arXiv:2101.12462v10.53 citations

Originality: Incremental advance

AI Analysis

This paper addresses data scarcity for low-resource language pairs and domains in machine translation. The advance is incremental, since the method builds on existing back-translation techniques.

The paper tackles the problem of limited monolingual data for neural machine translation by proposing a method to generate large synthetic parallel data from very small in-domain monolingual data, showing effectiveness in improving NMT across three language pairs and five domains.

In neural machine translation (NMT), monolingual data in the target language are usually exploited through a method called "back-translation" to synthesize additional parallel training data. The synthetic data have been shown to help train better NMT systems, especially for low-resource language pairs and domains. Nonetheless, large monolingual data in the target domains or languages are not always available for generating large synthetic parallel data. In this work, we propose a new method to generate large synthetic parallel data from very small monolingual data in a specific domain. We fine-tune a pre-trained GPT-2 model on such small in-domain monolingual data and use the resulting model to generate a large amount of synthetic in-domain monolingual data. Then, we perform back-translation, or forward translation, to generate synthetic in-domain parallel data. Our preliminary experiments on three language pairs and five domains show that our method generates fully synthetic but useful in-domain parallel data that improve NMT in all configurations. We also show promising results in extreme adaptation for personalized NMT.
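The two-stage pipeline in the abstract (generate synthetic monolingual text, then translate it to form sentence pairs) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: `synthesize_parallel_data`, `toy_generate`, and `toy_translate` are hypothetical names, the generator callable stands in for sampling from a GPT-2 model that has already been fine-tuned on the small in-domain data, and the translator callable stands in for an existing NMT system.

```python
from typing import Callable, List, Tuple


def synthesize_parallel_data(
    generate: Callable[[int], List[str]],
    translate: Callable[[str], str],
    num_sentences: int,
    direction: str = "back",
) -> List[Tuple[str, str]]:
    """Build synthetic (source, target) pairs from generated monolingual text.

    generate:  hypothetical stand-in for sampling from a fine-tuned language model.
    translate: hypothetical stand-in for an NMT system. For direction="back" it
               should translate target -> source; for "forward", source -> target.
    """
    if direction not in ("back", "forward"):
        raise ValueError("direction must be 'back' or 'forward'")

    # Stage 1: synthetic in-domain monolingual data (empty samples are dropped,
    # which is a choice made for this sketch only).
    monolingual = [s.strip() for s in generate(num_sentences) if s.strip()]

    # Stage 2: translate each generated sentence to obtain a sentence pair.
    pairs = []
    for sentence in monolingual:
        translated = translate(sentence)
        if direction == "back":
            # Generated text is on the target side; its translation is the source.
            pairs.append((translated, sentence))
        else:
            # Generated text is on the source side; its translation is the target.
            pairs.append((sentence, translated))
    return pairs


# Toy stand-ins so the sketch runs without any model (assumptions, not real systems).
def toy_generate(n: int) -> List[str]:
    return [f"synthetic sentence {i}" for i in range(n)]


def toy_translate(sentence: str) -> str:
    return sentence.upper()


back_pairs = synthesize_parallel_data(toy_generate, toy_translate, 3, "back")
forward_pairs = synthesize_parallel_data(toy_generate, toy_translate, 3, "forward")
```

In practice the two callables would wrap a fine-tuned language model and a trained translation model. Keeping them as plain functions keeps the back-translation versus forward-translation distinction visible: only the side of the pair on which the generated sentence lands changes.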

