CL · Jan 29, 2021

Synthesizing Monolingual Data for Neural Machine Translation

arXiv:2101.12462v1 · 3 citations
Originality: Incremental advance
AI Analysis

This work addresses data scarcity for low-resource language pairs and domains in machine translation. The advance is incremental because it builds on existing back-translation techniques.

The paper tackles the shortage of in-domain monolingual data for neural machine translation. It proposes a method to generate large synthetic parallel data from very small in-domain monolingual data and reports improved NMT across three language pairs and five domains.

In neural machine translation (NMT), monolingual data in the target language are usually exploited through a method called "back-translation" to synthesize additional parallel training data. Such synthetic data have been shown to help train better NMT systems, especially for low-resource language pairs and domains. Nonetheless, large monolingual data in the target domains or languages are not always available to generate large synthetic parallel data. In this work, we propose a new method to generate large synthetic parallel data by leveraging very small monolingual data in a specific domain. We fine-tune a pre-trained GPT-2 model on such small in-domain monolingual data and use the resulting model to generate a large amount of synthetic in-domain monolingual data. Then, we perform back-translation, or forward translation, to generate synthetic in-domain parallel data. Our preliminary experiments on three language pairs and five domains show the effectiveness of our method in generating fully synthetic but useful in-domain parallel data for improving NMT in all configurations. We also show promising results in extreme adaptation for personalized NMT.
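
The data flow described in the abstract (generate in-domain monolingual text, then translate it to form sentence pairs) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `synthesize_parallel`, `toy_generate`, and `toy_translate` are hypothetical names, the toy functions merely stand in for a fine-tuned GPT-2 sampler and an NMT system, the example sentences are invented, and the duplicate filtering is an added assumption that the abstract does not mention.

```python
from typing import Callable, Iterable, List, Tuple


def synthesize_parallel(
    generate: Callable[[int], Iterable[str]],
    translate: Callable[[str], str],
    n_sentences: int,
    direction: str = "back",
) -> List[Tuple[str, str]]:
    """Build synthetic (source, target) pairs from generated monolingual text.

    direction="back": generated text is treated as target-language text and
    the translator (target -> source) supplies the synthetic source side.
    direction="forward": generated text is treated as source-language text and
    the translator (source -> target) supplies the synthetic target side.
    """
    if direction not in ("back", "forward"):
        raise ValueError("direction must be 'back' or 'forward'")

    pairs: List[Tuple[str, str]] = []
    seen = set()
    for sentence in generate(n_sentences):
        sentence = sentence.strip()
        # Assumption: skip empty and repeated samples, since a sampler
        # fine-tuned on very little data may repeat itself.
        if not sentence or sentence in seen:
            continue
        seen.add(sentence)
        translation = translate(sentence)
        if direction == "back":
            pairs.append((translation, sentence))
        else:
            pairs.append((sentence, translation))
    return pairs


def toy_generate(n: int) -> List[str]:
    # Hypothetical stand-in for sampling from a GPT-2 model fine-tuned on
    # a small in-domain corpus; it cycles through fixed sentences.
    samples = [
        "the patient reports mild pain",
        "the dose was increased",
        "the patient reports mild pain",
    ]
    return [samples[i % len(samples)] for i in range(n)]


def toy_translate(sentence: str) -> str:
    # Hypothetical stand-in for an NMT system; it only tags the input.
    return "<mt> " + sentence


back_pairs = synthesize_parallel(toy_generate, toy_translate, 3, direction="back")
forward_pairs = synthesize_parallel(toy_generate, toy_translate, 3, direction="forward")
```

Passing the generator and translator as callables keeps the sketch independent of any particular model library. In the back-translation direction the machine-translated text ends up on the source side, so the target side of each pair is the generated in-domain sentence.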

