CL · Mar 27, 2019

Using Monolingual Data in Neural Machine Translation: a Systematic Study

arXiv:1903.11437v11127 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of efficiently using monolingual data in neural machine translation, offering incremental improvements for developers.

The paper systematically studies back-translation in neural machine translation, confirming its effectiveness and introducing cheaper data simulation techniques that are nearly as effective.

Neural Machine Translation (MT) has radically changed the way systems are developed. A major difference with the previous generation (Phrase-Based MT) is the way monolingual target data, which often abounds, is used in these two paradigms. While Phrase-Based MT can seamlessly integrate very large language models trained on billions of sentences, the best option for Neural MT developers seems to be the generation of artificial parallel data through back-translation, a technique that fails to fully take advantage of existing datasets. In this paper, we conduct a systematic study of back-translation, comparing alternative uses of monolingual data, as well as multiple data generation procedures. Our findings confirm that back-translation is very effective and give new explanations as to why this is the case. We also introduce new data simulation techniques that are almost as effective, yet much cheaper to implement.
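
For readers unfamiliar with back-translation, the following is a minimal sketch of the general idea only, not the authors' pipeline. Monolingual target-side sentences are passed through a target-to-source system, and each output is paired with its original sentence to form artificial parallel data. The names `back_translate` and `toy_reverse_translate`, and the tiny word lexicon, are assumptions made for illustration; in practice the reverse system would be a trained translation model.

```python
from typing import Callable, List, Tuple

# A sentence pair is (source sentence, target sentence).
SentencePair = Tuple[str, str]


def back_translate(
    target_sentences: List[str],
    reverse_translate: Callable[[str], str],
) -> List[SentencePair]:
    """Build artificial parallel data from monolingual target text.

    Each genuine target sentence is translated back into the source
    language; the synthetic source is paired with the real target.
    """
    pairs = []
    for tgt in target_sentences:
        synthetic_src = reverse_translate(tgt)
        pairs.append((synthetic_src, tgt))
    return pairs


def toy_reverse_translate(sentence: str) -> str:
    # Hypothetical stand-in for a trained target-to-source model:
    # a word-by-word lookup that leaves unknown words unchanged.
    lexicon = {"le": "the", "chat": "cat", "dort": "sleeps"}
    return " ".join(lexicon.get(word, word) for word in sentence.split())


# Genuine parallel data (source = English, target = French).
real_parallel = [("the dog barks", "le chien aboie")]

# Monolingual target-side data, which is usually far more plentiful.
monolingual_target = ["le chat dort"]

synthetic = back_translate(monolingual_target, toy_reverse_translate)

# The forward (source-to-target) system is then trained on the mixture.
training_data = real_parallel + synthetic
```

In this arrangement the target side of every synthetic pair is genuine text and only the source side is machine-generated, so the decoder still learns from natural target sentences.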

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
