CL · Aug 28, 2018

Understanding Back-Translation at Scale

arXiv:1808.09381v2 · 1584 citations
Originality: Incremental advance
AI Analysis

This work addresses how to improve machine translation quality by making effective use of monolingual data. It is an incremental advance in that it builds on existing back-translation techniques.

The paper investigates methods for generating synthetic source sentences via back-translation to improve neural machine translation. It finds that sampling or noised beam outputs are most effective in all but resource-poor settings, and it achieves a new state of the art of 35 BLEU on the WMT'14 English-German test set.

An effective method to improve neural machine translation with monolingual data is to augment the parallel training corpus with back-translations of target language sentences. This work broadens the understanding of back-translation and investigates a number of methods to generate synthetic source sentences. We find that in all but resource poor settings back-translations obtained via sampling or noised beam outputs are most effective. Our analysis shows that sampling or noisy synthetic data gives a much stronger training signal than data generated by beam or greedy search. We also compare how synthetic data compares to genuine bitext and study various domain effects. Finally, we scale to hundreds of millions of monolingual sentences and achieve a new state of the art of 35 BLEU on the WMT'14 English-German test set.
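
The augmentation described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `backward_translate` is a hypothetical stand-in for a trained target-to-source model, `FILLER` is a made-up placeholder token, and the noise settings (word dropout 0.1, filler replacement 0.1, local shuffling within 3 positions) are illustrative defaults assumed here, since this summary does not state them.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Tokens = List[str]
FILLER = "<BLANK>"  # hypothetical filler token, chosen for this sketch


def add_noise(tokens: Sequence[str], rng: random.Random,
              p_drop: float = 0.1, p_blank: float = 0.1,
              max_shift: int = 3) -> Tokens:
    """Noise a synthetic source sentence (all defaults are assumptions)."""
    # 1. Drop each token independently with probability p_drop.
    kept = [t for t in tokens if rng.random() >= p_drop]
    # 2. Replace each surviving token by the filler with probability p_blank.
    blanked = [FILLER if rng.random() < p_blank else t for t in kept]
    # 3. Local shuffle: sort by index plus uniform noise in [0, max_shift + 1),
    #    so no token moves more than max_shift positions.
    keys = [i + rng.random() * (max_shift + 1) for i in range(len(blanked))]
    order = sorted(range(len(blanked)), key=keys.__getitem__)
    return [blanked[i] for i in order]


def back_translate_corpus(
    target_sentences: Sequence[Tokens],
    backward_translate: Callable[[Tokens], Tokens],
    noise: Optional[Callable[[Tokens], Tokens]] = None,
) -> List[Tuple[Tokens, Tokens]]:
    """Pair each genuine target sentence with a synthetic source sentence."""
    pairs = []
    for tgt in target_sentences:
        src = backward_translate(tgt)  # synthetic source from the reverse model
        if noise is not None:
            src = noise(src)           # optionally corrupt the synthetic side only
        pairs.append((src, tgt))
    return pairs


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy reverse "model" used only so the sketch runs end to end.
    toy_model = lambda toks: [t.upper() for t in toks]
    monolingual = [["das", "ist", "ein", "kleines", "haus"]]
    synthetic = back_translate_corpus(
        monolingual, toy_model, noise=lambda s: add_noise(s, rng))
    print(synthetic)
```

In a real pipeline, the resulting synthetic pairs would be concatenated with the genuine parallel corpus before training the forward model. Only the synthetic source side is noised, so the target side remains genuine text.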

Code Implementations: 3 repos
