Building a Neural Machine Translation System Using Only Synthetic Parallel Data
This work addresses data scarcity in neural machine translation by proposing a synthetic data approach that builds incrementally on prior methods for generating synthetic data.
The study builds neural machine translation systems using only synthetic parallel data and introduces a new pseudo parallel corpus that mixes ground truth and synthetic examples on both sides of sentence pairs. Experiments on Czech-German and French-German translation show improved results for bidirectional tasks and substantial further improvement when the corpus is combined with real parallel data.
Recent work has shown that synthetic parallel data automatically generated by translation models can be effective for various neural machine translation (NMT) issues. In this study, we build NMT systems using only synthetic parallel data. As an efficient alternative to real parallel data, we also present a new type of synthetic parallel corpus. The proposed pseudo parallel data differ from those of previous work in that ground truth and synthetic examples are mixed on both sides of sentence pairs. Experiments on Czech-German and French-German translation demonstrate the efficacy of the proposed pseudo parallel corpus, which yields not only enhanced results for bidirectional translation tasks but also substantial improvement when combined with a real parallel corpus.
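The idea of mixing ground truth and synthetic examples on both sides of the sentence pairs can be illustrated with a minimal sketch. It rests on an assumption not spelled out in the abstract: one portion of the corpus pairs real source sentences with machine-translated targets, and the other pairs machine-translated sources with real targets. The function `build_mixed_pseudo_corpus` and the toy translators are hypothetical names introduced here for illustration only; in practice the translators would be trained translation models.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def build_mixed_pseudo_corpus(
    src_real: List[str],
    tgt_real: List[str],
    translate_src_to_tgt: Callable[[str], str],
    translate_tgt_to_src: Callable[[str], str],
    seed: int = 0,
) -> List[Pair]:
    """Combine two kinds of synthetic pairs into one training corpus.

    Hypothetical helper: the construction is an assumption made for
    illustration, not the authors' exact procedure.
    """
    # Real source sentence paired with a synthetic (machine-translated) target.
    src_originated = [(s, translate_src_to_tgt(s)) for s in src_real]
    # Synthetic (machine-translated) source paired with a real target sentence.
    tgt_originated = [(translate_tgt_to_src(t), t) for t in tgt_real]

    # After merging, each side of the corpus holds both real and synthetic text.
    mixed = src_originated + tgt_originated
    random.Random(seed).shuffle(mixed)  # seeded so the order is reproducible
    return mixed


# Toy stand-ins for translation models (assumptions, for demonstration only).
def toy_cs_to_de(sentence: str) -> str:
    return "[de] " + sentence


def toy_de_to_cs(sentence: str) -> str:
    return "[cs] " + sentence


corpus = build_mixed_pseudo_corpus(
    ["ahoj", "dobrý den"],
    ["hallo", "guten Tag"],
    toy_cs_to_de,
    toy_de_to_cs,
)
for src, tgt in corpus:
    print(src, "|||", tgt)
```

Because the pairs are direction-agnostic once merged, the same corpus could in principle be used to train a model in either translation direction by swapping the two sides.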