CL · Oct 21, 2020

Improving Simultaneous Translation by Incorporating Pseudo-References with Fewer Reorderings

arXiv:2010.11247v2666 citations
Originality: Incremental advance
AI Analysis

This paper addresses the lack of large-scale, high-quality datasets for simultaneous translation, which is crucial for real-time applications like live captioning or interpretation.

The paper tackles the problem of training simultaneous translation systems without large-scale simultaneous translation datasets. It rewrites the target side of existing full-sentence corpora into simultaneous-style translations and reports improvements of up to +2.7 BLEU on Zh->En and Ja->En simultaneous translation.

Simultaneous translation is vastly different from full-sentence translation, in the sense that it starts translation before the source sentence ends, with a delay of only a few words. However, due to the lack of large-scale, high-quality simultaneous translation datasets, most such systems are still trained on conventional full-sentence bitexts. This is far from ideal for the simultaneous scenario due to the abundance of unnecessary long-distance reorderings in those bitexts. We propose a novel method that rewrites the target side of existing full-sentence corpora into simultaneous-style translation. Experiments on Zh->En and Ja->En simultaneous translation show substantial improvements (up to +2.7 BLEU) with the addition of these generated pseudo-references.
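To make the point about long-distance reorderings concrete, the sketch below estimates how far a translator must read into the source before each target word can be emitted, given a word alignment. This is an illustrative measure only, not the paper's method or metric; the function names `required_source_prefix` and `max_lag`, and the alignment format of 0-indexed (source index, target index) pairs, are assumptions made for this example.

```python
from typing import List, Tuple

Alignment = List[Tuple[int, int]]  # hypothetical format: (source index, target index)


def required_source_prefix(alignment: Alignment, tgt_len: int) -> List[int]:
    """For each target position, return how many source words must have been
    read before that target word can be emitted."""
    furthest = [0] * tgt_len
    for src_idx, tgt_idx in alignment:
        # A target word needs every source word it is aligned to.
        furthest[tgt_idx] = max(furthest[tgt_idx], src_idx + 1)

    # Target words are emitted in order, so the requirement never decreases.
    needed, running = [], 0
    for f in furthest:
        running = max(running, f)
        needed.append(running)
    return needed


def max_lag(alignment: Alignment, tgt_len: int) -> int:
    """Largest gap between source words read and target words already emitted."""
    needed = required_source_prefix(alignment, tgt_len)
    return max(n - t for t, n in enumerate(needed))


# Monotonic reference: each target word follows its source word directly.
monotonic = [(0, 0), (1, 1), (2, 2), (3, 3)]
# Reordered reference: the first target word depends on the last source word.
reordered = [(3, 0), (0, 1), (1, 2), (2, 3)]

print(max_lag(monotonic, 4))
print(max_lag(reordered, 4))
```

Under these assumptions, the monotonic reference can be produced with a lag of one word, while the reordered reference forces the translator to wait for the entire source sentence before emitting anything. References with fewer such reorderings are therefore easier to learn from under a small fixed delay.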
