Improving Retrieval-Augmented Neural Machine Translation with Monolingual Data
This work addresses the limited availability of bilingual data faced by machine translation practitioners, though its contribution is incremental, since it builds on existing retrieval-augmented methods.
The paper tackles retrieval-augmented neural machine translation by leveraging monolingual target-language data rather than relying solely on bilingual corpora. It achieves performance that matches standard translation-memory-based models and shows strong improvements in real-world settings.
Conventional retrieval-augmented neural machine translation (RANMT) systems leverage bilingual corpora, e.g., translation memories (TMs). Yet, in many settings, monolingual corpora in the target language are also available. This work explores ways to take advantage of such resources by directly retrieving relevant target-language segments, based on a source-side query. For this, we design improved cross-lingual retrieval systems, trained with both sentence-level and word-level matching objectives. In our experiments with three RANMT architectures, we assess such cross-lingual objectives in a controlled setting, reaching performance that matches that of standard TM-based models. We also showcase our method in a real-world setting, using much larger monolingual corpora, and observe strong improvements over both the baseline setting and general-purpose cross-lingual retrievers.
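To make the retrieval step concrete, the following minimal sketch shows how target-language segments could be ranked against a source-side query once both live in a shared embedding space. It is an illustration under stated assumptions, not the retrievers trained in this work: the vectors are hand-written toy values standing in for the output of a cross-lingual encoder, and the names `cosine` and `retrieve_top_k` are hypothetical. The sentence-level and word-level training objectives are not modeled here.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_top_k(
    query_vec: Sequence[float],
    target_vecs: List[Sequence[float]],
    target_segments: List[str],
    k: int = 2,
) -> List[Tuple[float, str]]:
    """Return the k target-language segments most similar to the query."""
    scored = [
        (cosine(query_vec, vec), seg)
        for vec, seg in zip(target_vecs, target_segments)
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy monolingual target-side pool (French) with made-up embeddings.
# In a real system these vectors would come from a cross-lingual encoder.
target_segments = [
    "Le chat dort sur le tapis.",
    "La banque ouvre lundi matin.",
    "Le chien dort dans le jardin.",
]
target_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.9],
    [0.7, 0.3, 0.1],
]

# Made-up embedding of the English source query "The cat sleeps on the rug."
query_vec = [1.0, 0.0, 0.0]

results = retrieve_top_k(query_vec, target_vecs, target_segments, k=2)
for score, segment in results:
    print(f"{score:.3f}  {segment}")
```

In a RANMT pipeline, the retrieved segments would then be passed to the translation model alongside the source sentence. For large monolingual pools, the exhaustive scan above would normally be replaced by an approximate nearest-neighbor index.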