CL | Apr 30, 2025

Improving Retrieval-Augmented Neural Machine Translation with Monolingual Data

arXiv:2504.21747v2 | h-index: 22
Originality: Incremental advance
AI Analysis

This work addresses the challenge of limited bilingual data for machine translation practitioners, though it is incremental as it builds on existing retrieval-augmented methods.

The paper tackles retrieval-augmented neural machine translation by retrieving from monolingual target-language data instead of relying solely on bilingual corpora. It reports performance that matches standard translation-memory-based models in a controlled setting and strong improvements in a real-world setting.

Conventional retrieval-augmented neural machine translation (RANMT) systems leverage bilingual corpora, e.g., translation memories (TMs). Yet, in many settings, monolingual corpora in the target language are also available. This work explores ways to take advantage of such resources by directly retrieving relevant target-language segments, based on a source-side query. For this, we design improved cross-lingual retrieval systems, trained with both sentence-level and word-level matching objectives. In our experiments with three RANMT architectures, we assess such cross-lingual objectives in a controlled setting, reaching performance that matches that of standard TM-based models. We also showcase our method in a real-world setting, using much larger monolingual corpora, and observe strong improvements over both the baseline setting and general-purpose cross-lingual retrievers.
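The retrieval step described in the abstract, a source-side query matched directly against target-language segments, can be illustrated with a minimal sketch. It assumes that some cross-lingual encoder has already mapped source and target sentences into a shared vector space; the toy vectors, the `cosine` and `retrieve_top_k` helpers, and the choice of cosine similarity are illustrative assumptions, not the paper's actual retriever or training objectives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_top_k(query_vec, target_index, k=2):
    """Return the k target-language segments closest to the source query.

    target_index is a list of (segment, embedding) pairs. The embeddings
    are assumed to come from a cross-lingual encoder (hypothetical here).
    """
    scored = [(cosine(query_vec, vec), seg) for seg, vec in target_index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [seg for _, seg in scored[:k]]


# Toy monolingual target-side index (French), with made-up embeddings.
target_index = [
    ("Le chat dort sur le lit.", [0.9, 0.1, 0.0]),
    ("La bourse a chute hier.", [0.0, 0.2, 0.9]),
    ("Un chat noir traverse la rue.", [0.7, 0.6, 0.1]),
]

# Made-up embedding standing in for the English source "The cat sleeps on the bed."
query_vec = [1.0, 0.2, 0.0]

# Retrieved segments would then be passed to the RANMT model as examples.
examples = retrieve_top_k(query_vec, target_index, k=2)
print(examples)
```

In a real system the linear scan would typically be replaced by an approximate nearest-neighbour index, and the quality of the retrieved examples would depend entirely on how well the encoder aligns the two languages.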
