CL · Apr 6, 2020

Dictionary-based Data Augmentation for Cross-Domain Neural Machine Translation

arXiv:2004.02577v1 · 27 citations
AI Analysis

This paper addresses translation errors on low-frequency and out-of-vocabulary terminology in cross-domain NMT, offering an incremental improvement over existing methods such as back-translation.

The paper tackles the problem of domain information gaps in neural machine translation by proposing a dictionary-based data augmentation method that synthesizes domain-specific dictionaries with general corpora to generate pseudo-in-domain data, resulting in improvements of 3.75-11.53 BLEU over baseline models.

Existing data augmentation approaches for neural machine translation (NMT) have predominantly relied on back-translating in-domain (IND) monolingual corpora. These methods suffer from a domain information gap, which leads to translation errors for low-frequency and out-of-vocabulary terminology. This paper proposes a dictionary-based data augmentation (DDA) method for cross-domain NMT. DDA synthesizes a domain-specific dictionary with general-domain corpora to automatically generate a large-scale pseudo-IND parallel corpus. The generated pseudo-IND data can be used to enhance a baseline trained on general-domain data. The experiments show that the DDA-enhanced NMT models achieve consistent, significant improvements, outperforming the baseline models by 3.75-11.53 BLEU. The proposed method can also further improve the performance of back-translation-based and IND-finetuned NMT models. The improvement is associated with the enhanced domain coverage produced by DDA.
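To make the idea of combining a domain dictionary with general-domain sentence pairs concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's actual procedure: the abstract does not say how dictionary entries are placed into sentences, so this sketch assumes simple substitution of one known general-domain word pair (a "slot"). The names `augment`, `substitute`, and `slot`, and the toy English-German data, are all hypothetical.

```python
from typing import List, Tuple

# A parallel sentence pair: (source tokens, target tokens).
Pair = Tuple[List[str], List[str]]


def substitute(tokens: List[str], old: str, new: str) -> List[str]:
    """Return a copy of tokens with every occurrence of old replaced by new."""
    return [new if tok == old else tok for tok in tokens]


def augment(
    corpus: List[Pair],
    slot: Tuple[str, str],
    dictionary: List[Tuple[str, str]],
) -> List[Pair]:
    """Generate pseudo-in-domain pairs (assumed substitution scheme).

    For each general-domain pair that contains the slot word exactly once
    on both sides, emit one new pair per dictionary entry, with the slot
    words swapped for the domain term and its translation.
    """
    slot_src, slot_tgt = slot
    generated: List[Pair] = []
    for src, tgt in corpus:
        if src.count(slot_src) == 1 and tgt.count(slot_tgt) == 1:
            for term_src, term_tgt in dictionary:
                generated.append(
                    (
                        substitute(src, slot_src, term_src),
                        substitute(tgt, slot_tgt, term_tgt),
                    )
                )
    return generated


# Toy general-domain corpus and a toy medical dictionary (illustrative only).
corpus = [
    (["the", "drug", "is", "safe"], ["das", "Medikament", "ist", "sicher"]),
    (["we", "went", "home"], ["wir", "gingen", "nach", "Hause"]),
]
dictionary = [("ibuprofen", "Ibuprofen"), ("insulin", "Insulin")]

pseudo_ind = augment(corpus, ("drug", "Medikament"), dictionary)
```

Only the first sentence pair contains the slot, so the sketch yields one new pair per dictionary entry from it and skips the second pair. A realistic pipeline would need word alignment and morphological agreement, which this sketch deliberately leaves out.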

