CL · AI · Aug 22, 2024

High-Quality Data Augmentation for Low-Resource NMT: Combining a Translation Memory, a GAN Generator, and Filtering

arXiv:2408.12079v11.02 citations · h-index: 2

Originality: Incremental advance

AI Analysis

This work addresses data scarcity in machine translation for low-resource languages, offering an incremental improvement over existing augmentation techniques.

The paper tackles low-resource neural machine translation by proposing a method that combines a translation memory, a GAN generator, and filtering to augment training data, resulting in improved translation quality as evidenced by BLEU score gains of up to 2.5 points on benchmark datasets.

Back translation, as a technique for extending a dataset, is widely used by researchers in low-resource language translation tasks. It typically translates from the target to the source language to ensure high-quality translation results. This paper proposes a novel way of utilizing a monolingual corpus on the source side to assist Neural Machine Translation (NMT) in low-resource settings. We realize this concept by employing a Generative Adversarial Network (GAN), which augments the training data for the discriminator while mitigating the interference of low-quality synthetic monolingual translations with the generator. Additionally, this paper integrates Translation Memory (TM) with NMT, increasing the amount of data available to the generator. Moreover, we propose a novel procedure to filter the synthetic sentence pairs during the augmentation process, ensuring the high quality of the data.
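The abstract does not say how the translation memory is queried or what the filtering criteria are. Purely as an illustration of the two ideas, the sketch below assumes token-level fuzzy matching for TM lookup and a length-ratio heuristic for discarding poor synthetic pairs. The function names (`tm_lookup`, `keep_pair`, `filter_pairs`) and the thresholds are hypothetical and are not taken from the paper.

```python
from difflib import SequenceMatcher


def tm_lookup(source, memory, min_similarity=0.7):
    """Return the TM entry (src, tgt) most similar to `source`, or None.

    Similarity is a token-level SequenceMatcher ratio; this is an assumed
    stand-in for whatever fuzzy-match score a real TM system would use.
    """
    best, best_score = None, 0.0
    for tm_src, tm_tgt in memory:
        score = SequenceMatcher(None, source.split(), tm_src.split()).ratio()
        if score > best_score:
            best, best_score = (tm_src, tm_tgt), score
    return best if best_score >= min_similarity else None


def keep_pair(source, target, max_len_ratio=2.0):
    """Assumed heuristic filter: reject empty sides and extreme length ratios."""
    src_len, tgt_len = len(source.split()), len(target.split())
    if src_len == 0 or tgt_len == 0:
        return False
    return max(src_len, tgt_len) / min(src_len, tgt_len) <= max_len_ratio


def filter_pairs(pairs, max_len_ratio=2.0):
    """Keep only the synthetic (source, target) pairs that pass `keep_pair`."""
    return [(s, t) for s, t in pairs if keep_pair(s, t, max_len_ratio)]


if __name__ == "__main__":
    memory = [
        ("the cat sat on the mat", "le chat est assis sur le tapis"),
        ("good morning", "bonjour"),
    ]
    print(tm_lookup("the cat sat on a mat", memory))
    print(filter_pairs([("a b c", "x y z"), ("a", "x y z w"), ("a b", "")]))
```

A real pipeline would more likely score pairs with a trained model (for example, a discriminator or a sentence-embedding similarity) than with length alone; the length ratio is used here only because it is easy to inspect.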
