CL · May 1, 2017

Data Augmentation for Low-Resource Neural Machine Translation

Marzieh Fadaee, Arianna Bisazza, Christof Monz

arXiv:1705.00440v1 · 508 citations

Originality: Incremental advance

AI Analysis

The paper addresses the challenge of limited parallel data for low-resource language pairs. The advance is incremental because it builds on existing data augmentation ideas.

The paper tackles poor translation quality in low-resource neural machine translation with a data augmentation approach that generates new sentence pairs containing rare words. In simulated low-resource settings, it reports improvements of up to 2.9 BLEU points over the baseline and up to 3.2 BLEU over back-translation.

The quality of a Neural Machine Translation system depends substantially on the availability of sizable parallel corpora. For low-resource language pairs this is not the case, resulting in poor translation quality. Inspired by work in computer vision, we propose a novel data augmentation approach that targets low-frequency words by generating new sentence pairs containing rare words in new, synthetically created contexts. Experimental results on simulated low-resource settings show that our method improves translation quality by up to 2.9 BLEU points over the baseline and up to 3.2 BLEU over back-translation.
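The core idea in the abstract, placing rare words into new contexts on both sides of a sentence pair, can be illustrated with a toy sketch. This is not the authors' implementation: the function name `augment_rare_words`, the hand-written translation table `rare_dict`, the precomputed word alignments, and the `is_plausible` hook (a stand-in for whatever model decides whether a substitution fits its context) are all assumptions made for illustration.

```python
from collections import Counter


def augment_rare_words(pairs, rare_dict, alignments, max_count=1,
                       is_plausible=lambda tokens, index: True):
    """Create synthetic sentence pairs by swapping in rare words.

    pairs        : list of (source_tokens, target_tokens)
    rare_dict    : hypothetical {source_word: target_word} table
    alignments   : one {source_index: target_index} dict per pair (assumed given)
    max_count    : a source word counts as rare if it occurs at most this often
    is_plausible : hypothetical filter; by default every substitution is accepted
    """
    # Count source-side word frequencies over the whole corpus.
    counts = Counter(tok for src, _ in pairs for tok in src)
    rare = [w for w in rare_dict if counts[w] <= max_count]

    synthetic = []
    for (src, tgt), align in zip(pairs, alignments):
        for i, j in align.items():
            for w in rare:
                if src[i] == w:
                    continue  # the rare word is already in this position
                new_src = src[:i] + [w] + src[i + 1:]
                if not is_plausible(new_src, i):
                    continue
                # Replace the aligned target word with the rare word's translation.
                new_tgt = tgt[:j] + [rare_dict[w]] + tgt[j + 1:]
                synthetic.append((new_src, new_tgt))
    return synthetic
```

With the default filter, every aligned position receives every rare word, which would produce many implausible sentences on real data. A practical system would need a much stricter `is_plausible` check.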
