CL, Jun 10, 2019

Generalized Data Augmentation for Low-Resource Translation

arXiv:1906.03785v1, 137 citations
Originality: Incremental advance
AI Analysis

This paper addresses translation challenges for low-resource languages, offering an incremental improvement over existing methods such as back-translation.

The paper tackles the problem of low-resource machine translation by proposing a general data augmentation framework that uses target-side monolingual data and pivots through a high-resource language, improving translation quality by up to 8 BLEU points in extreme low-resource settings.

Translation to or from low-resource languages (LRLs) poses challenges for machine translation in terms of both adequacy and fluency. Data augmentation utilizing large amounts of monolingual data is regarded as an effective way to alleviate these problems. In this paper, we propose a general framework for data augmentation in low-resource machine translation that not only uses target-side monolingual data, but also pivots through a related high-resource language (HRL). Specifically, we experiment with a two-step pivoting method to convert high-resource data to the LRL, making use of available resources to better approximate the true data distribution of the LRL. First, we inject LRL words into HRL sentences through an induced bilingual dictionary. Second, we further edit these modified sentences using a modified unsupervised machine translation framework. Extensive experiments on four low-resource datasets show that under extreme low-resource settings, our data augmentation techniques improve translation quality by up to 1.5 to 8 BLEU points compared to supervised back-translation baselines.
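
The first pivoting step described in the abstract (injecting LRL words into HRL sentences through a bilingual dictionary) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace tokenization, and the toy dictionary entries below are all assumptions made for illustration, and the dictionary induction itself and the second (unsupervised MT) editing step are not shown.

```python
from typing import Dict, List, Tuple


def inject_lrl_words(hrl_tokens: List[str], hrl_to_lrl: Dict[str, str]) -> List[str]:
    """Replace every HRL token that has a dictionary entry with its LRL
    counterpart; tokens without an entry are left unchanged."""
    return [hrl_to_lrl.get(token, token) for token in hrl_tokens]


def make_pseudo_lrl_pairs(
    hrl_english_pairs: List[Tuple[str, str]],
    hrl_to_lrl: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Turn (HRL sentence, English sentence) pairs into
    (pseudo-LRL sentence, English sentence) pairs by word substitution.
    Whitespace tokenization is an assumption of this sketch."""
    pairs = []
    for hrl_sentence, english_sentence in hrl_english_pairs:
        swapped = inject_lrl_words(hrl_sentence.split(), hrl_to_lrl)
        pairs.append((" ".join(swapped), english_sentence))
    return pairs


# Hypothetical placeholder tokens, not real words from any language.
toy_dictionary = {"hrl_house": "lrl_house", "hrl_big": "lrl_big"}
toy_corpus = [("hrl_the hrl_big hrl_house", "the big house")]
pseudo_pairs = make_pseudo_lrl_pairs(toy_corpus, toy_dictionary)
```

In this sketch, words missing from the dictionary stay in their HRL form, so the output is a mix of the two languages; the abstract's second step is what further edits such sentences toward the LRL.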

