CL · LG · Oct 15, 2020

Unsupervised Bitext Mining and Translation via Self-trained Contextual Embeddings

arXiv:2010.07761v1854 citations
Originality: Incremental advance
AI Analysis

This paper addresses the scarcity of parallel training data for machine translation, offering an incremental enhancement to existing unsupervised methods.

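The paper tackles unsupervised machine translation by mining pseudo-parallel corpora from unaligned text with multilingual BERT sentence embeddings adapted through self-training. It reports up to a 24.5-point absolute F1 increase over previous unsupervised methods on the BUCC 2017 bitext mining task and up to 3.5 BLEU improvement on the WMT'14 French-English and WMT'16 German-English translation tasks.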

We describe an unsupervised method to create pseudo-parallel corpora for machine translation (MT) from unaligned text. We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model via self-training. We validate our technique by extracting parallel sentence pairs on the BUCC 2017 bitext mining task and observe up to a 24.5 point increase (absolute) in F1 scores over previous unsupervised methods. We then improve an XLM-based unsupervised neural MT system pre-trained on Wikipedia by supplementing it with pseudo-parallel text mined from the same corpus, boosting unsupervised translation performance by up to 3.5 BLEU on the WMT'14 French-English and WMT'16 German-English tasks and outperforming the previous state-of-the-art. Finally, we enrich the IWSLT'15 English-Vietnamese corpus with pseudo-parallel Wikipedia sentence pairs, yielding a 1.2 BLEU improvement on the low-resource MT task. We demonstrate that unsupervised bitext mining is an effective way of augmenting MT datasets and complements existing techniques like initializing with pre-trained contextual embeddings.
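
The nearest-neighbor mining step described in the abstract can be illustrated with a minimal sketch. It assumes sentence embeddings have already been computed (for instance by pooling multilingual BERT outputs) and are given as plain lists of floats. The function name `mine_pairs`, the mutual-nearest-neighbor filter, the plain cosine score, and the `threshold` value are illustrative assumptions, not the authors' scoring function or settings, and the self-training step is not shown.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def mine_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """Return (src_idx, tgt_idx, score) for mutual nearest neighbours.

    Hypothetical helper: a pair is kept only if each sentence is the
    other's best match and the cosine score reaches `threshold`
    (an assumed value, not taken from the paper).
    """
    sims = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    # Best target for every source sentence.
    best_tgt = [max(range(len(tgt_vecs)), key=lambda j: row[j]) for row in sims]
    # Best source for every target sentence.
    best_src = [
        max(range(len(src_vecs)), key=lambda i: sims[i][j])
        for j in range(len(tgt_vecs))
    ]
    pairs = []
    for i, j in enumerate(best_tgt):
        if best_src[j] == i and sims[i][j] >= threshold:
            pairs.append((i, j, sims[i][j]))
    return pairs


# Toy embeddings standing in for real sentence vectors.
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.7]]
tgt = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
pairs = mine_pairs(src, tgt)
```

In this toy run, source sentences 0 and 1 are paired with targets 1 and 0 respectively, while source 2 is dropped because its best target already prefers another source. The quadratic similarity matrix is only practical for small inputs; mining at Wikipedia scale would need an approximate nearest-neighbor index.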

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
