CL | AI | LG | Mar 27, 2023

Bilex Rx: Lexical Data Augmentation for Massively Multilingual Machine Translation

arXiv:2303.15265v1 | 12 citations | h-index: 46 | Has Code
Originality: Incremental advance
AI Analysis

The paper addresses translation quality in low-resource languages for NLP applications. It is rated incremental because it builds on existing unsupervised methods by adding lexical augmentation.

This paper tackles the problem of improving unsupervised machine translation by using bilingual lexica to enhance translation of common nouns, demonstrating sizable performance gains on 200-language models trained on web-crawled text.

Neural machine translation (NMT) has progressed rapidly over the past several years, and modern models are able to achieve relatively high quality using only monolingual text data, an approach dubbed Unsupervised Machine Translation (UNMT). However, these models still struggle in a variety of ways, including aspects of translation that for a human are the easiest - for instance, correctly translating common nouns. This work explores a cheap and abundant resource to combat this problem: bilingual lexica. We test the efficacy of bilingual lexica in a real-world set-up, on 200-language translation models trained on web-crawled text. We present several findings: (1) using lexical data augmentation, we demonstrate sizable performance gains for unsupervised translation; (2) we compare several families of data augmentation, demonstrating that they yield similar improvements, and can be combined for even greater improvements; (3) we demonstrate the importance of carefully curated lexica over larger, noisier ones, especially with larger models; and (4) we compare the efficacy of multilingual lexicon data versus human-translated parallel data. Finally, we open-source GATITOS (available at https://github.com/google-research/url-nlp/tree/main/gatitos), a new multilingual lexicon for 26 low-resource languages, which had the highest performance among lexica in our experiments.
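
The abstract does not describe the augmentation families it compares, so the following is only a rough sketch of two generic ways a bilingual lexicon can be turned into training data: substituting lexicon translations into monolingual sentences (often called codeswitching), and treating each lexicon entry as a tiny parallel example. The function names `codeswitch` and `lexicon_as_parallel`, the `swap_prob` parameter, and the toy English-Spanish lexicon are all hypothetical and are not taken from the paper or the GATITOS repository.

```python
import random


def codeswitch(sentence, lexicon, swap_prob=0.5, rng=None):
    """Replace tokens found in the lexicon with one of their translations.

    Assumption: whitespace tokenization and lowercase lexicon keys; a real
    pipeline would need proper tokenization and morphology handling.
    """
    rng = rng or random.Random()
    out = []
    for token in sentence.split():
        translations = lexicon.get(token.lower())
        # Swap only when the token has an entry and the coin flip succeeds.
        if translations and rng.random() < swap_prob:
            out.append(rng.choice(translations))
        else:
            out.append(token)
    return " ".join(out)


def lexicon_as_parallel(lexicon):
    """Emit every (source word, target word) entry as a training pair."""
    return [(src, tgt) for src, tgts in sorted(lexicon.items()) for tgt in tgts]


if __name__ == "__main__":
    # Toy lexicon for illustration only (hypothetical data).
    toy_lexicon = {"cat": ["gato"], "dog": ["perro", "can"]}
    print(codeswitch("the cat sees the dog", toy_lexicon, swap_prob=0.5,
                     rng=random.Random(0)))
    print(lexicon_as_parallel(toy_lexicon))
```

Passing a seeded `random.Random` makes the substitutions reproducible, which helps when comparing augmentation settings across runs.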

Code Implementations: 1 repo