CL · Aug 31, 2019

Explicit Cross-lingual Pre-training for Unsupervised Machine Translation

arXiv:1909.00180v1 · 1031 citations
Originality: Highly original
AI Analysis

This work addresses the challenge of improving translation quality in unsupervised machine translation. The contribution is incremental in that it builds on existing pre-training approaches.

The paper tackles the problem of limited cross-lingual information in unsupervised machine translation by proposing a novel pre-training method that incorporates explicit cross-lingual signals, resulting in significant performance improvements.

Pre-training has proven to be effective in unsupervised machine translation due to its ability to model deep context information in cross-lingual scenarios. However, the cross-lingual information obtained from shared BPE spaces is inexplicit and limited. In this paper, we propose a novel cross-lingual pre-training method for unsupervised machine translation by incorporating explicit cross-lingual training signals. Specifically, we first calculate cross-lingual n-gram embeddings and infer an n-gram translation table from them. With those n-gram translation pairs, we propose a new pre-training model called Cross-lingual Masked Language Model (CMLM), which randomly chooses source n-grams in the input text stream and predicts their translation candidates at each time step. Experiments show that our method can incorporate beneficial cross-lingual information into pre-trained models. Taking pre-trained CMLM models as the encoder and decoder, we significantly improve the performance of unsupervised machine translation.
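The two steps the abstract describes (inferring an n-gram translation table from cross-lingual n-gram embeddings, then masking a source n-gram and predicting its translation candidates) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `infer_ngram_table` and `make_cmlm_example`, the toy 2-dimensional embeddings, the use of cosine-similarity nearest neighbours, and the `[MASK]` token are all assumptions made for the example. The sketch also ignores how the model handles source and target n-grams of different lengths.

```python
import numpy as np

def infer_ngram_table(src_ngrams, src_vecs, tgt_ngrams, tgt_vecs, k=2):
    """Map each source n-gram to its k nearest target n-grams.

    Assumes both embedding matrices already live in one shared
    cross-lingual space; similarity is plain cosine (an assumption).
    """
    s = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    t = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sim = s @ t.T  # shape: (num_src, num_tgt)
    table = {}
    for i, ngram in enumerate(src_ngrams):
        top = np.argsort(-sim[i])[:k]  # indices of the k most similar targets
        table[ngram] = [(tgt_ngrams[j], float(sim[i, j])) for j in top]
    return table

def make_cmlm_example(tokens, table, rng, max_n=3, mask="[MASK]"):
    """Randomly pick one source n-gram found in the table and mask it.

    Returns (masked_tokens, (start, length), translation_candidates).
    The candidates are the prediction targets in this simplified sketch.
    """
    spans = [
        (i, n)
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
        if tuple(tokens[i:i + n]) in table
    ]
    if not spans:  # nothing to mask in this sentence
        return list(tokens), None, []
    i, n = spans[rng.integers(len(spans))]
    masked = list(tokens[:i]) + [mask] * n + list(tokens[i + n:])
    return masked, (i, n), table[tuple(tokens[i:i + n])]

# Toy data (hypothetical): two English n-grams, two German n-grams.
src_ngrams = [("good",), ("thank", "you")]
tgt_ngrams = [("gut",), ("danke",)]
src_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
tgt_vecs = np.array([[0.9, 0.1], [0.1, 0.9]])

table = infer_ngram_table(src_ngrams, src_vecs, tgt_ngrams, tgt_vecs)
rng = np.random.default_rng(0)
masked, span, candidates = make_cmlm_example(
    ["thank", "you", "very", "much"], table, rng
)
```

In this toy run the only n-gram present in the table is `("thank", "you")`, so it is the span that gets masked, and its candidate list is ranked by similarity with `("danke",)` first. A real pipeline would train a model to predict those candidates from the masked context.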
