CL AI LG

Mar 12, 2021

Bilingual Dictionary-based Language Model Pretraining for Neural Machine Translation

Yusen Lin, Jiayong Lin, Shuaicheng Zhang, Haoying Dai

arXiv:2103.07040v1

0.51 citations

Originality: Incremental advance

AI Analysis

This work addresses the challenge of high data costs in machine translation, particularly for low-resource languages, by offering a more efficient pretraining method.

The paper aims to reduce the reliance of neural machine translation on expensive parallel corpora by proposing a Bilingual Dictionary-based Language Model (BDLM) that incorporates translation information from dictionaries during pretraining. For Chinese-English, it achieves 55.0 BLEU on WMT-News19 and 24.3 BLEU on WMT20 news-commentary, outperforming the Vanilla Transformer by more than 8.4 and 2.3 BLEU, respectively.

Recent studies have demonstrated a noticeable improvement in the performance of neural machine translation from cross-lingual language model pretraining (Lample and Conneau, 2019), especially Translation Language Modeling (TLM). To reduce TLM's need for expensive parallel corpora, in this work we incorporate translation information from dictionaries into the pretraining process and propose a novel Bilingual Dictionary-based Language Model (BDLM). We evaluate our BDLM on Chinese, English, and Romanian. For Chinese-English, we obtained 55.0 BLEU on WMT-News19 (Tiedemann, 2012) and 24.3 BLEU on WMT20 news-commentary, outperforming the Vanilla Transformer (Vaswani et al., 2017) by more than 8.4 BLEU and 2.3 BLEU, respectively. According to our results, the BDLM also has advantages in convergence speed and in predicting rare words. The increase in BLEU for WMT16 Romanian-English also shows its effectiveness in low-resource language translation.
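The abstract does not spell out how dictionary entries enter the pretraining objective. As an illustration only, the following minimal sketch shows one plausible reading: a source token that has a dictionary entry is masked, and its dictionary translation becomes the prediction target. The function name `build_dictionary_example`, the `[MASK]` symbol, and the toy dictionary are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional, Sequence, Tuple

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def build_dictionary_example(
    tokens: Sequence[str],
    dictionary: Dict[str, str],
    positions: Sequence[int],
) -> Tuple[List[str], List[Optional[str]]]:
    """Mask the chosen positions that have a dictionary entry.

    Returns the masked input and, per position, the dictionary
    translation to predict (None where nothing is predicted).
    This is an assumed objective, not the authors' confirmed method.
    """
    inputs = list(tokens)
    targets: List[Optional[str]] = [None] * len(tokens)
    for i in positions:
        translation = dictionary.get(tokens[i])
        if translation is None:
            continue  # no dictionary entry: leave the token untouched
        inputs[i] = MASK
        targets[i] = translation
    return inputs, targets


# Toy Chinese-English data, invented for illustration.
tokens = ["我", "喜欢", "猫"]
toy_dictionary = {"喜欢": "like", "猫": "cat"}
inputs, targets = build_dictionary_example(tokens, toy_dictionary, [0, 2])
print(inputs)
print(targets)
```

Under this reading only monolingual text and a word-level dictionary are needed to build training examples, which is where the saving over parallel corpora would come from.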
