CL Aug 5, 2016

Resolving Out-of-Vocabulary Words with Bilingual Embeddings in Machine Translation

Pranava Swaroop Madhyastha, Cristina España-Bonet

arXiv:1608.01910v12.31 citations

Originality: Incremental advance

AI Analysis

The paper addresses translation errors caused by out-of-vocabulary words in phrase-based statistical machine translation systems, particularly in out-of-domain scenarios. The advance is incremental because it builds on existing embedding and translation methods.

The paper tackles the problem of out-of-vocabulary words causing errors in machine translation by proposing a log-bilinear softmax-based model for vocabulary expansion, which generates probabilistic translation lists and improves translation quality by 3.9 BLEU points on an out-of-domain English-Spanish test set.

Out-of-vocabulary words account for a large proportion of errors in machine translation systems, especially when a system is used on a different domain from the one it was trained on. To alleviate the problem, we propose a log-bilinear softmax-based model for vocabulary expansion: given an out-of-vocabulary source word, the model generates a probabilistic list of possible translations in the target language. Our model uses only word embeddings trained on large unlabelled monolingual corpora and is trained on a fairly small word-to-word bilingual dictionary. We feed this probabilistic list into a standard phrase-based statistical machine translation system and obtain consistent improvements in translation quality on the English-Spanish language pair. In particular, we obtain an improvement of 3.9 BLEU points on an out-of-domain test set.
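
To make the idea of a log-bilinear softmax over target words concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation. It assumes the common bilinear form, in which the score of a source/target pair is `src_vec @ W @ tgt_vec` and the scores are normalised with a softmax over the target vocabulary. The toy one-hot embeddings, the three-entry dictionary, the function names, and the hyperparameters are all hypothetical and chosen only for illustration.

```python
import numpy as np

def translation_probs(src_vec, W, tgt_matrix):
    """Softmax over target words of the bilinear score src_vec^T W tgt_vec."""
    scores = tgt_matrix @ (W.T @ src_vec)   # one score per target word
    scores = scores - scores.max()          # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

def top_k_translations(src_vec, W, tgt_matrix, tgt_words, k=3):
    """Return the k most probable target words with their probabilities."""
    probs = translation_probs(src_vec, W, tgt_matrix)
    best = np.argsort(-probs)[:k]
    return [(tgt_words[i], float(probs[i])) for i in best]

def train(src_vecs, tgt_matrix, gold_idx, epochs=200, lr=0.5):
    """Fit W by gradient descent on the negative log-likelihood of the dictionary."""
    W = np.zeros((src_vecs.shape[1], tgt_matrix.shape[1]))
    for _ in range(epochs):
        grad = np.zeros_like(W)
        for s, g in zip(src_vecs, gold_idx):
            p = translation_probs(s, W, tgt_matrix)
            expected_tgt = p @ tgt_matrix   # expected target embedding under the model
            # Gradient of -log p(gold | s) with respect to W
            grad += np.outer(s, expected_tgt - tgt_matrix[g])
        W -= lr * grad / len(src_vecs)
    return W

# Hypothetical toy data: one-hot "embeddings" stand in for real monolingual embeddings.
tgt_words = ["casa", "gato", "perro"]
tgt_matrix = np.eye(3)
src_vecs = np.eye(3)                 # rows: "cat", "dog", "house"
gold_idx = [1, 2, 0]                 # cat -> gato, dog -> perro, house -> casa

W = train(src_vecs, tgt_matrix, gold_idx)

# An unseen source word whose (made-up) embedding lies close to "cat".
kitten_vec = np.array([0.9, 0.1, 0.0])
candidates = top_k_translations(kitten_vec, W, tgt_matrix, tgt_words)
print(candidates)
```

In this sketch the out-of-vocabulary word never appears in the dictionary. Its ranked candidate list comes entirely from where its embedding sits relative to the dictionary words, which is the property the abstract relies on. A list like `candidates` is what would then be passed to the phrase-based system as additional translation options.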
