CL · AI · Oct 11, 2022

MTet: Multi-domain Translation for English and Vietnamese

arXiv:2210.05610v2 · 11 citations · h-index: 27
Originality: Incremental advance
AI Analysis

This work addresses the scarcity of English-Vietnamese translation resources, which benefits NLP researchers and practitioners. The contribution is incremental because it builds on existing datasets.

The authors address the lack of large parallel corpora for English-Vietnamese translation by introducing MTet, a corpus of 4.2M training sentence pairs, and EnViT5, a pretrained model for English and Vietnamese. Combined, the two resources improve translation BLEU by up to 2 points over the previous state of the art, with a model that is 1.6 times smaller.

We introduce MTet, the largest publicly available parallel corpus for English-Vietnamese translation. MTet consists of 4.2M high-quality training sentence pairs and a multi-domain test set refined by the Vietnamese research community. Combining with previous works on English-Vietnamese translation, we grow the existing parallel dataset to 6.2M sentence pairs. We also release the first pretrained model EnViT5 for English and Vietnamese languages. Combining both resources, our model significantly outperforms previous state-of-the-art results by up to 2 points in translation BLEU score, while being 1.6 times smaller.

Code Implementations: 2 repos