CL AI LG Dec 11, 2021

Selecting Parallel In-domain Sentences for Neural Machine Translation Using Monolingual Texts

Javad Pourmostafa Roshan Sharami, Dimitar Shterionov, Pieter Spronck

arXiv:2112.06096v31.411 citations Has Code

Originality: Incremental advance

AI Analysis

This work addresses the issue of domain adaptation for machine translation practitioners, offering an incremental improvement by efficiently selecting relevant training data.

Generic neural machine translation models tend to perform poorly on domain-specific tasks. The paper proposes selecting in-domain sentences from parallel general-domain corpora by ranking them on their cosine similarity with monolingual domain-specific data and keeping the top K. The results show that models trained on the selected data outperform those trained on generic or mixed data, so the method yields high-quality domain-specific training data at low computational cost and with a small data size.

Continuously-growing data volumes lead to larger generic models. Specific use-cases are usually left out, since generic models tend to perform poorly in domain-specific cases. Our work addresses this gap with a method for selecting in-domain data from generic-domain (parallel text) corpora, for the task of machine translation. The proposed method ranks sentences in parallel general-domain data according to their cosine similarity with a monolingual domain-specific data set. We then select the top K sentences with the highest similarity score to train a new machine translation system tuned to the specific in-domain data. Our experimental results show that models trained on this in-domain data outperform models trained on generic or a mixture of generic and domain data. That is, our method selects high-quality domain-specific training instances at low computational cost and data size.
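The ranking-and-selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not name the sentence encoder, so a simple bag-of-words count vector stands in for it. Scoring each general-domain sentence by its best match against the in-domain set is likewise an assumption. The function names (`embed`, `cosine`, `select_top_k`) and the toy data are hypothetical.

```python
import math
from collections import Counter


def embed(sentence):
    """Stand-in embedding: bag-of-words counts (an assumption, not the paper's encoder)."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_top_k(parallel_pairs, in_domain_sentences, k):
    """Rank (source, target) pairs by the source side's similarity to the
    monolingual in-domain set and keep the k highest-scoring pairs.

    Assumption: a pair's score is the maximum cosine similarity between its
    source sentence and any in-domain sentence.
    """
    domain_vecs = [embed(s) for s in in_domain_sentences]
    scored = []
    for src, tgt in parallel_pairs:
        vec = embed(src)
        score = max((cosine(vec, d) for d in domain_vecs), default=0.0)
        scored.append((score, src, tgt))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(src, tgt) for _, src, tgt in scored[:k]]


# Toy general-domain parallel data; the target sides are placeholders.
general = [
    ("the patient received a dose of insulin", "TGT-1"),
    ("the stock market closed higher today", "TGT-2"),
    ("the doctor adjusted the insulin dose", "TGT-3"),
    ("we went hiking in the mountains", "TGT-4"),
]
# Toy monolingual in-domain (medical) sentences.
in_domain = [
    "insulin dose for the diabetic patient",
    "the doctor examined the patient",
]

selected = select_top_k(general, in_domain, k=2)
for src, tgt in selected:
    print(src, "->", tgt)
```

Only the source side of each pair is compared with the monolingual data, and the target side is carried along unchanged, so the selected subset remains parallel and can be used directly as training data. In practice a dense sentence encoder would likely replace `embed`, and the pairwise comparison would be batched or approximated for large corpora.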
