scb-mt-en-th-2020: A Large English-Thai Parallel Corpus
This work provides a dataset for machine translation research and applications involving English and Thai, addressing a resource gap for this language pair.
The authors constructed a large-scale English-Thai parallel corpus with over 1 million segment pairs from diverse sources and trained machine translation models on it. The models perform comparably to Google Translation API (as of May 2020) for Thai-English, and outperform it in both translation directions when the Open Parallel Corpus (OPUS) is added to the training data.
The primary objective of our work is to build a large-scale English-Thai dataset for machine translation. We construct an English-Thai machine translation dataset with over 1 million segment pairs, curated from various sources, namely news, Wikipedia articles, SMS messages, task-based dialogs, web-crawled data, and government documents. The methodology for gathering data, building parallel texts, and removing noisy sentence pairs is presented in a reproducible manner. We train machine translation models based on this dataset. Our models' performance is comparable to that of Google Translation API (as of May 2020) for Thai-English, and it exceeds Google's for both Thai-English and English-Thai translation when the Open Parallel Corpus (OPUS) is included in the training data. The dataset, pre-trained models, and source code to reproduce our work are available for public use.
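As a rough illustration of what removing noisy sentence pairs can involve, the sketch below applies a few generic heuristics to English-Thai pairs: dropping empty segments, checking that each side is written in the expected script, bounding the character-length ratio, and removing exact duplicates. It is not the filtering pipeline described in the paper. The function names (`thai_char_ratio`, `filter_pairs`) and all threshold values are assumptions chosen only for the example.

```python
from typing import Iterable, List, Tuple


def thai_char_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Thai Unicode block (U+0E00-U+0E7F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum("\u0e00" <= c <= "\u0e7f" for c in chars) / len(chars)


def filter_pairs(
    pairs: Iterable[Tuple[str, str]],
    min_thai_ratio: float = 0.5,   # assumed threshold for the Thai side
    max_en_thai_ratio: float = 0.1,  # assumed threshold for the English side
    max_len_ratio: float = 3.0,    # assumed bound on character-length ratio
) -> List[Tuple[str, str]]:
    """Keep (english, thai) pairs that pass simple, hypothetical noise heuristics."""
    seen = set()
    kept = []
    for en, th in pairs:
        en, th = en.strip(), th.strip()
        # Drop pairs with an empty side.
        if not en or not th:
            continue
        # The Thai side should be mostly Thai script.
        if thai_char_ratio(th) < min_thai_ratio:
            continue
        # The English side should contain little or no Thai script.
        if thai_char_ratio(en) > max_en_thai_ratio:
            continue
        # Very different lengths often indicate misaligned segments.
        if max(len(en), len(th)) / min(len(en), len(th)) > max_len_ratio:
            continue
        # Drop exact duplicates while preserving first-seen order.
        if (en, th) in seen:
            continue
        seen.add((en, th))
        kept.append((en, th))
    return kept
```

A production pipeline would typically combine checks like these with a semantic similarity score between the two sides. The thresholds here would need tuning per source, since SMS messages and government documents have very different length profiles.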