CL · Sep 19, 2018

NICT's Corpus Filtering Systems for the WMT18 Parallel Corpus Filtering Task

Rui Wang, Benjamin Marie, Masao Utiyama, Eiichiro Sumita

arXiv:1809.07043v232.01091 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses data quality issues for machine translation researchers and practitioners, but it is incremental as it applies existing classification methods to a new noisy dataset.

The paper tackles the problem of filtering noisy web-crawled German-English parallel data to improve neural machine translation (NMT). NMT systems trained on the sampled subsets (100 million and 10 million words) are reported to achieve promising results, though no specific scores are given.

This paper presents NICT's participation in the WMT18 shared parallel corpus filtering task. The organizers provided a 1-billion-word German-English corpus crawled from the web as part of the Paracrawl project. This corpus is too noisy to build an acceptable neural machine translation (NMT) system. Using the clean data of the WMT18 shared news translation task, we designed several features and trained a classifier to score each sentence pair in the noisy data. Finally, we sampled 100 million and 10 million words and built the corresponding NMT systems. Empirical results show that our NMT systems trained on the sampled data achieve promising performance.
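The score-then-sample pipeline described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' method: the two toy features (length ratio and a copy flag), the hand-set `WEIGHTS` and `BIAS` standing in for a classifier trained on clean data, the greedy top-score selection, and counting the word budget on the English side.

```python
import math

# Hypothetical hand-set parameters; a real system would learn these
# from clean parallel data rather than fix them by hand.
WEIGHTS = [4.0, -6.0]
BIAS = -2.0


def features(src, tgt):
    """Toy features for one sentence pair (illustrative, not the paper's)."""
    s, t = src.split(), tgt.split()
    if not s or not t:
        # Empty side: worst length ratio, and penalized like a copy.
        return [0.0, 1.0]
    length_ratio = min(len(s), len(t)) / max(len(s), len(t))
    is_copy = 1.0 if src.strip().lower() == tgt.strip().lower() else 0.0
    return [length_ratio, is_copy]


def score(src, tgt):
    """Logistic score in (0, 1); higher means more likely a clean pair."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features(src, tgt)))
    return 1.0 / (1.0 + math.exp(-z))


def select_by_budget(pairs, word_budget):
    """Greedily keep the highest-scoring pairs within an English word budget."""
    ranked = sorted(pairs, key=lambda p: score(*p), reverse=True)
    kept, used = [], 0
    for src, tgt in ranked:
        n_words = len(tgt.split())  # assumption: budget counts English tokens
        if used + n_words > word_budget:
            continue  # skip pairs that would exceed the budget
        kept.append((src, tgt))
        used += n_words
    return kept


pairs = [
    ("Das ist ein Test .", "This is a test ."),
    ("Hallo", "Hallo"),
    ("Ja", "This sentence is far too long to be a translation of one word"),
]
subset = select_by_budget(pairs, word_budget=5)
```

With a budget of 5 English words, only the first pair is kept: the untranslated copy and the badly length-mismatched pair both score lower and no longer fit. In the task setting the budgets would be 100 million and 10 million words.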
