CL · Sep 18, 2020

Unsupervised Parallel Corpus Mining on Web Data

arXiv:2009.08595v1 · 0.52 citations

Originality: Incremental advance

AI Analysis

This work addresses the high cost of human-labeled parallel data for machine translation by enabling unsupervised extraction from web sources. It is an incremental advance, since it builds on existing mining methods.

The paper tackles mining parallel corpora from the Internet without labeled data. It reports performance close to supervised methods on the WMT'14 English-French and WMT'16 English-German benchmarks, and new state-of-the-art results on the WMT'16 English-Romanian and Romanian-English benchmarks, with BLEU scores of 39.81 and 38.95.

With a large amount of parallel data, neural machine translation systems can deliver human-level performance on sentence-level translation. However, having humans label large amounts of parallel data is costly. In contrast, a large amount of human-created parallel text already exists on the Internet. The major difficulty in using it is separating it from the noisy web pages that surround it. Current parallel data mining methods all require labeled parallel data as the training source. In this paper, we present a pipeline that mines parallel corpora from the Internet in an unsupervised manner. On the widely used WMT'14 English-French and WMT'16 English-German benchmarks, a machine translator trained on the data extracted by our pipeline performs very close to supervised results. On the WMT'16 English-Romanian and Romanian-English benchmarks, our system produces new state-of-the-art results, 39.81 and 38.95 BLEU, even compared with supervised approaches.
