CL AI IR
Feb 20, 2021

CDA: a Cost Efficient Content-based Multilingual Web Document Aligner

arXiv:2102.10246v132.7801 citations

Originality: Incremental advance

AI Analysis

The work addresses the need for efficient parallel data creation for industrial machine translation systems, particularly for large, noisy web data and low-resource languages. It appears incremental, however, because it builds on existing lexical translation models.

The paper tackles the problem of aligning multilingual web documents to create parallel training data for machine translation. It introduces CDA, which performs comparably to state-of-the-art systems on the WMT-16 Bilingual Document Alignment Shared Task benchmark and proves robust and cost-effective in web-scale experiments involving up to 28 languages and millions of documents.

We introduce a Content-based Document Alignment approach (CDA), an efficient method for aligning multilingual web documents by content in order to create parallel training data for machine translation (MT) systems operating at the industrial level. CDA works in two steps: (i) projecting the documents of a web domain into a shared multilingual space; then (ii) aligning them based on the similarity of their representations in that space. We leverage lexical translation models to build vector representations using TF-IDF. CDA achieves performance comparable with state-of-the-art systems in the WMT-16 Bilingual Document Alignment Shared Task benchmark while operating in multilingual space. In addition, we created two web-scale datasets to examine the robustness of CDA in an industrial setting involving up to 28 languages and millions of documents. The experiments show that CDA is robust and cost-effective, and is significantly superior in (i) processing large and noisy web data and (ii) scaling to new and low-resourced languages.
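
The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy French-to-English dictionary `LEXICON` stands in for a learned lexical translation model, the smoothed TF-IDF weighting is one common variant, and the greedy one-to-one matching in `align` is an assumption, since the abstract does not say how pairs are selected. All function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy lexicon standing in for a lexical translation model.
LEXICON = {"chat": "cat", "noir": "black", "maison": "house", "rouge": "red"}


def project(tokens, lexicon):
    """Step (i): map tokens into the shared vocabulary; unknown tokens are kept."""
    return [lexicon.get(t, t) for t in tokens]


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors for a dict of doc_id -> token list."""
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        vectors[doc_id] = {
            # Smoothed IDF (an assumed variant, not taken from the paper).
            t: (c / len(tokens)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def align(src_docs, tgt_docs, lexicon):
    """Step (ii): score all pairs, then greedily keep the best one-to-one matches."""
    projected = {("src", k): project(v, lexicon) for k, v in src_docs.items()}
    projected.update({("tgt", k): v for k, v in tgt_docs.items()})
    vecs = tfidf_vectors(projected)
    scored = sorted(
        ((cosine(vecs[("src", s)], vecs[("tgt", t)]), s, t)
         for s in src_docs for t in tgt_docs),
        reverse=True,
    )
    pairs, used_src, used_tgt = [], set(), set()
    for score, s, t in scored:
        if score > 0 and s not in used_src and t not in used_tgt:
            pairs.append((s, t))
            used_src.add(s)
            used_tgt.add(t)
    return pairs


src = {"fr1": ["chat", "noir"], "fr2": ["maison", "rouge"]}
tgt = {"en1": ["red", "house"], "en2": ["black", "cat"]}
pairs = align(src, tgt, LEXICON)
```

Scoring every source-target pair is quadratic in the number of documents per web domain, so a web-scale system would need a cheaper candidate search; the sketch only shows the shape of the projection and similarity steps.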
