CL, IR, ML
Sep 29, 2015

Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs

arXiv:1509.08881v1
53 citations
Originality: Incremental advance
AI Analysis

This work addresses the practical problem of limited parallel data for machine translation and cross-lingual retrieval. It offers an incremental improvement by mining parallel sentences from abundant non-parallel multilingual sources.

The researchers tackled the scarcity of parallel sentences for cross-lingual applications by developing a method to build subject-aligned comparable corpora from Wikipedia and to extract truly parallel sentences from noisy data. They describe a specialized tool for this task and a machine translation system trained and adapted to give the filter additional similarity information.

Parallel sentences are a relatively scarce but extremely useful resource for many applications, including cross-lingual retrieval and statistical machine translation. This research explores our methodology for mining such data from previously obtained comparable corpora. The task is highly practical because non-parallel multilingual data exist in far greater quantities than parallel corpora, yet parallel sentences are the much more useful resource. Here we propose a web crawling method for building subject-aligned comparable corpora from Wikipedia articles. We also introduce a method for extracting truly parallel sentences by filtering them out of noisy or merely comparable sentence pairs. We describe our implementation of a specialized tool for this task, as well as the training and adaptation of a machine translation system that supplies our filter with additional information about the similarity of comparable sentence pairs.
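
The filtering idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the paper's actual filter, so everything here is an assumption made for illustration: the token-level Jaccard score, the 0.6 threshold, the function names, and the toy word-by-word lexicon that stands in for a trained machine translation system.

```python
from typing import Callable, Iterable, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two sentences (case-insensitive)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def filter_parallel(
    pairs: Iterable[Tuple[str, str]],
    translate: Callable[[str], str],
    threshold: float = 0.6,  # assumed cut-off, not taken from the paper
) -> List[Tuple[str, str, float]]:
    """Keep candidate pairs whose translated source is close to the target."""
    kept = []
    for src, tgt in pairs:
        score = jaccard(translate(src), tgt)
        if score >= threshold:
            kept.append((src, tgt, score))
    return kept


# Hypothetical stand-in for an MT system: a tiny word-by-word lexicon.
TOY_LEXICON = {"kot": "cat", "pije": "drinks", "mleko": "milk",
               "pies": "dog", "biega": "runs"}


def toy_translate(sentence: str) -> str:
    """Translate word by word, leaving unknown words unchanged."""
    return " ".join(TOY_LEXICON.get(w, w) for w in sentence.lower().split())


candidates = [
    ("kot pije mleko", "cat drinks milk"),        # a plausible parallel pair
    ("pies biega", "the weather is nice today"),  # merely co-occurring text
]
result = filter_parallel(candidates, toy_translate)
for src, tgt, score in result:
    print(f"{score:.2f}\t{src}\t{tgt}")
```

In this sketch the translation function is passed in as a parameter, so a real MT system could replace the toy lexicon without changing the filter. A practical filter would likely combine several similarity signals rather than rely on a single overlap score.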

