CL · Jul 29, 2017

Bilingual Document Alignment with Latent Semantic Indexing

arXiv:1707.09443v1 39.21091 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of aligning bilingual documents for machine translation or information retrieval. Its contribution is incremental: it applies an existing method, cross-lingual Latent Semantic Indexing, to a specific benchmark, the WMT16 Bilingual Document Alignment Task.

The authors tackle the bilingual document alignment task by applying cross-lingual Latent Semantic Indexing to map English and French web pages into a joint semantic space. Combined with URL string similarity, this achieves a recall of about 88% when no in-domain data is used to build the latent semantic model and 93% when such data is included.

We apply cross-lingual Latent Semantic Indexing to the Bilingual Document Alignment Task at WMT16. Reduced-rank singular value decomposition of a bilingual term-document matrix derived from known English/French page pairs in the training data allows us to map monolingual documents into a joint semantic space. Two variants of cosine similarity between the vectors that place each document into the joint semantic space are combined with a measure of string similarity between corresponding URLs to produce 1:1 alignments of English/French web pages in a variety of domains. The system achieves a recall of ca. 88% if no in-domain data is used for building the latent semantic model, and 93% if such data is included. Analysing the system's errors on the training data, we argue that evaluating aligner performance based on exact URL matches under-estimates their true performance and propose an alternative that is able to account for duplicates and near-duplicates in the underlying data.
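The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The following are all assumptions made for illustration: the whitespace tokenizer, raw term counts with no weighting, a single cosine score where the paper uses two variants, `difflib.SequenceMatcher` as the URL string similarity, the `url_weight` mixing parameter, and greedy 1:1 matching. All function names are hypothetical.

```python
import difflib
import numpy as np


def build_vocab(docs):
    """Map each whitespace token to a row index (assumed tokenizer)."""
    vocab = {}
    for doc in docs:
        for tok in doc.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab


def bow(doc, vocab):
    """Raw term-count vector; out-of-vocabulary tokens are ignored."""
    vec = np.zeros(len(vocab))
    for tok in doc.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec


def train_lsi(pairs, rank):
    """Reduced-rank SVD of a bilingual term-document matrix.

    Each column is one known (English, French) page pair: English term
    counts stacked on French term counts. `rank` must not exceed the
    number of non-zero singular values.
    """
    en_vocab = build_vocab([en for en, _ in pairs])
    fr_vocab = build_vocab([fr for _, fr in pairs])
    matrix = np.stack(
        [np.concatenate([bow(en, en_vocab), bow(fr, fr_vocab)]) for en, fr in pairs],
        axis=1,
    )
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    n_en = len(en_vocab)
    return {"en": (en_vocab, u[:n_en, :rank]), "fr": (fr_vocab, u[n_en:, :rank]), "s": s[:rank]}


def fold_in(doc, lang, model):
    """Project a monolingual document into the joint space: d^T U_k S_k^-1."""
    vocab, u_lang = model[lang]
    return bow(doc, vocab) @ u_lang / model["s"]


def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def align(en_pages, fr_pages, model, url_weight=0.3):
    """Greedy 1:1 alignment over (url, text) pages; returns (en_idx, fr_idx) pairs."""
    en_vecs = [fold_in(text, "en", model) for _, text in en_pages]
    fr_vecs = [fold_in(text, "fr", model) for _, text in fr_pages]
    scored = []
    for i, (en_url, _) in enumerate(en_pages):
        for j, (fr_url, _) in enumerate(fr_pages):
            url_sim = difflib.SequenceMatcher(None, en_url, fr_url).ratio()
            score = (1 - url_weight) * cosine(en_vecs[i], fr_vecs[j]) + url_weight * url_sim
            scored.append((score, i, j))
    scored.sort(reverse=True)  # best-scoring candidate pairs first
    used_en, used_fr, result = set(), set(), []
    for _, i, j in scored:
        if i not in used_en and j not in used_fr:
            used_en.add(i)
            used_fr.add(j)
            result.append((i, j))
    return result
```

Calling `train_lsi` on a list of known page-pair texts and then `align` on two lists of `(url, text)` tuples returns index pairs. Because an English page contributes only to the English rows of the term vector, folding it in needs just the English block of the left singular vectors, and likewise for French.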
