IR CL
Apr 12, 2018

Learning Multilingual Embeddings for Cross-Lingual Information Retrieval in the Presence of Topically Aligned Corpora

Mitodru Niyogi, Kripabandhu Ghosh, Arnab Bhattacharya

arXiv:1804.04475v11.72 citations, h-index: 27

Originality: Incremental advance

AI Analysis

This paper addresses cross-lingual information retrieval for researchers and practitioners working in multilingual contexts, offering an incremental improvement by using topically aligned corpora instead of parallel corpora.

The paper tackles cross-lingual information retrieval without aligned parallel corpora by learning multilingual embeddings from topically aligned corpora. On FIRE datasets for Bangla, Hindi, and English, the approach outperforms the state-of-the-art method on IR evaluation measures and also has lower time requirements.

Cross-lingual information retrieval is a challenging task in the absence of aligned parallel corpora. In this paper, we address this problem by considering topically aligned corpora designed for evaluating an IR setup. To emphasize, we use neither sentence-aligned nor document-aligned corpora, nor do we use any language-specific resources such as dictionaries, thesauri, or grammar rules. Instead, we embed words into a common space and learn word correspondences directly from there. We test our proposed approach for bilingual IR on standard FIRE datasets for Bangla, Hindi, and English. The proposed method is superior to the state-of-the-art method not only on IR evaluation measures but also in terms of time requirements. We also extend our method successfully to the trilingual setting.
