IR | CL | May 2, 2024

Distillation for Multilingual Information Retrieval

arXiv:2405.00977v1 | 8 citations | h-index: 32 | Has Code | SIGIR
Originality: Incremental advance
AI Analysis

This work addresses the challenge of training models for multilingual document ranking. It is an incremental improvement over existing cross-language retrieval methods.

The paper tackles multilingual information retrieval (MLIR), where a model must rank documents across multiple languages, and extends the Translate-Distill framework into Multilingual Translate-Distill (MTD). The results show that ColBERT-X models trained with MTD outperform those trained with the previous state-of-the-art approach by 5-25% in nDCG@20 and 15-45% in MAP.
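For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how nDCG@k and average precision are commonly computed for one query. It assumes linear gains and a log2 rank discount, which is one common convention; the paper's exact evaluation settings are not stated here, and the function names are illustrative only. MAP is the mean of average precision over all queries.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_gains, all_gains, k=20):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    ranked_gains: relevance labels in the order the system returned documents.
    all_gains: relevance labels of every judged document for the query.
    Assumption: linear gain; some tools use 2**gain - 1 instead.
    """
    ideal = dcg_at_k(sorted(all_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0

def average_precision(ranked_relevant, num_relevant):
    """Average precision for one query from binary relevance flags."""
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(ranked_relevant, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / num_relevant if num_relevant else 0.0
```

In the MLIR setting, the ranked list mixes documents from several languages, so both metrics reward a model only if its scores are comparable across languages.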

Recent work in cross-language information retrieval (CLIR), where queries and documents are in different languages, has shown the benefit of the Translate-Distill framework that trains a cross-language neural dual-encoder model using translation and distillation. However, Translate-Distill only supports a single document language. Multilingual information retrieval (MLIR), which ranks a multilingual document collection, is harder to train than CLIR because the model must assign comparable relevance scores to documents in different languages. This work extends Translate-Distill and proposes Multilingual Translate-Distill (MTD) for MLIR. We show that ColBERT-X models trained with MTD outperform their counterparts trained with Multilingual Translate-Train, which is the previous state-of-the-art training approach, by 5% to 25% in nDCG@20 and 15% to 45% in MAP. We also show that the model is robust to the way languages are mixed in training batches. Our implementation is available on GitHub.
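The distillation step can be illustrated with a minimal sketch. It assumes the student is trained to match a teacher's score distribution over a set of candidate passages for one query by minimizing a KL divergence; the abstract does not state the loss, so this choice and the names `log_softmax` and `kl_distill_loss` are assumptions made for illustration, not the authors' implementation.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def kl_distill_loss(teacher_scores, student_scores):
    """KL(teacher || student) over one query's candidate passages.

    Assumption: teacher_scores come from a stronger scorer, and
    student_scores come from the dual-encoder scoring translated
    passages, which may be in different languages within one list.
    """
    t_log = log_softmax(teacher_scores)
    s_log = log_softmax(student_scores)
    return sum(math.exp(t) * (t - s) for t, s in zip(t_log, s_log))

# Hypothetical scores for four passages, e.g. translated into different languages.
teacher = [4.0, 2.5, 0.5, -1.0]
student = [3.0, 3.0, 0.0, -0.5]
loss = kl_distill_loss(teacher, student)
```

Because the loss compares distributions over the whole candidate list, a student that scores one language systematically higher than another is penalized, which is one way to encourage comparable scores across languages.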

Code Implementations: 1 repo
