CL · AI · Dec 15, 2021

Learning Cross-Lingual IR from an English Retriever

arXiv:2112.08185v3632 citations
Originality: Incremental advance
AI Analysis

This paper addresses efficient and accurate cross-lingual retrieval for multilingual applications and represents an incremental improvement over existing methods.

The paper tackles cross-lingual information retrieval by introducing DR.DECR, a system trained via multi-stage knowledge distillation from a teacher that translates the query and then runs monolingual retrieval. DR.DECR achieves higher accuracy than direct fine-tuning with labeled CLIR data, and was the best single-model retriever on the XOR-TyDi benchmark at the time of writing.

We present DR.DECR (Dense Retrieval with Distillation-Enhanced Cross-Lingual Representation), a new cross-lingual information retrieval (CLIR) system trained using multi-stage knowledge distillation (KD). The teacher of DR.DECR relies on a highly effective but computationally expensive two-stage inference process consisting of query translation and monolingual IR, while the student, DR.DECR, executes a single CLIR step. We teach DR.DECR powerful multilingual representations as well as CLIR by optimizing two corresponding KD objectives. Learning useful representations of non-English text from an English-only retriever is accomplished through a cross-lingual token alignment algorithm that relies on the representation capabilities of the underlying multilingual encoders. In both in-domain and zero-shot out-of-domain evaluation, DR.DECR demonstrates far superior accuracy over direct fine-tuning with labeled CLIR data. It is also the best single-model retriever on the XOR-TyDi benchmark at the time of this writing.
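
The abstract does not state the exact form of the KD objectives, so the following is only a minimal sketch of one common way a retrieval-level distillation loss can be written: the student's score distribution over candidate passages is pulled toward the teacher's with a KL divergence. The function names, the temperature parameter, and the example scores are all hypothetical illustrations, not taken from the paper.

```python
import math

def softmax(scores, temperature=1.0):
    """Convert raw relevance scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_scores, student_scores, temperature=1.0):
    """KL(teacher || student) over one query's candidate passages.

    Assumption: a generic distillation loss, not the paper's stated objective.
    """
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical relevance scores for three candidate passages.
teacher = [4.0, 1.0, 0.5]  # e.g. from a translate-then-retrieve teacher
student = [2.0, 1.5, 1.0]  # e.g. from a single-step cross-lingual student

loss_far = kd_loss(teacher, student)   # positive: the rankings disagree in shape
loss_same = kd_loss(teacher, teacher)  # zero: the distributions are identical
```

In this sketch, minimizing the loss moves the student's ranking distribution toward the teacher's without requiring the student to translate the query at inference time, which matches the single-step CLIR setup the abstract describes.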

Code Implementations: 1 repo