CL AI Dec 15, 2021

Learning Cross-Lingual IR from an English Retriever

Yulong Li, Martin Franz, Md Arafat Sultan, Bhavani Iyer, Young-Suk Lee, Avirup Sil

arXiv:2112.08185v330.3632 citations Has Code

Originality: Incremental advance

AI Analysis

This paper addresses efficient and accurate cross-lingual retrieval for multilingual applications. It is an incremental improvement over existing methods.

The paper tackles cross-lingual information retrieval by introducing DR.DECR, a system trained via multi-stage knowledge distillation from an English-retriever teacher. It achieves higher accuracy than direct fine-tuning on labeled CLIR data and, at the time of writing, is the best single-model retriever on the XOR-TyDi benchmark.

We present DR.DECR (Dense Retrieval with Distillation-Enhanced Cross-Lingual Representation), a new cross-lingual information retrieval (CLIR) system trained using multi-stage knowledge distillation (KD). The teacher of DR.DECR relies on a highly effective but computationally expensive two-stage inference process consisting of query translation and monolingual IR, while the student, DR.DECR, executes a single CLIR step. We teach DR.DECR powerful multilingual representations as well as CLIR by optimizing two corresponding KD objectives. Learning useful representations of non-English text from an English-only retriever is accomplished through a cross-lingual token alignment algorithm that relies on the representation capabilities of the underlying multilingual encoders. In both in-domain and zero-shot out-of-domain evaluation, DR.DECR demonstrates far superior accuracy over direct fine-tuning with labeled CLIR data. It is also the best single-model retriever on the XOR-TyDi benchmark at the time of this writing.
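
As a rough illustration of the two KD objectives described in the abstract, the sketch below computes a retrieval-level loss and a representation-level loss on plain NumPy arrays. It is not the authors' implementation: the function names, the KL-divergence and squared-error loss forms, and the greedy cosine-similarity token alignment are all assumptions chosen for simplicity, and the abstract does not specify any of them.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def retrieval_kd_loss(teacher_scores, student_scores, temperature=1.0):
    """KL(teacher || student) over candidate passages, averaged over queries.

    Assumed setup: teacher_scores come from the translate-then-retrieve
    pipeline, student_scores from the single-step cross-lingual retriever.
    Both have shape (num_queries, num_passages).
    """
    p = softmax(teacher_scores / temperature)
    log_p = np.log(p + 1e-12)
    log_q = np.log(softmax(student_scores / temperature) + 1e-12)
    return float(np.mean(np.sum(p * (log_p - log_q), axis=-1)))


def align_tokens(src_emb, tgt_emb):
    """Hypothetical greedy alignment: for each source token, return the
    index of the most cosine-similar target token."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    return np.argmax(src @ tgt.T, axis=1)


def representation_kd_loss(teacher_en_emb, student_xx_emb, alignment):
    """Mean squared distance between each teacher (English) token vector
    and the student vector of its aligned non-English token."""
    diff = teacher_en_emb - student_xx_emb[alignment]
    return float(np.mean(np.sum(diff ** 2, axis=1)))
```

In this sketch the alignment would be computed once from multilingual encoder embeddings of a parallel sentence pair, and the two losses would then be minimized in separate training stages or combined with a weighting that the abstract does not state.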
