IR · CL · Jun 21, 2022

Bridging the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation

arXiv:2206.10128v3 · 88 citations · h-index: 47
Originality: Incremental advance
AI Analysis

The paper addresses a fundamental issue in DSI models for information retrieval and improves performance in both mono-lingual and cross-lingual settings, though the advance is incremental because it builds on the existing DSI paradigm.

The paper tackles the data distribution mismatch between indexing and retrieval in Differentiable Search Index (DSI) models, where long documents are indexed but short queries are used for retrieval, especially in cross-lingual settings. The proposed DSI-QG framework uses generated queries to represent documents during indexing, significantly outperforming the original DSI model on mono-lingual and cross-lingual passage retrieval datasets.

The Differentiable Search Index (DSI) is an emerging paradigm for information retrieval. Unlike traditional retrieval architectures where index and retrieval are two different and separate components, DSI uses a single transformer model to perform both indexing and retrieval. In this paper, we identify and tackle an important issue of current DSI models: the data distribution mismatch that occurs between the DSI indexing and retrieval processes. Specifically, we argue that, at indexing, current DSI methods learn to build connections between the text of long documents and the identifier of the documents, but then retrieval of document identifiers is based on queries that are commonly much shorter than the indexed documents. This problem is further exacerbated when using DSI for cross-lingual retrieval, where document text and query text are in different languages. To address this fundamental problem of current DSI models, we propose a simple yet effective indexing framework for DSI, called DSI-QG. When indexing, DSI-QG represents documents with a number of potentially relevant queries generated by a query generation model and re-ranked and filtered by a cross-encoder ranker. The presence of these queries at indexing allows the DSI models to connect a document identifier to a set of queries, hence mitigating data distribution mismatches present between the indexing and the retrieval phases. Empirical results on popular mono-lingual and cross-lingual passage retrieval datasets show that DSI-QG significantly outperforms the original DSI model.
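The indexing step described above (generate candidate queries per document, re-rank them, keep the best, and pair each kept query with the document identifier) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `build_dsi_qg_pairs`, `toy_generate_queries`, and `toy_score` are hypothetical names, and the two toy functions are crude stand-ins for a trained query generation model and a cross-encoder ranker. The values for the number of generated queries and the cut-off are arbitrary assumptions.

```python
from typing import Callable, Dict, List, Tuple


def build_dsi_qg_pairs(
    docs: Dict[str, str],
    generate_queries: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    num_generated: int = 10,
    keep_top_k: int = 3,
) -> List[Tuple[str, str]]:
    """Build (query, docid) indexing pairs in the spirit of DSI-QG.

    Instead of pairing the long document text with its identifier,
    each document is represented by a few short generated queries.
    """
    pairs: List[Tuple[str, str]] = []
    for docid, text in docs.items():
        candidates = generate_queries(text, num_generated)
        # Drop exact duplicates while preserving generation order.
        unique = list(dict.fromkeys(candidates))
        # Re-rank candidates by relevance to the source document.
        ranked = sorted(unique, key=lambda q: score(q, text), reverse=True)
        # Filter: keep only the highest-scoring queries.
        for query in ranked[:keep_top_k]:
            pairs.append((query, docid))
    return pairs


def toy_generate_queries(text: str, n: int) -> List[str]:
    """Hypothetical stand-in for a query generation model.

    Emits up to n sliding three-word windows from the document.
    """
    words = text.lower().split()
    windows = max(len(words) - 2, 0)
    return [" ".join(words[i:i + 3]) for i in range(min(n, windows))]


def toy_score(query: str, text: str) -> float:
    """Hypothetical stand-in for a cross-encoder ranker.

    Counts query tokens longer than three characters that occur in
    the document, as a crude proxy for relevance.
    """
    doc_tokens = set(text.lower().split())
    return float(sum(1 for t in query.split() if len(t) > 3 and t in doc_tokens))


example_docs = {
    "d1": "the differentiable search index maps queries to document identifiers",
    "d2": "query generation bridges indexing and retrieval",
}
example_pairs = build_dsi_qg_pairs(
    example_docs, toy_generate_queries, toy_score, num_generated=5, keep_top_k=2
)
```

The resulting pairs would then serve as the training examples for the DSI model at indexing time, so that the inputs it sees at indexing resemble the short queries it receives at retrieval. In a cross-lingual setting, the generator would presumably produce queries in the query language rather than the document language; that part is not shown here.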

Code Implementations: 1 repo
