CLIR | Dec 20, 2022

What Are You Token About? Dense Retrieval as Distributions Over the Vocabulary

DeepMind
arXiv:2212.10380v2 | 241 citations | h-index: 59
AI Analysis

This work addresses the interpretability and performance limitations of dense retrievers for information retrieval tasks, offering an incremental improvement with practical benefits.

The authors investigated how dual encoders represent text in dense retrieval by projecting vector representations into the model's vocabulary space. They found that these projections carry semantic information and help explain failure cases such as poor handling of tail entities. They then proposed enriching query and passage representations with lexical information at inference time, which significantly improved zero-shot performance on the BEIR benchmark.

Dual encoders are now the dominant architecture for dense retrieval. Yet, we have little understanding of how they represent text, and why this leads to good performance. In this work, we shed light on this question via distributions over the vocabulary. We propose to interpret the vector representations produced by dual encoders by projecting them into the model's vocabulary space. We show that the resulting projections contain rich semantic information, and draw a connection between them and sparse retrieval. We find that this view can offer an explanation for some of the failure cases of dense retrievers. For example, we observe that the inability of models to handle tail entities is correlated with a tendency of the token distributions to forget some of the tokens of those entities. We leverage this insight and propose a simple way to enrich query and passage representations with lexical information at inference time, and show that this significantly improves performance compared to the original model in zero-shot settings, and specifically on the BEIR benchmark.
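The two ideas in the abstract, projecting a dense vector into vocabulary space and enriching it with lexical information, can be illustrated with a toy sketch. This is not the authors' implementation: the four-word vocabulary, the 3-dimensional embedding table, the function names (`project_to_vocab`, `top_tokens`, `enrich`) and the mean-of-token-embeddings enrichment rule are all assumptions made for illustration. The abstract does not specify the projection or the weighting scheme the paper actually uses.

```python
import math

# Hypothetical toy vocabulary and token embeddings (3 dimensions for readability).
VOCAB = ["paris", "france", "pizza", "guitar"]
TOKEN_EMB = [
    [1.0, 0.0, 0.0],  # paris
    [0.8, 0.2, 0.0],  # france
    [0.0, 1.0, 0.0],  # pizza
    [0.0, 0.0, 1.0],  # guitar
]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_to_vocab(vec):
    """Turn a dense vector into a distribution over VOCAB.

    Assumption: the projection is a softmax over dot products with the
    token embeddings; a real model's projection may differ.
    """
    logits = [dot(vec, emb) for emb in TOKEN_EMB]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {tok: x / total for tok, x in zip(VOCAB, exps)}


def top_tokens(vec, k=2):
    """Return the k tokens with the highest projected probability."""
    dist = project_to_vocab(vec)
    return sorted(dist, key=dist.get, reverse=True)[:k]


def enrich(vec, tokens, weight=1.0):
    """Add the mean embedding of the text's own tokens to its dense vector.

    Assumption: a plain weighted mean stands in for whatever lexical
    enrichment the paper applies at inference time.
    """
    known = [TOKEN_EMB[VOCAB.index(t)] for t in tokens if t in VOCAB]
    if not known:
        return list(vec)
    mean = [sum(col) / len(known) for col in zip(*known)]
    return [v + weight * m for v, m in zip(vec, mean)]


# A made-up dense vector for a text that mentions "guitar" but whose
# projection "forgets" that token.
dense = [0.5, 0.6, 0.4]
print(top_tokens(dense, k=1))
print(top_tokens(enrich(dense, ["guitar"]), k=1))
```

In this toy setup the original vector projects most strongly onto an unrelated token, and adding the embedding of the text's own token moves that token to the top of the distribution. That mirrors, in miniature, the "forgotten token" failure and the lexical fix the abstract describes.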

Code Implementations: 1 repo
