CLAIIR | Dec 11, 2023

Dense X Retrieval: What Retrieval Granularity Should We Use?

arXiv:2312.06648v3 | 119 citations | h-index: 17 | EMNLP
Originality: Incremental advance
AI Analysis

This work addresses a design choice in dense retrieval that impacts performance for NLP practitioners, offering an incremental improvement over existing methods.

The paper tackles the problem of choosing the optimal retrieval unit for dense retrieval in open-domain NLP tasks, finding that using fine-grained propositions as retrieval units significantly outperforms passage-level units in retrieval tasks and improves downstream QA performance within a given computation budget.

Dense retrieval has become a prominent method to obtain relevant context or world knowledge in open-domain NLP tasks. When we use a learned dense retriever on a retrieval corpus at inference time, an often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. document, passage, or sentence. We discover that the retrieval unit choice significantly impacts the performance of both retrieval and downstream tasks. Distinct from the typical approach of using passages or sentences, we introduce a novel retrieval unit, proposition, for dense retrieval. Propositions are defined as atomic expressions within text, each encapsulating a distinct factoid and presented in a concise, self-contained natural language format. We conduct an empirical comparison of different retrieval granularity. Our experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks. Moreover, constructing prompts with fine-grained retrieved units for retrieval-augmented language models improves the performance of downstream QA tasks given a specific computation budget.
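To make the granularity choice concrete, here is a minimal sketch that indexes the same text once as a single passage and once as propositions, then retrieves the top unit for a query. Everything in it is an illustrative assumption rather than the paper's setup: `embed` is a toy bag-of-words stand-in for a learned dense encoder, the example passage and its propositions are hand-written, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a dense encoder: L2-normalised bag-of-words counts."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def score(query_vec, unit_vec):
    """Dot product of two unit-length sparse vectors (cosine similarity)."""
    return sum(w * unit_vec.get(tok, 0.0) for tok, w in query_vec.items())


def build_index(units):
    """Index a corpus at whatever granularity `units` was split into."""
    return [(unit, embed(unit)) for unit in units]


def retrieve(index, query, k=1):
    """Return the k highest-scoring retrieval units for the query."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: score(query_vec, item[1]), reverse=True)
    return [unit for unit, _ in ranked[:k]]


# Hand-written example text (an assumption, not data from the paper).
passage = (
    "The Leaning Tower of Pisa is a bell tower in Italy. "
    "It leans about four degrees. Construction began in 1173."
)

# Hand-written propositions: each one is a single self-contained fact,
# with the pronoun "It" resolved to the entity it refers to.
propositions = [
    "The Leaning Tower of Pisa is a bell tower in Italy.",
    "The Leaning Tower of Pisa leans about four degrees.",
    "Construction of the Leaning Tower of Pisa began in 1173.",
]

passage_index = build_index([passage])
proposition_index = build_index(propositions)

query = "When did construction of the Leaning Tower of Pisa begin?"
top_passage = retrieve(passage_index, query)
top_proposition = retrieve(proposition_index, query)

print("passage-level unit:    ", top_passage[0])
print("proposition-level unit:", top_proposition[0])
```

With the proposition index, the retrieved unit contains only the fact the query asks about, so it takes up less of a fixed prompt budget than the full passage would. A toy lexical encoder cannot reproduce the retrieval gains the abstract reports; the sketch only shows where the retrieval unit enters the pipeline.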

Code Implementations: 3 repos