CL AI IR

Dec 11, 2023

Dense X Retrieval: What Retrieval Granularity Should We Use?

Tong Chen, Hongwei Wang, Sihao Chen, Wenhao Yu, Kaixin Ma, Xinran Zhao, Hongming Zhang, Dong Yu

arXiv:2312.06648v3 | 19.2119 citations | h-index: 84 | Has Code | EMNLP

Originality: Incremental advance

AI Analysis

This work examines the choice of retrieval unit in dense retrieval, a design decision that affects both retrieval and downstream task performance for NLP practitioners. It is an incremental advance over existing methods.

The paper asks which retrieval unit to use when indexing a corpus for dense retrieval in open-domain NLP tasks. It finds that fine-grained propositions significantly outperform passage-level units in retrieval tasks, and that they improve downstream QA performance within a given computation budget.

Dense retrieval has become a prominent method to obtain relevant context or world knowledge in open-domain NLP tasks. When we use a learned dense retriever on a retrieval corpus at inference time, an often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. document, passage, or sentence. We discover that the retrieval unit choice significantly impacts the performance of both retrieval and downstream tasks. Distinct from the typical approach of using passages or sentences, we introduce a novel retrieval unit, proposition, for dense retrieval. Propositions are defined as atomic expressions within text, each encapsulating a distinct factoid and presented in a concise, self-contained natural language format. We conduct an empirical comparison of different retrieval granularities. Our experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks. Moreover, constructing prompts with fine-grained retrieved units for retrieval-augmented language models improves the performance of downstream QA tasks given a specific computation budget.
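To make the idea of a retrieval unit concrete, here is a minimal sketch that indexes the same text once as a single passage and once as propositions, then retrieves against each index. It rests on several assumptions that are not from the paper: the passage is an invented example, the propositions are written by hand, the helper names (`embed`, `cosine`, `build_index`, `retrieve`) are hypothetical, and a toy bag-of-words similarity stands in for a learned dense retriever.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a dense encoder (assumption): a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(units):
    """Index a corpus at whatever granularity the caller chose for `units`."""
    return [(unit, embed(unit)) for unit in units]


def retrieve(index, query, k=1):
    """Return the k units most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [unit for unit, _ in ranked[:k]]


# Invented example passage (assumption, not taken from the paper).
passage = (
    "The tower was completed in 1372. It stands 56 metres tall. "
    "Its tilt was reduced to 4 degrees after restoration work."
)

# Hand-written propositions: each is self-contained and states one fact.
propositions = [
    "The tower was completed in 1372.",
    "The tower stands 56 metres tall.",
    "The tilt of the tower was reduced to 4 degrees after restoration work.",
]

passage_index = build_index([passage])
proposition_index = build_index(propositions)

query = "How tall is the tower?"
top_passage = retrieve(passage_index, query)
top_proposition = retrieve(proposition_index, query)

print("passage unit:    ", top_passage[0])
print("proposition unit:", top_proposition[0])
```

With the passage index, the retrieved unit carries all three facts; with the proposition index, it carries only the one that answers the query. That difference is what matters when a prompt for a retrieval-augmented language model has to fit a fixed budget.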
