IR CL · Aug 16, 2023

Pre-training with Large Language Model-based Document Expansion for Dense Passage Retrieval

Guangyuan Ma, Xing Wu, Peng Wang, Zijia Lin, Songlin Hu

arXiv:2308.08285v1 · 10.911 citations · h-index: 39

Originality: Incremental advance

AI Analysis

This work targets retrieval performance in web-search applications and offers a way to reduce reliance on human-labeled data. It is an incremental advance, since it builds on existing LLM and pre-training techniques.

The paper improves dense passage retrieval by pre-training with LLM-based document expansion, reporting significant gains on large-scale web-search tasks along with strong zero-shot and out-of-domain retrieval abilities.

In this paper, we systematically study the potential of pre-training with Large Language Model (LLM)-based document expansion for dense passage retrieval. Concretely, we leverage the capabilities of LLMs for document expansion, i.e., query generation, and effectively transfer expanded knowledge to retrievers using pre-training strategies tailored for passage retrieval. These strategies include contrastive learning and bottlenecked query generation. Furthermore, we incorporate a curriculum learning strategy to reduce the reliance on LLM inferences. Experimental results demonstrate that pre-training with LLM-based document expansion significantly boosts the retrieval performance on large-scale web-search tasks. Our work shows strong zero-shot and out-of-domain retrieval abilities, making it more widely applicable for retrieval when initializing with no human-labeled data.
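
The abstract names contrastive learning as one of the pre-training strategies but does not spell out the objective. As a rough illustration only, the sketch below shows a generic in-batch-negatives contrastive (InfoNCE) loss, where each LLM-generated query is paired with the passage it was generated from and every other passage in the batch serves as a negative. The function names, the dot-product similarity, and the `temperature` parameter are assumptions made for this example. It is not the authors' implementation.

```python
import math


def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(query_vecs, passage_vecs, temperature=1.0):
    """Mean InfoNCE loss over a batch (hypothetical helper, for illustration).

    query_vecs[i] is assumed to embed a query generated for passage i,
    so passage_vecs[i] is its positive and all other passages are negatives.
    """
    if not query_vecs or len(query_vecs) != len(passage_vecs):
        raise ValueError("need equally sized, non-empty batches")

    total = 0.0
    for i, q in enumerate(query_vecs):
        # Similarity of this query to every passage in the batch.
        logits = [dot(q, p) / temperature for p in passage_vecs]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the positive passage.
        total += log_norm - logits[i]
    return total / len(query_vecs)


# Toy embeddings: matched pairs should score a lower loss than mismatched ones.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_contrastive_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_contrastive_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
```

In a real retriever the vectors would come from an encoder and the loss would be minimized by gradient descent. Here the toy vectors only show that the loss falls when each query is closest to its own passage.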
