CLIR, Jun 21, 2022

Questions Are All You Need to Train a Dense Passage Retriever

MILA, UW
arXiv:2206.10658v4, 252 citations, h-index: 116
Originality: Highly original
AI Analysis

This work addresses the challenge of reducing reliance on large supervised datasets for open-domain tasks such as Open QA, offering a more scalable way to train dense retrievers.

The paper tackles the problem of training dense retrieval models without labeled data. It introduces ART, a corpus-level autoencoding approach that trains the retriever through question reconstruction, and reports state-of-the-art results on multiple QA retrieval benchmarks.

We introduce ART, a new corpus-level autoencoding approach for training dense retrieval models that does not require any labeled training data. Dense retrieval is a central challenge for open-domain tasks, such as Open QA, where state-of-the-art methods typically require large supervised datasets with custom hard-negative mining and denoising of positive examples. ART, in contrast, only requires access to unpaired inputs and outputs (e.g. questions and potential answer documents). It uses a new document-retrieval autoencoding scheme, where (1) an input question is used to retrieve a set of evidence documents, and (2) the documents are then used to compute the probability of reconstructing the original question. Training for retrieval based on question reconstruction enables effective unsupervised learning of both document and question encoders, which can be later incorporated into complete Open QA systems without any further finetuning. Extensive experiments demonstrate that ART obtains state-of-the-art results on multiple QA retrieval benchmarks with only generic initialization from a pre-trained language model, removing the need for labeled data and task-specific losses.
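The two-step scheme described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes dot-product retrieval scores, a precomputed array `recon_loglik` standing in for the log-probability of reconstructing the question from each document (which in practice would come from a pre-trained language model), and a KL-divergence training signal between the two resulting distributions. The function name `art_style_loss` and all inputs are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def art_style_loss(q_vec, doc_vecs, recon_loglik, k=2):
    """Sketch of a question-reconstruction training signal (assumed form).

    q_vec:        question embedding, shape (dim,)
    doc_vecs:     document embeddings, shape (num_docs, dim)
    recon_loglik: assumed stand-in for log p(question | document), shape (num_docs,)
    """
    # Step 1: use the question to retrieve the top-k evidence documents.
    scores = doc_vecs @ q_vec
    top_k = np.argsort(-scores)[:k]
    retriever_dist = softmax(scores[top_k])

    # Step 2: turn the reconstruction log-likelihoods of the retrieved
    # documents into a target distribution over those documents.
    target_dist = softmax(recon_loglik[top_k])

    # Assumed loss: KL(target || retriever), which is zero when the
    # retriever ranks documents exactly as the reconstruction scores do.
    kl = np.sum(target_dist * (np.log(target_dist) - np.log(retriever_dist)))
    return top_k, kl

q = np.array([1.0, 0.0])
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
loglik = np.array([-1.0, -5.0, -1.0])  # made-up values for illustration
top_k, loss = art_style_loss(q, docs, loglik, k=2)
```

In a real system the gradient of this loss would update the question and document encoders, while the model scoring the reconstruction would supply only the targets. No labeled question-document pairs appear anywhere in the computation.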

Code Implementations: 1 repo
