IR · CL · Dec 20, 2022

Precise Zero-Shot Dense Retrieval without Relevance Labels

CMU
arXiv:2212.10496v1 · 705 citations · h-index: 87
Originality: Highly original
AI Analysis

This paper addresses the challenge of building effective zero-shot dense retrieval systems for tasks such as web search and QA when no labeled data is available. It proposes a novel approach rather than an incremental improvement.

The paper tackles the problem of zero-shot dense retrieval without relevance labels by proposing HyDE, which generates a hypothetical document from a query and then retrieves similar real documents using an unsupervised encoder. Experiments show HyDE significantly outperforms state-of-the-art unsupervised retrievers and achieves performance comparable to fine-tuned models across multiple tasks and languages.

While dense retrieval has been shown effective and efficient across tasks and languages, it remains difficult to create effective fully zero-shot dense retrieval systems when no relevance label is available. In this paper, we recognize the difficulty of zero-shot learning and encoding relevance. Instead, we propose to pivot through Hypothetical Document Embeddings (HyDE). Given a query, HyDE first zero-shot instructs an instruction-following language model (e.g. InstructGPT) to generate a hypothetical document. The document captures relevance patterns but is unreal and may contain false details. Then, an unsupervised contrastively learned encoder (e.g. Contriever) encodes the document into an embedding vector. This vector identifies a neighborhood in the corpus embedding space, where similar real documents are retrieved based on vector similarity. This second step grounds the generated document to the actual corpus, with the encoder's dense bottleneck filtering out the incorrect details. Our experiments show that HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever and shows strong performance comparable to fine-tuned retrievers, across various tasks (e.g. web search, QA, fact verification) and languages (e.g. sw, ko, ja).
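The two-step pipeline described in the abstract (generate a hypothetical document, then encode it and search the corpus by vector similarity) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `generate_hypothetical_document` is a hypothetical stand-in for an instruction-following language model that returns a canned answer, `encode` is a toy normalized bag-of-words encoder standing in for a contrastively trained encoder such as Contriever, and `CORPUS` is made-up example data.

```python
import math
from collections import Counter

# Made-up example corpus (assumption, for illustration only).
CORPUS = [
    "the immune system learns to recognize a pathogen after exposure to an antigen",
    "stock markets fell sharply on monday",
    "contrastive learning trains encoders without labels",
]


def generate_hypothetical_document(query: str) -> str:
    """Hypothetical stand-in for an instruction-following LM.

    A real system would prompt a model to write a passage that answers
    the query. Here we return a canned passage, falling back to the
    query itself. The passage may contain false details.
    """
    canned = {
        "how do vaccines work": (
            "vaccines train the immune system to recognize a pathogen "
            "by exposing it to a harmless antigen"
        ),
    }
    return canned.get(query, query)


def encode(text: str) -> dict[str, float]:
    """Toy encoder: L2-normalized bag-of-words vector (sparse dict).

    Stands in for a dense unsupervised encoder; it is NOT Contriever.
    """
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Inner product of two sparse vectors."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def hyde_retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Retrieve the k real documents closest to the hypothetical one."""
    hypothetical = generate_hypothetical_document(query)  # step 1: generate
    vector = encode(hypothetical)                         # step 2: encode
    ranked = sorted(
        corpus,
        key=lambda doc: similarity(vector, encode(doc)),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    for doc in hyde_retrieve("how do vaccines work", CORPUS, k=1):
        print(doc)
```

Note that the search vector comes from the generated passage rather than from the query itself, and only real corpus documents are ever returned, which is how the second step grounds the generation in the actual corpus.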

Code Implementations: 3 repos