IR · Sep 22, 2020

Embedding-based Zero-shot Retrieval through Query Generation

Davis Liang, Peng Xu, Siamak Shakeri, Cicero Nogueira dos Santos, Ramesh Nallapati, Zhiheng Huang, Bing Xiang

arXiv:2009.10270v117.353 citationsHas Code

Originality: Highly original

AI Analysis

This work addresses data scarcity in neural passage retrieval, offering a practical way to improve retrieval accuracy without extensive labeled datasets.

The paper tackles the problem of training neural retrieval models when labeled data is scarce by proposing a method for generating synthetic training data. The resulting model significantly outperforms BM25 on 5 of the 6 datasets tested, by an average of 2.45 points in Recall@1.

Passage retrieval addresses the problem of locating relevant passages, usually from a large corpus, given a query. In practice, lexical term-matching algorithms like BM25 are popular choices for retrieval owing to their efficiency. However, term-based matching algorithms often miss relevant passages that have no lexical overlap with the query and cannot be finetuned to downstream datasets. In this work, we consider the embedding-based two-tower architecture as our neural retrieval model. Since labeled data can be scarce and because neural retrieval models require vast amounts of data to train, we propose a novel method for generating synthetic training data for retrieval. Our system produces remarkable results, significantly outperforming BM25 on 5 out of 6 datasets tested, by an average of 2.45 points for Recall@1. In some cases, our model trained on synthetic data can even outperform the same model trained on real data.
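To make the Recall@1 metric and the two-tower scoring idea concrete, here is a minimal sketch. It assumes that queries and passages have already been encoded into vectors by two separate encoders, that relevance is scored with a dot product, and that each query has exactly one gold passage. The function names, the toy vectors, and the single-gold-passage definition of Recall@k are illustrative assumptions, not the paper's implementation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot-product similarity between a query and a passage embedding."""
    return sum(x * y for x, y in zip(a, b))


def rank_passages(query_vec: Vector, passage_vecs: List[Vector]) -> List[int]:
    """Return passage indices sorted from highest to lowest score."""
    scores = [dot(query_vec, p) for p in passage_vecs]
    return sorted(range(len(passage_vecs)), key=lambda i: scores[i], reverse=True)


def recall_at_k(query_vecs: List[Vector], passage_vecs: List[Vector],
                gold: List[int], k: int = 1) -> float:
    """Fraction of queries whose gold passage appears in the top k results.

    Assumes one gold passage index per query (hypothetical setup).
    """
    hits = 0
    for query_vec, gold_idx in zip(query_vecs, gold):
        if gold_idx in rank_passages(query_vec, passage_vecs)[:k]:
            hits += 1
    return hits / len(query_vecs)


# Toy, hand-written embeddings standing in for encoder outputs.
passage_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query_vecs = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
gold = [0, 1, 2]  # index of the relevant passage for each query

r1 = recall_at_k(query_vecs, passage_vecs, gold, k=1)
r2 = recall_at_k(query_vecs, passage_vecs, gold, k=2)
print(f"Recall@1 = {r1:.3f}, Recall@2 = {r2:.3f}")
```

In this toy setup the third query ranks passage 0 above its gold passage 2, so it counts as a miss at k=1 and a hit at k=2. Because passage vectors do not depend on the query, they can be computed once and indexed ahead of time, which is the usual motivation for scoring with two independent encoders.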

