Embedding-based Zero-shot Retrieval through Query Generation
This work addresses the data scarcity problem in neural passage retrieval, offering a practical way to improve retrieval accuracy without extensive labeled datasets.
The paper tackles the problem of training neural retrieval models when labeled data is scarce by proposing a method for generating synthetic training data. The resulting model significantly outperforms BM25 on 5 of the 6 datasets tested, with an average gain of 2.45 points in Recall@1.
Passage retrieval addresses the problem of locating relevant passages, usually from a large corpus, given a query. In practice, lexical term-matching algorithms like BM25 are popular choices for retrieval owing to their efficiency. However, term-based matching algorithms often miss relevant passages that have no lexical overlap with the query and cannot be fine-tuned to downstream datasets. In this work, we consider the embedding-based two-tower architecture as our neural retrieval model. Because labeled data can be scarce and neural retrieval models require vast amounts of data to train, we propose a novel method for generating synthetic training data for retrieval. Our system produces remarkable results, significantly outperforming BM25 on 5 out of 6 datasets tested, by an average of 2.45 points for Recall@1. In some cases, our model trained on synthetic data can even outperform the same model trained on real data.
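
To make the retrieval setup and the reported metric concrete, here is a minimal sketch of how a two-tower retriever scores passages and how Recall@k could be computed. It is illustrative only: the embeddings are hand-written toy vectors standing in for the outputs of the query and passage encoders, dot-product scoring and a single gold passage per query are assumptions, and the names `retrieve` and `recall_at_k` are hypothetical helpers, not the paper's code.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product between a query embedding and a passage embedding."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(query_vec: Sequence[float],
             passage_vecs: Sequence[Sequence[float]],
             k: int = 1) -> List[int]:
    """Return the indices of the k highest-scoring passages (assumed dot-product scoring)."""
    scores = [dot(query_vec, p) for p in passage_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def recall_at_k(query_vecs: Sequence[Sequence[float]],
                passage_vecs: Sequence[Sequence[float]],
                gold: Sequence[int],
                k: int = 1) -> float:
    """Fraction of queries whose gold passage appears in the top k.

    Assumes exactly one gold passage index per query.
    """
    hits = sum(
        1 for q, g in zip(query_vecs, gold)
        if g in retrieve(q, passage_vecs, k)
    )
    return hits / len(query_vecs)


# Toy embeddings standing in for the two towers' outputs (not real model output).
passages = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
queries = [[0.9, 0.1], [0.1, 0.9], [0.2, 0.8]]
gold = [0, 1, 2]  # index of the relevant passage for each query

r1 = recall_at_k(queries, passages, gold, k=1)
r2 = recall_at_k(queries, passages, gold, k=2)
```

In this toy setup the third query ranks its gold passage second, so `r1` is 2/3 while `r2` is 1.0. In practice the passage embeddings would be precomputed once and searched with a nearest-neighbor index rather than scored exhaustively.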