Don't Retrieve, Generate: Prompting LLMs for Synthetic Training Data in Dense Retrieval
This work addresses the challenge of training dense retrieval models when the full corpus is unavailable, offering researchers and practitioners a corpus-free alternative, though it is incremental in that it builds on existing LLM and retrieval techniques.
The paper tackles the problem of training dense retrieval models without needing full corpus access by proposing an end-to-end pipeline where an LLM generates queries and hard negative examples from passages, using a dataset of 7,250 arXiv abstracts. It shows that this corpus-free method outperforms lexical baselines and achieves performance comparable to cross-encoder-based methods on BEIR benchmarks.
Training effective dense retrieval models typically relies on hard negative (HN) examples mined from large document corpora using methods such as BM25 or cross-encoders (CE), which require full corpus access. We propose a corpus-free alternative: an end-to-end pipeline where a Large Language Model (LLM) first generates a query from a passage and then produces a hard negative example using only the generated query text. Our dataset comprises 7,250 arXiv abstracts spanning diverse domains including mathematics, physics, computer science, and related fields, serving as positive passages for query generation. We evaluate two fine-tuning configurations of DistilBERT for dense retrieval: one using LLM-generated hard negatives conditioned solely on the query, and another using negatives generated with both the query and its positive document as context. Compared to traditional corpus-based mining methods (LLM Query $\rightarrow$ BM25 HN and LLM Query $\rightarrow$ CE HN) on multiple BEIR benchmark datasets, our all-LLM pipeline outperforms strong lexical mining baselines and achieves performance comparable to cross-encoder-based methods, demonstrating the potential of corpus-free hard negative generation for retrieval model training.
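The two-step generation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt wording, the `Triplet` container, the `make_triplet` helper, and the `use_positive_context` flag are all hypothetical names introduced here, and `llm` stands for any function that maps a prompt string to generated text. The flag switches between the two configurations the abstract mentions (negative conditioned on the query alone, or on the query plus its positive passage).

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumption: any callable mapping a prompt to generated text can play the LLM role.
LLM = Callable[[str], str]


@dataclass(frozen=True)
class Triplet:
    """One (query, positive, hard negative) training example."""
    query: str
    positive: str
    negative: str


def query_prompt(passage: str) -> str:
    # Hypothetical prompt wording; the paper's actual prompt is not given in the abstract.
    return (
        "Write one search query that this passage answers.\n\n"
        f"Passage: {passage}\n\nQuery:"
    )


def negative_prompt(query: str, positive: Optional[str] = None) -> str:
    # Hypothetical prompt wording. With positive=None the negative is
    # conditioned on the query text only; otherwise the positive passage
    # is supplied as extra context.
    prompt = (
        "Write a passage that looks topically relevant to the query "
        "but does not answer it.\n\n"
        f"Query: {query}"
    )
    if positive is not None:
        prompt += f"\n\nRelevant passage (do not paraphrase it): {positive}"
    return prompt + "\n\nHard negative passage:"


def make_triplet(passage: str, llm: LLM, use_positive_context: bool = False) -> Triplet:
    """Generate a query from the passage, then a hard negative for that query."""
    query = llm(query_prompt(passage)).strip()
    context = passage if use_positive_context else None
    negative = llm(negative_prompt(query, context)).strip()
    return Triplet(query=query, positive=passage, negative=negative)


def echo_llm(prompt: str) -> str:
    # Toy stand-in so the sketch runs without a model; replace with a real client.
    return f"[generated from a {len(prompt)}-character prompt]"


if __name__ == "__main__":
    print(make_triplet("An arXiv abstract would go here.", echo_llm))
```

Triplets built this way could then feed a standard contrastive or triplet objective when fine-tuning a bi-encoder such as DistilBERT. No corpus index is consulted at any step, which is the property the abstract emphasizes.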