InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval
This work addresses the need for efficient and accessible dataset generation in information retrieval, offering an incremental improvement over prior methods by leveraging open-source models.
The authors tackled the problem of generating synthetic query-document pairs for information retrieval by introducing InPars-v2, which uses open-source large language models to generate queries and existing rerankers to select the resulting query-document pairs for training. A BM25 retrieval pipeline followed by a monoT5 reranker finetuned on this data achieves new state-of-the-art results on the BEIR benchmark.
Recently, InPars introduced a method to efficiently use large language models (LLMs) in information retrieval tasks: via few-shot examples, an LLM is induced to generate relevant queries for documents. These synthetic query-document pairs can then be used to train a retriever. However, InPars and, more recently, Promptagator, rely on proprietary LLMs such as GPT-3 and FLAN to generate such datasets. In this work we introduce InPars-v2, a dataset generator that uses open-source LLMs and existing powerful rerankers to select synthetic query-document pairs for training. A simple BM25 retrieval pipeline followed by a monoT5 reranker finetuned on InPars-v2 data achieves new state-of-the-art results on the BEIR benchmark. To allow researchers to further improve our method, we open source the code, synthetic data, and finetuned models: https://github.com/zetaalphavector/inPars/tree/master/tpu
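The two steps described above, few-shot prompting to generate a query for a document and reranker-based selection of the resulting pairs, can be sketched roughly as follows. This is a minimal illustration, not the released implementation: the prompt template, the function names, the top-k selection rule, and the toy word-overlap scorer (standing in for a real reranker such as monoT5) are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (query, document)


def build_few_shot_prompt(examples: List[Tuple[str, str]], document: str) -> str:
    """Build a few-shot prompt from (document, query) examples.

    The template is hypothetical; it only shows the general shape of
    inducing an LLM to write a relevant query for a new document.
    """
    parts = [f"Document: {doc}\nRelevant query: {query}\n" for doc, query in examples]
    parts.append(f"Document: {document}\nRelevant query:")
    return "\n".join(parts)


def select_pairs(
    pairs: List[Pair],
    score_fn: Callable[[str, str], float],
    keep: int,
) -> List[Pair]:
    """Keep the `keep` highest-scoring synthetic pairs.

    `score_fn` stands in for a reranker's relevance score. Top-k
    filtering is an assumed selection rule for this sketch.
    """
    scored = [(score_fn(q, d), q, d) for q, d in pairs]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(q, d) for _, q, d in scored[:keep]]


def toy_overlap_score(query: str, document: str) -> float:
    """Toy stand-in for a reranker: fraction of query words found in the document."""
    q_words = set(query.lower().split())
    d_words = set(document.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


if __name__ == "__main__":
    doc = "BM25 is a ranking function"
    synthetic = [("what is bm25", doc), ("best pizza", doc), ("ranking function", doc)]
    print(build_few_shot_prompt([("An example document.", "example query")], doc))
    print(select_pairs(synthetic, toy_overlap_score, keep=2))
```

In practice the prompt would be sent to an open-source LLM and `score_fn` would wrap a trained reranker; the selected pairs would then serve as finetuning data for the reranker applied after BM25 retrieval.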