CL Nov 14, 2023

It's All Relative! -- A Synthetic Query Generation Approach for Improving Zero-Shot Relevance Prediction

Aditi Chaudhary, Karthik Raman, Michael Bendersky

arXiv:2311.07930v1 11.134 citations h-index: 12

Originality: Incremental advance

AI Analysis

This work addresses the challenge of building better information retrieval models in zero-shot settings where training data is scarce, offering an incremental improvement over existing synthetic query generation methods.

The paper tackles the problem of generating synthetic query-document pairs for improving zero-shot relevance prediction. It proposes generating queries for different relevance labels simultaneously, which reduces the reasoning burden on large language models. The results show that this approach leads to better downstream performance across seven IR datasets, suggesting that the synthetic queries are of higher quality.

Recent developments in large language models (LLMs) have shown promise in their ability to generate synthetic query-document pairs by prompting with as few as 8 demonstrations. This has enabled building better IR models, especially for tasks with no training data readily available. Typically, such synthetic query generation (QGen) approaches condition on an input context (e.g. a text document) and generate a query relevant to that context, or condition the QGen model additionally on the relevance label (e.g. relevant vs irrelevant) to generate queries across relevance buckets. However, we find that such QGen approaches are sub-optimal as they require the model to reason about the desired label and the input from a handful of examples. In this work, we propose to reduce this burden on LLMs by generating queries simultaneously for different labels. We hypothesize that instead of asking the model to generate, say, an irrelevant query given an input context, asking the model to generate an irrelevant query relative to a relevant query is a much simpler task setup for the model to reason about. Extensive experimentation across seven IR datasets shows that synthetic queries generated in such a fashion translate to better downstream performance, suggesting that the generated queries are indeed of higher quality.
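
To make the "relative" setup in the abstract concrete, here is a minimal sketch of a few-shot prompt that asks for a relevant and an irrelevant query in a single generation, so the irrelevant query can be phrased relative to the relevant one. The prompt wording, the field labels, and the names `Demo`, `build_pairwise_prompt`, and `parse_pair` are all assumptions made for illustration; they are not the prompt format or code used in the paper, and no LLM call is made here.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    """One few-shot demonstration: a document with a query for each label."""
    document: str
    relevant_query: str
    irrelevant_query: str


def build_pairwise_prompt(demos, document):
    """Build a few-shot prompt that requests both labels in one generation.

    The layout below is a hypothetical format, not the paper's prompt.
    """
    blocks = []
    for demo in demos:
        blocks.append(
            f"Document: {demo.document}\n"
            f"Relevant query: {demo.relevant_query}\n"
            f"Irrelevant query: {demo.irrelevant_query}\n"
        )
    # The last block is left open: the model writes the relevant query first,
    # then the irrelevant query, which it can produce relative to the first.
    blocks.append(f"Document: {document}\nRelevant query:")
    return "\n".join(blocks)


def parse_pair(completion):
    """Split a model completion into (relevant_query, irrelevant_query)."""
    relevant, sep, irrelevant = completion.partition("Irrelevant query:")
    relevant_lines = relevant.strip().splitlines()
    irrelevant_lines = irrelevant.strip().splitlines()
    if not sep or not relevant_lines or not irrelevant_lines:
        raise ValueError("completion does not contain both queries")
    return relevant_lines[0].strip(), irrelevant_lines[0].strip()


demos = [
    Demo(
        document="Aspirin reduces fever and mild pain.",
        relevant_query="what does aspirin treat",
        irrelevant_query="aspirin factory locations",
    )
]
prompt = build_pairwise_prompt(demos, "Python lists are mutable sequences.")

# A hand-written stand-in for what a model might return for the open block.
completion = " are python lists mutable\nIrrelevant query: python snake habitat\n"
pair = parse_pair(completion)
```

In a label-conditioned setup, each label would instead get its own prompt and its own generation; the sketch differs only in that both queries come out of one completion, which is the property the abstract argues makes the task easier for the model.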
