IR · CL · Apr 30

A Reproducibility Study of LLM-Based Query Reformulation

arXiv:2604.27421 · 14.4 · Has Code
Predicted impact: top 50% in IR · last 90 days · Originality: Synthesis-oriented
AI Analysis

For researchers in information retrieval, this work clarifies the reproducibility and limits of reported gains from LLM-based query reformulation, showing that many findings are not robust across different retrieval paradigms.

This paper systematically reproduces and compares ten LLM-based query reformulation methods under a unified framework. It finds that gains depend strongly on the retrieval paradigm, do not consistently transfer from lexical to neural retrievers, and do not uniformly improve with larger LLMs.

Large Language Models (LLMs) are now widely used for query reformulation and expansion in Information Retrieval, with many studies reporting substantial effectiveness gains. However, these results are typically obtained under heterogeneous experimental conditions, making it difficult to assess which findings are reproducible and which depend on specific implementation choices. In this work, we present a systematic reproducibility and comparative study of ten representative LLM-based query reformulation methods under a unified and strictly controlled experimental framework. We evaluate methods across two architectural LLM families at two parameter scales, three retrieval paradigms (lexical, learned sparse, and dense), and nine benchmark datasets spanning TREC Deep Learning and BEIR. Our results show that reformulation gains are strongly conditioned on the retrieval paradigm, that improvements observed under lexical retrieval do not consistently transfer to neural retrievers, and that larger LLMs do not uniformly yield better downstream performance. These findings clarify the stability and limits of reported gains in prior work. To enable transparent replication and ongoing comparison, we release all prompts, configurations, evaluation scripts, and run files through QueryGym, an open-source reformulation toolkit with a public leaderboard (https://leaderboard.querygym.com).

