CLIR · Aug 22, 2025

How Good are LLM-based Rerankers? An Empirical Analysis of State-of-the-Art Reranking Models

arXiv:2508.16757v1 · 10 citations · h-index: 10 · Has Code · EMNLP
Originality: Synthesis-oriented
AI Analysis

This work examines the performance and generalization of reranking models, a question relevant to information retrieval practitioners, but it is incremental: it offers an empirical comparison without introducing new methods.

The paper systematically evaluates 22 reranking methods, including LLM-based and lightweight models, on benchmarks such as TREC DL19 and BEIR. It finds that LLM-based rerankers perform better on familiar queries but generalize unevenly to novel queries, while lightweight models offer comparable efficiency.

In this work, we present a systematic and comprehensive empirical evaluation of state-of-the-art reranking methods, encompassing large language model (LLM)-based, lightweight contextual, and zero-shot approaches, with respect to their performance in information retrieval tasks. We evaluate a total of 22 methods, including 40 variants (depending on the LLM used), across several established benchmarks, including TREC DL19, DL20, and BEIR, as well as a novel dataset designed to test queries unseen by pretrained models. Our primary goal is to determine, through controlled and fair comparisons, whether a performance disparity exists between LLM-based rerankers and their lightweight counterparts, particularly on novel queries, and to elucidate the underlying causes of any observed differences. To disentangle confounding factors, we analyze the effects of training data overlap, model architecture, and computational efficiency on reranking performance. Our findings indicate that while LLM-based rerankers demonstrate superior performance on familiar queries, their generalization ability to novel queries varies, with lightweight models offering comparable efficiency. We further identify that the novelty of queries significantly impacts reranking effectiveness, highlighting limitations in existing approaches. Code is available at https://github.com/DataScienceUIBK/llm-reranking-generalization-study
