IR · CL · Jun 2

Can LLM Rerankers Predict Their Own Ranking Performance?

arXiv:2606.03535 · 83.0 · h-index: 23
AI Analysis

For information retrieval researchers, this work offers an approach to query performance prediction that uses internal signals from LLM rerankers: a training-free method that is competitive with the existing SOTA, and two supervised methods that improve calibration.

The paper investigates whether LLM rerankers can predict their own ranking performance, finding that self-consistency across sampled rankings is competitive with state-of-the-art QPP methods and better calibrated, while verbalized confidence is overconfident. Two supervised methods, Verb-Num and Verb-List, improve calibration with minimal overhead.

Retrieval effectiveness varies substantially across queries, making it important to estimate ranking quality before relevance judgments are available. Query performance prediction (QPP) addresses this need, but most existing methods rely on external predictors after retrieval or reranking. In this paper, we study reranker-internal QPP: can an LLM reranker estimate the quality of the ranking it has just produced? We investigate both training-free and training-based approaches. For training-free estimation, we examine metric-specific self-consistency across sampled rankings and verbalized confidence produced directly by the reranker. Experiments on TREC Deep Learning 2019–2022 with four LLMs show that self-consistency is competitive with the state-of-the-art (SOTA) approach and better calibrated in almost all settings, while direct verbalized confidence is severely overconfident. To improve verbalized confidence, we propose two supervised methods, Verb-Num and Verb-List, which enable LLM rerankers to produce calibrated ranking-quality estimates with only a few additional output tokens.
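The abstract does not spell out how metric-specific self-consistency is computed, so the following is only a minimal sketch of one plausible reading, not the authors' formulation. It assumes that the top-k documents of each sampled ranking can serve as binary pseudo-relevance judgments, scores the final ranking against each sample with NDCG@k, and averages the results. The function names `ndcg_at_k` and `self_consistency`, the binary gain, and the default `k=10` are all illustrative assumptions.

```python
import math
from typing import List, Sequence, Set


def ndcg_at_k(ranking: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-gain NDCG@k of `ranking` against a set of (pseudo-)relevant doc ids."""
    # Discounted gain: 1 / log2(rank + 1) for each relevant doc in the top k.
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc_id in enumerate(ranking[:k])
        if doc_id in relevant
    )
    # Ideal DCG: all relevant docs (up to k) placed at the top.
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def self_consistency(
    final_ranking: Sequence[str],
    sampled_rankings: List[Sequence[str]],
    k: int = 10,
) -> float:
    """Assumed estimator: mean NDCG@k of the final ranking, treating the
    top-k of each sampled ranking as pseudo-relevant. Higher values mean
    the reranker's samples agree more with the ranking it returned."""
    if not sampled_rankings:
        raise ValueError("need at least one sampled ranking")
    scores = [
        ndcg_at_k(final_ranking, set(sample[:k]), k)
        for sample in sampled_rankings
    ]
    return sum(scores) / len(scores)
```

Under this assumption, a query whose sampled rankings all reproduce the final top-k scores 1.0, and a query whose samples share no top-k documents with the final ranking scores 0.0. Swapping `ndcg_at_k` for another metric would give the corresponding metric-specific variant.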

