LG · Feb 5

BLITZRANK: Principled Zero-shot Ranking Agents with Tournament Graphs

Sheshansh Agrawal, Thien Hang Nguyen, Douwe Kiela

arXiv:2602.05448v21.4 h-index: 23

Originality: Highly original

AI Analysis

This work addresses the challenge of efficient ranking in applications like LLM-based document reranking and crowdsourced evaluation, offering a principled solution with significant token savings.

The paper tackles the problem of selecting top items via expensive multi-item comparisons by introducing a tournament graph framework that aggregates the pairwise preferences revealed by each comparison and infers further orderings without additional queries. The method achieves Pareto dominance in LLM reranking, matching or exceeding accuracy with 25-40% fewer tokens than comparable approaches and $7\times$ fewer than pairwise reranking at near-identical quality.

Selecting the top $m$ from $n$ items via expensive $k$-wise comparisons is fundamental to settings ranging from LLM-based document reranking to crowdsourced evaluation and tournament design. Existing methods either rely on heuristics that fail to fully exploit the information each comparison reveals, or are inefficient when they do. We introduce a tournament graph framework that provides a principled foundation for $k$-wise ranking. Our key observation is that each $k$-item comparison reveals a complete tournament of $\binom{k}{2}$ pairwise preferences; aggregating these into a global preference graph and computing its transitive closure yields many additional orderings without further oracle calls. We formalize when an item's rank is certifiably determined and design a greedy query schedule that maximizes information gain towards identifying the top-$m$ items. The framework also gracefully handles non-transitive preferences (cycles induced by real-world oracles) by collapsing them into equivalence classes that yield principled tiered rankings. Applied to LLM reranking across 14 benchmarks and 5 models, our method achieves Pareto dominance over existing approaches: matching or exceeding accuracy while requiring 25-40% fewer tokens than comparable methods, and $7\times$ fewer than pairwise reranking at near-identical quality.
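To make the key observation concrete, here is a minimal Python sketch of how one $k$-wise comparison expands into $\binom{k}{2}$ pairwise edges and how a transitive closure over the aggregated graph yields orderings that were never queried directly. It is an illustration under stated assumptions, not the paper's implementation: the function names (`add_comparison`, `transitive_closure`, `certified_ranks`) are hypothetical, the certification rule used here (an item is comparable to every other item) is one plausible reading of "certifiably determined", and the sketch assumes acyclic preferences, so it does not cover the cycle-collapsing step or the greedy query schedule.

```python
from itertools import combinations

def add_comparison(edges, ranked):
    """Record one k-wise comparison; `ranked` lists the k items best-first.

    Every ordered pair (winner, loser) implied by the ranking becomes an edge,
    so a k-item comparison contributes k*(k-1)/2 pairwise preferences.
    """
    for winner, loser in combinations(ranked, 2):
        edges.add((winner, loser))

def transitive_closure(items, edges):
    """Return beats[a]: the set of items a beats directly or transitively."""
    beats = {a: set() for a in items}
    for winner, loser in edges:
        beats[winner].add(loser)
    # Warshall-style propagation: if a beats k, a also beats everything k beats.
    for k in items:
        for a in items:
            if k in beats[a]:
                beats[a] |= beats[k]
    return beats

def certified_ranks(items, beats):
    """Assumed rule: a rank is certified once the item is comparable to all others.

    Only valid for acyclic preference graphs (an assumption of this sketch).
    """
    n = len(items)
    ranks = {}
    for x in items:
        above = sum(1 for y in items if x in beats[y])  # items known to beat x
        below = len(beats[x])                           # items x is known to beat
        if above + below == n - 1:
            ranks[x] = above + 1
    return ranks

# Two 3-wise comparisons over five items that share the item "c".
items = ["a", "b", "c", "d", "e"]
edges = set()
add_comparison(edges, ["a", "b", "c"])
add_comparison(edges, ["c", "d", "e"])

beats = transitive_closure(items, edges)
ranks = certified_ranks(items, beats)
```

In this toy case the two comparisons contribute 6 direct edges, while the closure contains all 10 ordered pairs, so every item's rank is fixed without a third oracle call. The shared item "c" is what links the two comparisons; without such overlap the closure would add nothing.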

