IR · AI · CV · MM · May 12

Very Efficient Listwise Multimodal Reranking for Long Documents

arXiv:2605.11864 · 32.8 · Has Code
Predicted impact: top 11% in IR · last 90 days · Originality: Highly original
AI Analysis

For practitioners building multimodal retrieval-augmented generation over long documents, ZipRerank offers a practical way to cut reranking latency without sacrificing accuracy.

ZipRerank achieves state-of-the-art reranking accuracy on the MMDocIR benchmark while reducing LLM inference latency by up to an order of magnitude, addressing computational bottlenecks in multimodal reranking for long documents.

Listwise reranking is a key yet computationally expensive component in vision-centric retrieval and multimodal retrieval-augmented generation (M-RAG) over long documents. While recent VLM-based rerankers achieve strong accuracy, their practicality is often limited by long visual-token sequences and multi-step autoregressive decoding. We propose ZipRerank, a highly efficient listwise multimodal reranker that directly addresses both bottlenecks. It reduces input length via a lightweight query-image early interaction mechanism and eliminates autoregressive decoding by scoring all candidates in a single forward pass. To enable effective learning, ZipRerank adopts a two-stage training strategy: (i) listwise pretraining on large-scale text data rendered as images, and (ii) multimodal finetuning with VLM-teacher-distilled soft-ranking supervision. Extensive experiments on the MMDocIR benchmark show that ZipRerank matches or surpasses state-of-the-art multimodal rerankers while reducing LLM inference latency by up to an order of magnitude, making it well-suited for latency-sensitive real-world systems. The code is available at https://github.com/dukesun99/ZipRerank.
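The two ideas in the abstract that lend themselves to a small illustration are single-pass scoring (one score per candidate, then a sort, with no autoregressive decoding) and soft-ranking distillation from a teacher. The following is a minimal sketch and not the authors' implementation: the function names, the softmax temperature, and the choice of a KL-divergence loss between teacher and student score distributions are all assumptions made for illustration. The scores are taken as given; how ZipRerank actually produces them is not shown here.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_distill_loss(student_scores, teacher_scores, temperature=1.0):
    """KL(teacher || student) over one candidate list.

    Assumption: the teacher's soft ranking is represented as a softmax
    over its per-candidate scores. The paper may use a different target.
    """
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def rerank(candidate_ids, scores):
    """Order candidates by score, highest first.

    All scores are assumed to come from a single forward pass, so
    ranking reduces to one sort rather than step-by-step decoding.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [candidate_ids[i] for i in order]


if __name__ == "__main__":
    pages = ["page_3", "page_7", "page_12"]  # hypothetical candidate IDs
    student = [0.1, 2.0, -1.0]               # hypothetical student scores
    teacher = [0.5, 1.5, -0.5]               # hypothetical teacher scores
    print(rerank(pages, student))
    print(listwise_distill_loss(student, teacher))
```

The loss is zero when student and teacher induce the same distribution over candidates and grows as they diverge, which is one common way to pass a teacher's graded preferences to a smaller model rather than only its top-1 choice.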

