CL · Jun 18, 2025

TopClustRAG at SIGIR 2025 LiveRAG Challenge

arXiv:2506.15246v1 · 1 citation · h-index: 7
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of generating accurate, faithful answers from web-scale data in question answering systems. The contribution appears incremental: it builds on existing RAG approaches and adds clustering enhancements.

The authors tackle question answering over large-scale web corpora with TopClustRAG, a retrieval-augmented generation system that uses clustering to filter and aggregate retrieved passages. Evaluated on the FineWeb Sample-10BT dataset, the system ranked 2nd in faithfulness and 7th in correctness in the LiveRAG Challenge.

We present TopClustRAG, a retrieval-augmented generation (RAG) system developed for the LiveRAG Challenge, which evaluates end-to-end question answering over large-scale web corpora. Our system employs a hybrid retrieval strategy combining sparse and dense indices, followed by K-Means clustering to group semantically similar passages. Representative passages from each cluster are used to construct cluster-specific prompts for a large language model (LLM), generating intermediate answers that are filtered, reranked, and finally synthesized into a single, comprehensive response. This multi-stage pipeline enhances answer diversity, relevance, and faithfulness to retrieved evidence. Evaluated on the FineWeb Sample-10BT dataset, TopClustRAG ranked 2nd in faithfulness and 7th in correctness on the official leaderboard, demonstrating the effectiveness of clustering-based context filtering and prompt aggregation in large-scale RAG systems.
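
The clustering and prompt-construction stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each retrieved passage already carries a hybrid retrieval score and an embedding, and it assumes that "representative" means the highest-scoring passages in each cluster. The helper names (kmeans, top_per_cluster, build_prompts), the prompt template, and the toy passages are all hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20, seed=0):
    """Plain K-Means; returns one cluster label per input vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each vector joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def top_per_cluster(passages, labels, per_cluster=1):
    """Keep the highest-scoring passages of each cluster (assumed rule)."""
    clusters = {}
    for passage, lab in zip(passages, labels):
        clusters.setdefault(lab, []).append(passage)
    return {
        lab: sorted(group, key=lambda p: p["score"], reverse=True)[:per_cluster]
        for lab, group in clusters.items()
    }


def build_prompts(question, representatives):
    """One prompt per cluster, built only from that cluster's passages."""
    prompts = []
    for lab in sorted(representatives):
        context = "\n".join(p["text"] for p in representatives[lab])
        prompts.append(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    return prompts


# Toy passages (hypothetical): two topics, 2-d embeddings, hybrid scores.
passages = [
    {"id": "a1", "text": "Toy passage A1.", "score": 0.70, "emb": [1.00, 0.00]},
    {"id": "a2", "text": "Toy passage A2.", "score": 0.90, "emb": [0.90, 0.10]},
    {"id": "a3", "text": "Toy passage A3.", "score": 0.40, "emb": [0.95, 0.05]},
    {"id": "b1", "text": "Toy passage B1.", "score": 0.60, "emb": [0.00, 1.00]},
    {"id": "b2", "text": "Toy passage B2.", "score": 0.50, "emb": [0.10, 0.90]},
    {"id": "b3", "text": "Toy passage B3.", "score": 0.80, "emb": [0.05, 0.95]},
]

labels = kmeans([p["emb"] for p in passages], k=2)
representatives = top_per_cluster(passages, labels)
prompts = build_prompts("A toy question?", representatives)
```

In the pipeline the abstract describes, each of these cluster-specific prompts would be sent to the LLM, and the intermediate answers would then be filtered, reranked, and synthesized into one response. Those later stages are omitted here because the abstract does not specify them in enough detail to sketch.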
