IR, AI, CL, LG | Oct 2, 2025

Cluster-based Adaptive Retrieval: Dynamic Context Selection for RAG Applications

arXiv:2511.14769v1 | 1 citation | h-index: 1
Originality: Highly original
AI Analysis

The paper addresses the inefficiency of fixed retrieval depths in RAG applications and offers significant performance improvements for systems such as virtual assistants, though it is an incremental advance over existing adaptive methods.

The paper tackles static top-k retrieval in Retrieval-Augmented Generation (RAG) by introducing Cluster-based Adaptive Retrieval (CAR), which selects the number of documents to retrieve based on query complexity. The authors report a 60% reduction in LLM token usage, 22% lower latency, 10% fewer hallucinations, and a 200% increase in user engagement.

Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by pulling in external material (documents, code, manuals) from vast and ever-growing corpora to answer user queries effectively. The effectiveness of RAG depends significantly on aligning the number of retrieved documents with query characteristics: narrowly focused queries typically require fewer, highly relevant documents, whereas broader or ambiguous queries benefit from more extensive supporting information. However, the common static top-k retrieval approach fails to adapt to this variability, resulting in either insufficient context from too few documents or redundant information from too many. Motivated by these challenges, we introduce Cluster-based Adaptive Retrieval (CAR), an algorithm that dynamically determines the optimal number of documents by analyzing the clustering patterns of ordered query-document similarity distances. CAR detects the transition point within the similarity distances, where tightly clustered, highly relevant documents give way to less pertinent candidates, and uses it as an adaptive cut-off that scales with query complexity. On Coinbase's CDP corpus and the public MultiHop-RAG benchmark, CAR consistently picks the optimal retrieval depth and achieves the highest TES score, outperforming every fixed top-k baseline. In downstream RAG evaluations, CAR cuts LLM token usage by 60%, trims end-to-end latency by 22%, and reduces hallucinations by 10% while fully preserving answer relevance. Since integrating CAR into Coinbase's virtual assistant, we've seen user engagement jump by 200%.
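
To make the idea of an adaptive cut-off concrete, here is a minimal sketch in Python. It is not the paper's CAR algorithm, whose clustering procedure the abstract does not specify. As a stand-in, it assumes the transition point can be approximated by the largest gap between consecutive sorted query-document distances. The function name `adaptive_cutoff` and the parameters `min_k` and `max_k` are hypothetical.

```python
from typing import Optional, Sequence


def adaptive_cutoff(
    distances: Sequence[float],
    min_k: int = 1,
    max_k: Optional[int] = None,
) -> int:
    """Return how many of the closest documents to keep.

    Assumption (not from the paper): the boundary between the tight,
    relevant cluster and the remaining candidates is taken to be the
    largest jump between consecutive sorted distances.
    """
    ordered = sorted(distances)  # smallest distance = most similar
    n = len(ordered)
    if n == 0:
        return 0

    upper = n if max_k is None else max(1, min(max_k, n))
    lower = max(1, min(min_k, upper))

    best_k = lower
    best_gap = -1.0
    # Keeping k documents places the cut between ordered[k - 1] and ordered[k],
    # so k can be at most n - 1 for a gap to exist.
    for k in range(lower, min(upper, n - 1) + 1):
        gap = ordered[k] - ordered[k - 1]
        if gap > best_gap:
            best_gap = gap
            best_k = k
    return best_k


# A focused query: three close matches, then a clear jump in distance.
focused = [0.10, 0.12, 0.13, 0.45, 0.50, 0.52]
# A broader query: five comparably close matches before the jump.
broad = [0.20, 0.21, 0.22, 0.23, 0.24, 0.60]

print(adaptive_cutoff(focused))  # keeps 3 documents
print(adaptive_cutoff(broad))    # keeps 5 documents
```

A fixed top-k of, say, 4 would pass one weakly related document to the LLM for the focused query and drop a relevant one for the broad query. A cut-off derived from the distance profile avoids both cases in this toy setting.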

