IR · LG · Feb 10

The Wisdom of Many Queries: Complexity-Diversity Principle for Dense Retriever Training

arXiv:2602.09448v1 · 1 citation · h-index: 8
AI Analysis

This work addresses a specific problem in dense retrieval for multi-hop tasks, providing incremental insights with practical guidelines for synthetic data generation.

The paper tackles conflicting results on query diversity in dense retrieval by introducing metrics that measure its impact. It finds that diversity benefits multi-hop retrieval and that this benefit correlates strongly with query complexity, which it formalizes as the Complexity-Diversity Principle with actionable thresholds.

Prior work reports conflicting results on query diversity in synthetic data generation for dense retrieval. We identify this conflict and design Q-D metrics to quantify diversity's impact, making the problem measurable. Through experiments on 4 benchmark types (31 datasets), we find query diversity especially benefits multi-hop retrieval. Deep analysis on multi-hop data reveals that diversity benefit correlates strongly with query complexity ($r \geq 0.95$, $p < 0.05$ in 12/14 conditions), measured by content words (CW). We formalize this as the Complexity-Diversity Principle (CDP): query complexity determines optimal diversity. CDP provides actionable thresholds (CW $> 10$: use diversity; CW $< 7$: avoid it). Guided by CDP, we propose zero-shot multi-query synthesis for multi-hop tasks, achieving state-of-the-art performance.
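
The CDP thresholds above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how content words are counted, so the stopword list, the tokenizer, and the names `content_word_count` and `recommend` are all assumptions made for illustration. The abstract also gives no guidance for CW between 7 and 10, so the sketch reports that range as undetermined.

```python
import re

# Assumption: a small illustrative stopword list. The abstract does not
# define "content word", so this set is hypothetical.
STOPWORDS = {
    "a", "an", "the", "of", "in", "on", "at", "to", "for", "by", "with",
    "from", "and", "or", "but", "is", "are", "was", "were", "be", "been",
    "do", "does", "did", "has", "have", "had", "who", "whom", "whose",
    "what", "which", "when", "where", "why", "how", "that", "this",
    "these", "those", "it", "its", "as", "before", "after",
}


def content_word_count(query: str) -> int:
    """Count tokens that are not in the (assumed) stopword list."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return sum(1 for token in tokens if token not in STOPWORDS)


def recommend(cw: int) -> str:
    """Apply the thresholds stated in the abstract to a CW count."""
    if cw > 10:
        return "use diversity"
    if cw < 7:
        return "avoid diversity"
    # The abstract states no rule for 7 <= CW <= 10.
    return "undetermined"


if __name__ == "__main__":
    for q in (
        "who wrote hamlet",
        "Which university did the physicist who developed the theory "
        "describing quantum electrodynamics attend before joining the "
        "Manhattan Project laboratory in New Mexico",
    ):
        cw = content_word_count(q)
        print(f"CW={cw}: {recommend(cw)}")
```

Under these assumptions, a short single-hop question falls below the lower threshold, while a long multi-hop question exceeds the upper one. A different definition of content words would shift the counts.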

