IR · Apr 3

BridgeRAG: Training-Free Bridge-Conditioned Retrieval for Multi-Hop Question Answering

arXiv:2604.03384

AI Analysis

For multi-hop QA systems, BridgeRAG provides a simple, training-free method to improve retrieval accuracy by conditioning on bridge evidence, outperforming existing training-free approaches.

BridgeRAG introduces a training-free, graph-free retrieval method for multi-hop QA that conditions later-hop retrieval on bridge evidence, achieving state-of-the-art training-free R@5 on MuSiQue (0.8146, +3.1pp vs. PropRAG), 2WikiMultiHopQA (0.9527, +1.2pp), and HotpotQA (0.9875, +1.35pp).

Multi-hop retrieval is not a single-step relevance problem: later-hop evidence should be ranked by its utility conditioned on retrieved bridge evidence, not by similarity to the original query alone. We present BridgeRAG, a training-free, graph-free retrieval method for retrieval-augmented generation (RAG) over multi-hop questions that operationalizes this view with a tripartite scorer s(q,b,c) over (question, bridge, candidate). BridgeRAG separates coverage from scoring: dual-entity ANN expansion broadens the second-hop candidate pool, while a bridge-conditioned LLM judge identifies the active reasoning chain among competing candidates without any offline graph or proposition index. Across four controlled experiments we show that this conditioning signal is (i) selective: +2.55pp on parallel-chain queries (p<0.001) vs. ~0 on single-chain subtypes; (ii) irreplaceable: substituting the retrieved passage with generated SVO query text reduces R@5 by 2.1pp, performing worse than even the lowest-SVO-similarity pool passage; (iii) predictable: cos(b,g2) correlates with per-query gain (Spearman rho=0.104, p<0.001); and (iv) mechanistically precise: bridge conditioning causes productive re-rankings (18.7% flip-win rate on parallel-chain vs. 0.6% on single-chain), not merely more churn. Combined with lightweight coverage expansion and percentile-rank score fusion, BridgeRAG achieves the best published training-free R@5 under matched benchmark evaluation on all three standard MHQA benchmarks without a graph database or any training: 0.8146 on MuSiQue (+3.1pp vs. PropRAG, +6.8pp vs. HippoRAG2), 0.9527 on 2WikiMultiHopQA (+1.2pp vs. PropRAG), and 0.9875 on HotpotQA (+1.35pp vs. PropRAG).
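The abstract names percentile-rank score fusion but does not spell out how it is computed. The following is a minimal sketch of one plausible reading, in which a bridge-conditioned judge score s(q,b,c) and an ANN similarity score are each converted to percentile ranks within the candidate pool and then blended. The function names, the `judge_weight` parameter, and its default of 0.6 are illustrative assumptions, not details taken from the paper.

```python
from typing import Dict, List


def percentile_ranks(scores: Dict[str, float]) -> Dict[str, float]:
    """Map each raw score to the fraction of pool candidates scoring at or below it."""
    values = list(scores.values())
    n = len(values)
    return {cid: sum(v <= s for v in values) / n for cid, s in scores.items()}


def fuse(
    similarity: Dict[str, float],
    judge: Dict[str, float],
    judge_weight: float = 0.6,  # assumed blend weight, not from the paper
) -> Dict[str, float]:
    """Blend two score sources on a common percentile-rank scale.

    `similarity` stands in for query-to-candidate ANN scores and `judge`
    for bridge-conditioned judge scores s(q, b, c); both are keyed by a
    hypothetical candidate id and must cover the same candidates.
    """
    sim_pr = percentile_ranks(similarity)
    judge_pr = percentile_ranks(judge)
    return {
        cid: judge_weight * judge_pr[cid] + (1.0 - judge_weight) * sim_pr[cid]
        for cid in similarity
    }


def top_k(fused: Dict[str, float], k: int) -> List[str]:
    """Return the k best candidate ids, breaking ties by id for determinism."""
    return sorted(fused, key=lambda cid: (-fused[cid], cid))[:k]
```

Ranking on percentiles rather than raw values keeps the two sources comparable even when their scales differ, for example cosine similarities in [0, 1] against judge ratings on a 1 to 10 scale. In this sketch a candidate that is only second by similarity can still be ranked first if the judge strongly prefers it given the bridge passage.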
