SpIDER: Spatially Informed Dense Embedding Retrieval for Software Issue Localization
This work addresses the challenge of improving retrieval accuracy for LLM-based coding agents in large codebases, though it appears incremental: it builds on existing dense embedding methods by adding graph-based features.
The paper tackles the problem of retrieving relevant code units for software issue localization. It proposes SpIDER, an enhanced dense retrieval approach that integrates LLM-based reasoning with graph-based exploration of the codebase, and reports an improvement in dense retrieval performance of at least 13% across the programming languages and benchmarks evaluated.
Retrieving the functions, classes, or files in a large codebase that are relevant to a given user query, bug report, or feature request is a fundamental challenge for Large Language Model (LLM)-based coding agents. Agentic approaches typically employ sparse retrieval methods such as BM25 or dense embedding strategies to identify semantically relevant units. While embedding-based approaches can outperform BM25 by large margins, they often do not account for the underlying graph structure of the codebase. To address this, we propose SpIDER (Spatially Informed Dense Embedding Retrieval), an enhanced dense retrieval approach that integrates LLM-based reasoning with auxiliary information obtained from graph-based exploration of the codebase. We further introduce SpIDER-Bench, a graph-structured evaluation benchmark curated from SWE-PolyBench, SWEBench-Verified and Multi-SWEBench, spanning codebases in Python, Java, JavaScript and TypeScript. Empirical results show that SpIDER consistently improves dense retrieval performance by at least 13% across the programming languages and benchmarks in SpIDER-Bench.
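The abstract does not spell out how the graph information enters the ranking, so the following is only a minimal sketch of the general idea of graph-informed dense retrieval, not SpIDER's actual algorithm. It assumes a toy bag-of-words `embed` function standing in for a real dense encoder, a hypothetical adjacency map `graph` between code units, and a hypothetical mixing weight `alpha`; the LLM-based reasoning step is not modeled at all.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a dense encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, units, graph, k=2, alpha=0.5):
    """Rank code units by embedding similarity plus a neighbor boost.

    units: maps a unit name to its text (hypothetical layout).
    graph: maps a unit name to the names of adjacent units (assumption).
    alpha: weight of the best-scoring neighbor (assumption, not from the paper).
    """
    q = embed(query)
    base = {name: cosine(q, embed(text)) for name, text in units.items()}
    scores = {}
    for name in units:
        # A unit adjacent to a strong match inherits part of that match's score.
        boost = max((base[n] for n in graph.get(name, [])), default=0.0)
        scores[name] = base[name] + alpha * boost
    # Sort by descending score, breaking ties by name for determinism.
    return sorted(scores, key=lambda n: (-scores[n], n))[:k]


units = {
    "parser.parse_date": "parse date string into datetime",
    "parser.tokenize": "split input into tokens",
    "utils.log": "log fails to a file",
}
graph = {
    "parser.parse_date": ["parser.tokenize"],
    "parser.tokenize": ["parser.parse_date"],
}
print(retrieve("date string parse fails", units, graph))
```

In this toy setup, `parser.tokenize` shares no words with the query, so similarity alone would rank it last; with the neighbor boost it can surface because it is adjacent to the strongest match. Setting `alpha=0` reduces the function to plain similarity ranking.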