SE | AI | Oct 9, 2025

Repository-Aware File Path Retrieval via Fine-Tuned LLMs

arXiv:2510.08850v1 | 1 citation | h-index: 1
Originality: Incremental advance
AI Analysis

The work addresses a practical problem for developers and AI coding assistants: efficiently locating relevant files in complex codebases. It is an incremental improvement over traditional code search methods.

The paper tackles retrieving relevant source files from a codebase given natural language queries. It fine-tunes an LLM on training data built with repository-aware strategies, achieving up to 91% exact match and 93% recall on held-out queries, and 59% recall on a large codebase such as PyTorch.
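As a rough illustration of how exact match and recall could be scored when each answer is a set of file paths, here is a minimal sketch. The metric definitions (set equality for exact match, per-query recall averaged over queries) and the helper names `normalize`, `exact_match`, `recall`, and `evaluate` are assumptions made for illustration; the paper's own evaluation code may differ.

```python
from typing import Dict, Iterable, List, Set, Tuple


def normalize(paths: Iterable[str]) -> Set[str]:
    """Turn raw path strings into a comparable set (assumed normalization)."""
    cleaned = set()
    for p in paths:
        p = p.strip().replace("\\", "/")
        if p.startswith("./"):
            p = p[2:]
        if p:
            cleaned.add(p)
    return cleaned


def exact_match(pred: Iterable[str], gold: Iterable[str]) -> bool:
    """True only when the predicted set equals the reference set."""
    return normalize(pred) == normalize(gold)


def recall(pred: Iterable[str], gold: Iterable[str]) -> float:
    """Fraction of reference paths that appear in the prediction."""
    gold_set = normalize(gold)
    if not gold_set:
        return 1.0  # nothing to retrieve, so nothing was missed
    return len(normalize(pred) & gold_set) / len(gold_set)


def evaluate(examples: List[Tuple[List[str], List[str]]]) -> Dict[str, float]:
    """Average both metrics over (predicted, reference) pairs."""
    if not examples:
        raise ValueError("no examples to evaluate")
    n = len(examples)
    return {
        "exact_match": sum(exact_match(p, g) for p, g in examples) / n,
        "recall": sum(recall(p, g) for p, g in examples) / n,
    }


if __name__ == "__main__":
    demo = [
        (["src/flask/app.py"], ["src/flask/app.py"]),
        (["a.py"], ["a.py", "b.py"]),
    ]
    print(evaluate(demo))
```

Under these definitions a prediction that finds one of two reference files scores 0.5 recall but fails exact match, which is why the two numbers can diverge.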

Modern codebases make it hard for developers and AI coding assistants to find the right source files when answering questions like "How does this feature work?" or "Where was the bug introduced?" Traditional code search (keyword- or IR-based) often misses semantic context and cross-file links, while large language models (LLMs) understand natural language but lack repository-specific detail. We present a method for file path retrieval that fine-tunes a strong LLM (Qwen3-8B) with QLoRA and Unsloth optimizations to predict relevant file paths directly from a natural language query. To build training data, we introduce six code-aware strategies that use abstract syntax tree (AST) structure and repository content to generate realistic question-answer pairs, where answers are sets of file paths. The strategies range from single-file prompts to hierarchical repository summaries, providing broad coverage. We fine-tune on Python projects including Flask, Click, Jinja, FastAPI, and PyTorch, and obtain high retrieval accuracy: up to 91% exact match and 93% recall on held-out queries, clearly beating single-strategy training. On a large codebase like PyTorch (about 4,000 Python files), the model reaches 59% recall, showing scalability. We analyze how multi-level code signals help the LLM reason over cross-file context and discuss dataset design, limits (for example, context length in very large repos), and future integration of retrieval with LLM-based code intelligence.
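To make the idea of AST-driven question-answer generation more concrete, the sketch below shows what a single-file strategy might look like using Python's standard `ast` module. It is not one of the paper's six strategies, which the abstract does not spell out; the function name `qa_pairs_from_source`, the question templates, and the output dictionary shape are all hypothetical.

```python
import ast
from typing import Dict, List


def qa_pairs_from_source(path: str, source: str) -> List[Dict[str, object]]:
    """Build toy question-answer pairs from one file's top-level definitions.

    Hypothetical single-file strategy: every answer is the one-element
    path set [path].
    """
    tree = ast.parse(source)
    pairs: List[Dict[str, object]] = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        elif isinstance(node, ast.ClassDef):
            kind = "class"
        else:
            continue  # skip imports, assignments, and other statements

        # Structural question derived from the definition's name.
        pairs.append({
            "question": f"Which file defines the {kind} `{node.name}`?",
            "answer": [path],
        })

        # Content question derived from the first docstring line, if any.
        doc = ast.get_docstring(node)
        if doc:
            summary = doc.strip().splitlines()[0]
            pairs.append({
                "question": f"Where is the code described as: {summary}",
                "answer": [path],
            })
    return pairs


if __name__ == "__main__":
    example = (
        "def route(rule):\n"
        '    """Register a URL rule."""\n'
        "    pass\n"
        "\n"
        "class App:\n"
        "    pass\n"
    )
    for pair in qa_pairs_from_source("src/app.py", example):
        print(pair)
```

A multi-file or hierarchical strategy would presumably aggregate such signals across modules so that answers contain several paths, but that step is left out here because the abstract gives no detail on it.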
