Practical Code RAG at Scale: Task-Aware Retrieval Design Choices under Compute Budgets
This work addresses practical implementation challenges for code-oriented RAG systems, providing incremental recommendations based on task requirements and computational efficiency.
The study examines retrieval design for code generation tasks under compute budgets. It finds that sparse BM25 with word-level splitting is the most effective and practical choice for PL-PL tasks, while proprietary dense encoders perform better for NL-PL tasks at much higher latency. The optimal chunk size depends on the available context.
We study retrieval design for code-focused generation tasks under realistic compute budgets. Using two complementary tasks from Long Code Arena -- code completion and bug localization -- we systematically compare retrieval configurations across various context window sizes along three axes: (i) chunking strategy, (ii) similarity scoring, and (iii) splitting granularity. Our main findings are: (1) For PL-PL, sparse BM25 with word-level splitting is the most effective and practical, significantly outperforming dense alternatives while being an order of magnitude faster. (2) For NL-PL, proprietary dense encoders (Voyager-3 family) consistently beat sparse retrievers, but at 100x higher latency. (3) Optimal chunk size scales with available context: 32-64 line chunks work best at small budgets, and whole-file retrieval becomes competitive at 16,000 tokens. (4) Simple line-based chunking matches syntax-aware splitting across budgets. (5) Retrieval latency varies by up to 200x across configurations; BPE-based splitting is needlessly slow, and BM25 with word splitting offers the best quality-latency trade-off. Thus, we provide evidence-based recommendations for implementing effective code-oriented RAG systems based on task requirements, model constraints, and computational efficiency.
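To make the recommended configuration concrete, the following is a minimal sketch of line-based chunking combined with BM25 scoring over word-level tokens. It is not the authors' implementation: the function names (`line_chunks`, `word_split`, `bm25_rank`), the regular expression used for word splitting, and the BM25 parameters `k1=1.2` and `b=0.75` are illustrative assumptions, and the scoring formula is the common Okapi BM25 variant with a smoothed IDF.

```python
import math
import re
from collections import Counter


def line_chunks(text, chunk_lines=32):
    """Split a file into consecutive chunks of at most `chunk_lines` lines."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + chunk_lines])
            for i in range(0, len(lines), chunk_lines)]


def word_split(text):
    """Word-level splitting (assumed rule): lowercase alphanumeric runs.

    Underscores and punctuation act as separators, so `parse_config`
    becomes ["parse", "config"].
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, chunks, k1=1.2, b=0.75):
    """Return chunk indices sorted by descending BM25 score for `query`."""
    docs = [word_split(c) for c in chunks]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n if n else 0.0
    # Document frequency: number of chunks containing each term.
    df = Counter(t for d in docs for t in set(d))
    q_terms = set(word_split(query))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in q_terms:
            if t not in tf:
                continue
            # Smoothed IDF, always non-negative.
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            # Term-frequency saturation with length normalization.
            norm = tf[t] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    source = (
        "def parse_config(path):\n"
        "    return open(path).read()\n"
        "def add(a, b):\n"
        "    return a + b\n"
    )
    chunks = line_chunks(source, chunk_lines=2)
    best = bm25_rank("config path", chunks)[0]
    print(chunks[best])
```

In a PL-PL setting such as code completion, the query would be the code preceding the completion point rather than a natural-language string; the `chunk_lines` value would be chosen according to the available context budget, in line with finding (3).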