SE · AI · PL · Mar 28

Codebase-Memory: Tree-Sitter-Based Knowledge Graphs for LLM Code Exploration via MCP

arXiv:2603.2727772.42 citations · h-index: 6 · Has Code
Predicted impact top 20% in SE · last 90 days · Originality: Incremental advance
AI Analysis

For LLM coding agents, this system reduces token and tool call overhead while maintaining competitive answer quality, offering a more efficient code exploration method.

Codebase-Memory constructs a persistent, Tree-Sitter-based knowledge graph of a codebase via MCP. Across 31 repositories it achieves 83% answer quality, versus 92% for a file-exploration agent, while using 10x fewer tokens and 2.1x fewer tool calls.

Large Language Model (LLM) coding agents typically explore codebases through repeated file-reading and grep-searching, consuming thousands of tokens per query without structural understanding. We present Codebase-Memory, an open-source system that constructs a persistent, Tree-Sitter-based knowledge graph via the Model Context Protocol (MCP), parsing 66 languages through a multi-phase pipeline with parallel worker pools, call-graph traversal, impact analysis, and community discovery. Evaluated across 31 real-world repositories, Codebase-Memory achieves 83% answer quality versus 92% for a file-exploration agent, at ten times fewer tokens and 2.1 times fewer tool calls. For graph-native queries such as hub detection and caller ranking, it matches or exceeds the explorer on 19 of 31 languages.
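To make the graph-native queries named in the abstract more concrete, here is a minimal Python sketch of impact analysis and caller ranking over a call graph. It is an illustration only, not the Codebase-Memory implementation: the edge list `CALLS`, the function names, and the helpers `build_reverse_index`, `impact_of`, and `rank_hubs` are all hypothetical, and the edges are written by hand rather than extracted with Tree-Sitter.

```python
from collections import defaultdict, deque

# Hypothetical call edges (caller, callee). In a real system these would be
# extracted from parsed source code; here they are written by hand.
CALLS = [
    ("main", "load_config"),
    ("main", "run_server"),
    ("run_server", "handle_request"),
    ("run_server", "log"),
    ("handle_request", "parse_json"),
    ("handle_request", "log"),
    ("load_config", "parse_json"),
    ("load_config", "log"),
]


def build_reverse_index(edges):
    """Map each callee to the set of functions that call it directly."""
    callers = defaultdict(set)
    for caller, callee in edges:
        callers[callee].add(caller)
    return callers


def impact_of(symbol, callers):
    """Impact analysis: every function that transitively calls `symbol`."""
    seen = set()
    queue = deque([symbol])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen


def rank_hubs(callers, top=3):
    """Caller ranking: functions ordered by number of distinct direct callers."""
    ranked = sorted(callers.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(name, len(who)) for name, who in ranked[:top]]


CALLERS = build_reverse_index(CALLS)
```

With the index built once, `impact_of("parse_json", CALLERS)` answers "what could break if this function changes?" in a single lookup-driven traversal, and `rank_hubs(CALLERS)` surfaces the most widely called functions. An agent without such an index would typically need several grep and file-read steps to reach the same answer, which is the overhead the abstract describes.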

