SE, AI | Feb 28

Theory of Code Space: Do Code Agents Understand Software Architecture?

Grigory Sapunov

arXiv:2603.00601v1 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of evaluating architectural understanding in AI code agents, which matters for advancing software engineering automation. As a benchmarking framework, it appears to be an incremental contribution.

The paper addresses the difficulty AI code agents have with complex, multi-file software engineering. It introduces Theory of Code Space (ToCS), a benchmark that evaluates agents' ability to build and maintain architectural beliefs in procedurally generated codebases under partial observability. In preliminary experiments, F1 scores range from 0.129 to 0.646 across methods.

AI code agents excel at isolated tasks yet struggle with complex, multi-file software engineering requiring understanding of how dozens of modules relate. We hypothesize these failures stem from inability to construct, maintain, and update coherent architectural beliefs during codebase exploration. We introduce Theory of Code Space (ToCS), a benchmark that evaluates this capability by placing agents in procedurally generated codebases under partial observability, requiring them to build structured belief states over module dependencies, cross-cutting invariants, and design intent. The framework features: (1) a procedural codebase generator producing medium-complexity Python projects with four typed edge categories reflecting different discovery methods -- from syntactic imports to config-driven dynamic wiring -- with planted architectural constraints and verified ground truth; (2) a partial observability harness where agents explore under a budget; and (3) periodic belief probing via structured JSON, producing a time-series of architectural understanding. We decompose the Active-Passive Gap from spatial reasoning benchmarks into selection and decision components, and introduce Architectural Constraint Discovery as a code-specific evaluation dimension. Preliminary experiments with four rule-based baselines and five frontier LLM agents from three providers validate discriminative power: methods span a wide performance range (F1 from 0.129 to 0.646), LLM agents discover semantic edge types invisible to all baselines, yet weaker models score below simple heuristics -- revealing that belief externalization, faithfully serializing internal understanding into structured JSON, is itself a non-trivial capability and a first-order confounder in belief-probing benchmarks. Open-source toolkit: https://github.com/che-shr-cat/tocs
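To make the belief-probing idea concrete, here is a minimal sketch of how a JSON belief probe might be scored against ground-truth dependency edges with F1. It is an illustration under stated assumptions, not the ToCS toolkit's implementation: the JSON schema (`edges`, `source`, `target`, `type`), the edge-type labels `import` and `config`, the module names, and the helper functions `edge_set` and `edge_f1` are all hypothetical.

```python
import json


def edge_set(belief_json):
    """Parse a belief probe into a set of (source, target, type) edges.

    The schema is an assumption made for this sketch. Malformed JSON yields
    an empty set, so a failure to externalize beliefs scores as no beliefs.
    """
    try:
        belief = json.loads(belief_json)
    except json.JSONDecodeError:
        return set()
    if not isinstance(belief, dict):
        return set()
    return {
        (e["source"], e["target"], e["type"])
        for e in belief.get("edges", [])
    }


def edge_f1(predicted, truth):
    """Harmonic mean of precision and recall over exact edge matches."""
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground truth for a three-edge toy codebase.
truth = {
    ("app", "db", "import"),
    ("app", "cache", "config"),
    ("db", "utils", "import"),
}

# Hypothetical probe: one correct edge, one spurious edge.
probe = json.dumps({
    "edges": [
        {"source": "app", "target": "db", "type": "import"},
        {"source": "app", "target": "utils", "type": "import"},
    ]
})

score = edge_f1(edge_set(probe), truth)       # precision 1/2, recall 1/3
broken = edge_f1(edge_set("not json"), truth)  # unparseable probe
```

In this sketch an unparseable probe scores zero regardless of what the agent internally understood, which is one way the belief-externalization confounder described in the abstract could show up in a score.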
