SE AI | Jan 7

From Laboratory to Real-World Applications: Benchmarking Agentic Code Reasoning at the Repository Level

arXiv:2601.03731v1 | 7.24 citations | h-index: 21

Originality: Incremental advance

AI Analysis

This work addresses the need for better benchmarks in agentic software engineering, providing granular insights for optimizing autonomous code agents, though it is incremental as it builds on existing evaluation methods.

The paper tackles the problem of evaluating repository-level reasoning in LLM-based agents by introducing RepoReason, a white-box diagnostic benchmark built on execution-driven mutation and dynamic program slicing. Its evaluation finds that integration width is the primary bottleneck, with models showing a prevalent aggregation deficit.

As large language models (LLMs) evolve into autonomous agents, evaluating repository-level reasoning, the ability to maintain logical consistency across massive, real-world, interdependent file systems, has become critical. Current benchmarks typically fluctuate between isolated code snippets and black-box evaluations. We present RepoReason, a white-box diagnostic benchmark centered on abductive assertion verification. To eliminate memorization while preserving authentic logical depth, we implement an execution-driven mutation framework that utilizes the environment as a semantic oracle to regenerate ground-truth states. Furthermore, we establish a fine-grained diagnostic system using dynamic program slicing, quantifying reasoning via three orthogonal metrics: $ESV$ (reading load), $MCL$ (simulation depth), and $DFI$ (integration width). Comprehensive evaluations of frontier models (e.g., Claude-4.5-Sonnet, DeepSeek-v3.1-Terminus) reveal a prevalent aggregation deficit, where integration width serves as the primary cognitive bottleneck. Our findings provide granular white-box insights for optimizing the next generation of agentic software engineering.
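The execution-driven mutation idea in the abstract can be illustrated with a minimal Python sketch. This is not RepoReason's implementation. The function names (`apply_discount`, `mutate_inputs`, `regenerate_assertion`), the mutation strategy, and the `<MASK>` placeholder are all assumptions made for illustration. The sketch shows only the general pattern: perturb a test's inputs, then run the real code so that execution, rather than a stored answer, supplies the new ground truth.

```python
import random


def apply_discount(prices, rate):
    """Toy stand-in for repository code under test (hypothetical)."""
    return round(sum(prices) * (1 - rate), 2)


def mutate_inputs(prices, rate, rng):
    """Perturb the inputs so a memorized expected value no longer applies."""
    new_prices = [p + rng.randint(1, 9) for p in prices]
    new_rate = rng.choice([0.1, 0.2, 0.25])
    return new_prices, new_rate


def regenerate_assertion(prices, rate, seed=0):
    """Build a masked assertion whose answer comes from actually running the code."""
    rng = random.Random(seed)
    new_prices, new_rate = mutate_inputs(prices, rate, rng)
    # The execution environment acts as the oracle for the mutated inputs.
    expected = apply_discount(new_prices, new_rate)
    question = f"assert apply_discount({new_prices!r}, {new_rate!r}) == <MASK>"
    return {"inputs": (new_prices, new_rate), "question": question, "expected": expected}


if __name__ == "__main__":
    task = regenerate_assertion([10, 20], 0.5, seed=1)
    print(task["question"])
    print("ground truth:", task["expected"])
```

A model under evaluation would see only the `question` string and the source code, and would have to predict the masked value. Because the expected value is recomputed after every mutation, the assertion stays consistent with the code even though its concrete values differ from anything in a public test suite.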
