CL SE Mar 3

BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

Guoxin Chen, Fanzhe Meng, Jiale Zhao, Minghao Li, Daixuan Cheng, Huatong Song, Jie Chen, Yuzhi Lin, Hui Chen, Xin Zhao, Ruihua Song, Chang Liu

arXiv:2603.03194v13.08 citations h-index: 6

Originality: Highly original

AI Analysis

This work targets the narrow scope of existing code agent evaluations, giving software developers and researchers a more realistic benchmark and a framework for advancing code agent research.

The authors examine how code agents perform beyond single-repo bug fixing and find that even state-of-the-art models stay below 45% success on these broader tasks. They introduce the BeyondSWE benchmark to measure this gap and the SearchSWE framework to study whether external search helps close it.

Current benchmarks for code agents primarily assess narrow, repository-specific fixes, overlooking critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes - resolution scope and knowledge scope - using 500 real-world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below 45% success, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. Our experiments show that search augmentation yields inconsistent gains and can in some cases degrade performance, highlighting the difficulty of emulating developer-like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.
