LG · SE · May 8

SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution

arXiv:2605.08366 · 83.6
AI Analysis

Provides a complementary evaluation suite for coding agents, addressing underrepresented task categories and emphasizing engineering quality beyond functional correctness.

SWE Atlas introduces a benchmark for coding agents covering Codebase Q&A, Test Writing, and Refactoring tasks, evaluating both correctness and software engineering quality. Top models like GPT-5.4 and Opus 4.7 perform best, but even they struggle with edge cases and best practices.

We introduce SWE Atlas, a benchmark suite for coding agents spanning three professional software engineering workflows: Codebase Q&A (124 tasks), Test Writing (90 tasks), and Refactoring (70 tasks). SWE Atlas differs from prior SWE benchmarks in three key ways: it targets underrepresented but practically important task categories, uses comprehensive category-specific evaluation protocols, and adopts under-specified, agentic task formulations that better reflect real-world usage. Its evaluation framework combines programmatic checks with rubric-based assessment. This goes beyond functional correctness, evaluating software engineering quality, including test and refactor completeness, maintainability, reusable abstractions, and codebase hygiene. We evaluate a range of frontier and open-weight models on SWE Atlas and find that GPT-5.4 and Opus 4.7 achieve the strongest overall performance, while even the best open-weight models score poorly. Our analysis suggests that top models rely on extensive codebase exploration and runtime-driven reasoning. However, even top models consistently struggle with subtle edge cases, complex runtime analysis, and adherence to software engineering best practices. Overall, SWE Atlas provides a complementary evaluation suite for measuring both correctness and engineering quality in coding agents.

