LG · May 30, 2025

Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents

Kaivalya Hariharan, Uzay Girit, Atticus Wang, Jacob Andreas

arXiv:2506.00172v1 · 9.42 citations · h-index: 1

Originality: Incremental advance

AI Analysis

The paper addresses the need for scalable and varied benchmarks of system-level reasoning in LLMs, particularly for software engineering and scientific research tasks. The advance is incremental because it builds on existing benchmarking efforts.

The authors address the problem of evaluating long-horizon reasoning in LLM code agents by introducing Breakpoint, a benchmarking methodology that automatically generates code-repair tasks through adversarial corruption of real-world software. Across more than 900 generated tasks, state-of-the-art models' success rates range from 55% on the easiest tasks to 0% on the hardest.

Benchmarks for large language models (LLMs) have predominantly assessed short-horizon, localized reasoning. Existing long-horizon suites (e.g. SWE-bench) rely on manually curated issues, so expanding or tuning difficulty demands expensive human effort and evaluations quickly saturate. However, many real-world tasks, such as software engineering or scientific research, require agents to rapidly comprehend and manipulate novel, complex structures dynamically; evaluating these capabilities requires the ability to construct large and varied sets of problems for agents to solve. We introduce Breakpoint, a benchmarking methodology that automatically generates code-repair tasks by adversarially corrupting functions within real-world software repositories. Breakpoint systematically controls task difficulty along two clear dimensions: local reasoning (characterized by code complexity metrics such as cyclomatic complexity) and system-level reasoning (characterized by call-graph centrality and the number of simultaneously corrupted interdependent functions). In experiments across more than 900 generated tasks we demonstrate that our methodology can scale to arbitrary difficulty, with state-of-the-art models' success rates ranging from 55% on the easiest tasks down to 0% on the hardest.
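The two difficulty dimensions named in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the helper names `analyze` and `rank_targets`, the `EXAMPLE` module, the branch-counting approximation of cyclomatic complexity, and the use of in-degree as the centrality measure are all assumptions made for illustration, since the abstract does not specify the exact metrics.

```python
import ast

# Node types counted as decision points. This is a common approximation of
# cyclomatic complexity, not necessarily the metric used in the paper.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.IfExp, ast.comprehension)


def analyze(source: str) -> dict:
    """Return approximate complexity and in-degree centrality per top-level function."""
    tree = ast.parse(source)
    funcs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    callers = {name: set() for name in funcs}
    report = {}
    for name, node in funcs.items():
        complexity = 1  # one path through straight-line code
        for child in ast.walk(node):
            if isinstance(child, BRANCH_NODES):
                complexity += 1
            elif isinstance(child, ast.BoolOp):
                # each extra operand of and/or adds a short-circuit branch
                complexity += len(child.values) - 1
            elif isinstance(child, ast.Call) and isinstance(child.func, ast.Name):
                callee = child.func.id
                if callee in funcs and callee != name:
                    callers[callee].add(name)  # edge: name -> callee
        report[name] = {"complexity": complexity}
    # Normalized in-degree: fraction of the other functions that call this one.
    denom = max(len(funcs) - 1, 1)
    for name in funcs:
        report[name]["centrality"] = len(callers[name]) / denom
    return report


def rank_targets(source: str) -> list:
    """Order functions so the most central, then most complex, come first."""
    report = analyze(source)
    return sorted(
        report,
        key=lambda n: (report[n]["centrality"], report[n]["complexity"]),
        reverse=True,
    )


# Hypothetical module used only to exercise the sketch.
EXAMPLE = '''
def clamp(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x

def scale(x):
    return clamp(x * 2, 0, 10)

def shift(x):
    return clamp(x + 1, 0, 10)
'''

if __name__ == "__main__":
    for fn in rank_targets(EXAMPLE):
        print(fn, analyze(EXAMPLE)[fn])
```

In this toy module, `clamp` is called by both other functions and contains two branches, so it ranks first. Under the abstract's framing, corrupting such a function would plausibly yield a harder task than corrupting a leaf function like `scale`, because the resulting failures surface in its callers.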
