SE, LG | Feb 17

The Limits of Long-Context Reasoning in Automated Bug Fixing

arXiv:2602.16069v1 | 1 citation | h-index: 3 | Has Code
Originality: Incremental advance
AI Analysis

This work shows that current LLMs have limited usable context capacity despite nominal increases in context length, a gap that matters to software engineering developers and researchers. The advance is incremental because it builds on existing benchmarks.

The study systematically evaluates whether current large language models (LLMs) can reliably perform long-context code debugging and patch generation. Agentic workflows improve performance (e.g., GPT-5-nano reaches a resolve rate of up to 31%), but direct long-context reasoning degrades sharply (e.g., Qwen3-Coder-30B-A3B reaches only a 7% resolve rate at 64k context).

Rapidly increasing context lengths have led to the assumption that large language models (LLMs) can directly reason over entire codebases. Concurrently, recent advances in LLMs have enabled strong performance on software engineering benchmarks, particularly when paired with agentic workflows. In this work, we systematically evaluate whether current LLMs can reliably perform long-context code debugging and patch generation. Using SWE-bench Verified as a controlled experimental setting, we first evaluate state-of-the-art models within an agentic harness (mini-SWE-agent), where performance improves substantially: GPT-5-nano achieves up to a 31% resolve rate on 100 samples, and open-source models such as Deepseek-R1-0528 obtain competitive results. However, token-level analysis shows that successful agentic trajectories typically remain under 20k tokens, and that longer accumulated contexts correlate with lower success rates, indicating that agentic success primarily arises from task decomposition into short-context steps rather than effective long-context reasoning. To directly test long-context capability, we construct a data pipeline where we artificially inflate the context length of the input by placing the relevant files into the context (ensuring perfect retrieval recall); we then study single-shot patch generation under genuinely long contexts (64k-128k tokens). Despite this setup, performance degrades sharply: Qwen3-Coder-30B-A3B achieves only a 7% resolve rate at 64k context, while GPT-5-nano solves none of the tasks. Qualitative analysis reveals systematic failure modes, including hallucinated diffs, incorrect file targets, and malformed patch headers. Overall, our findings highlight a significant gap between nominal context length and usable context capacity in current LLMs, and suggest that existing agentic coding benchmarks do not meaningfully evaluate long-context reasoning.
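
The context-inflation step described in the abstract can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' pipeline: the names `SourceFile`, `approx_tokens`, and `build_context` are hypothetical, the 4-characters-per-token estimate stands in for a real tokenizer, and padding the prompt with unrelated repository files is one assumed way to reach a target length. The only property taken from the abstract is that the relevant files are always included, so retrieval recall is perfect by construction.

```python
from dataclasses import dataclass


@dataclass
class SourceFile:
    path: str
    text: str


def approx_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. Replace with a real tokenizer.
    return max(1, len(text) // 4)


def render(f: SourceFile) -> str:
    # Hypothetical prompt layout: a header line with the path, then the file body.
    return f"### File: {f.path}\n{f.text}\n"


def build_context(
    issue: str,
    gold: list[SourceFile],
    distractors: list[SourceFile],
    budget: int,
) -> tuple[str, int]:
    """Return (prompt, estimated_tokens) with every gold file included."""
    parts = [f"### Issue\n{issue}\n"]
    used = approx_tokens(parts[0])

    # Gold (relevant) files are always included: perfect recall by construction.
    for f in gold:
        block = render(f)
        parts.append(block)
        used += approx_tokens(block)
    if used > budget:
        raise ValueError("budget too small to hold the issue and the gold files")

    # Assumed inflation strategy: add unrelated files until the budget is reached.
    for f in distractors:
        block = render(f)
        cost = approx_tokens(block)
        if used + cost > budget:
            continue  # this file does not fit; a smaller later one still might
        parts.append(block)
        used += cost

    return "".join(parts), used
```

Calling `build_context(issue, gold, distractors, budget=64_000)` would produce a single-shot prompt near the 64k setting. Under this design, a failure to produce a valid patch cannot be blamed on missing files, which is the point of the perfect-recall setup.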

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
