SEAIApr 9

Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents

arXiv:2602.0790082.73 citationsh-index: 11
AI Analysis

This work addresses the efficiency and cost-effectiveness of testing practices for developers using LLM agents in software engineering, revealing that current methods are largely incremental and may not justify the interaction budget.

The study investigated whether agent-generated tests improve issue resolution in LLM-based software engineering agents, finding that test writing is common but does not significantly affect final outcomes, with prompt-induced changes in test volume showing no meaningful impact on performance.

Large Language Model (LLM) code agents increasingly resolve repository-level issues by iteratively editing code, invoking tools, and validating candidate patches. In these workflows, agents often write tests on the fly, but the value of this behavior remains unclear. For example, GPT-5.2 writes almost no new tests yet achieves performance comparable to top-ranking agents.This raises a central question: do such tests meaningfully improve issue resolution, or do they mainly mimic a familiar software-development practice while consuming interaction budget? To better understand the role of agent-written tests, we analyze trajectories produced by six strong LLMs on SWE-bench Verified. Our results show that test writing is common, but resolved and unresolved tasks within the same model exhibit similar test-writing frequencies. When tests are written, they mainly serve as observational feedback channels, with value-revealing print statements appearing much more often than assertion-based checks. Based on these insights, we perform a prompt-intervention study by revising the prompts used with four models to either increase or reduce test writing. The results suggest that prompt-induced changes in the volume of agent-written tests do not significantly change final outcomes in this setting. Taken together, these results suggest that current agent-written testing practices reshape process and cost more than final task outcomes.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes