SE AI May 4

The Conversations Beneath the Code: Triadic Data for Long-Horizon Software Engineering Agents

arXiv:2605.02244 82.6

AI Analysis

For researchers training software engineering agents, this paper identifies a data bottleneck and proposes a concrete data collection strategy to enable long-horizon, collaborative capabilities.

The paper argues that current software engineering agents fail at long-horizon, multi-engineer tasks because they are trained on inadequate data (GitHub scrapes, solo trajectories, or human-AI logs). It proposes 'triadic data'—synchronized human-human conversations, human-AI sessions, and cross-functional work—as the necessary substrate, claiming this data is capturable in 12-18 months and key to solving four open questions in agent training.

Frontier software engineering agents have saturated short-horizon benchmarks while regressing on the work that constitutes senior engineering: long-horizon, multi-engineer, ambiguous-specification deliverables. This paper takes a position on what training data is needed to close the gap. The substrate for the next generation of SWE agents is not larger GitHub scrapes, not more solo-agent trajectories, and not, on their own, open human-AI dialogue logs. It is triadic data: synchronized capture of the human-human conversations where engineering context is formed, the human-AI sessions where that context is partially consumed, and the multi-week cross-functional work that surrounds both. We argue that the canonical instantiation of triadic data is two complementary products: long-horizon expert trajectories captured under stimulated-recall protocols, and simulated cross-functional companies, that is, instrumented teams of senior engineers, product managers, designers, and data scientists working through ambiguous deliverables on shared infrastructure. We further specify a four-tier evidence framework through which any such corpus, triadic or otherwise, must justify its quality to a fine-tuning researcher: mechanical verification, statistical corpus characterization, probe experiments, and pre-registered blind evaluation. We argue that this data is capturable in 12-18 months with methods already mature in adjacent fields, that it is the empirical key to four open questions in agent training, and that the field's near-term research agenda should include it explicitly.
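
To make the abstract's two structural ideas concrete, here is a minimal Python sketch of how a triadic record and the four-tier evidence checklist might be represented. The abstract does not specify any schema, so every name here (`Stream`, `Event`, `TriadicRecord`, `missing_tiers`, the shared-clock timestamp) is a hypothetical choice made for illustration, not the authors' format.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stream(Enum):
    """The three capture streams named in the abstract (labels are assumed)."""
    HUMAN_HUMAN = "human_human"            # conversations where context is formed
    HUMAN_AI = "human_ai"                  # sessions where context is partially consumed
    CROSS_FUNCTIONAL = "cross_functional"  # surrounding multi-week work


class EvidenceTier(Enum):
    """The four evidence tiers, in the order the abstract lists them."""
    MECHANICAL_VERIFICATION = 1
    STATISTICAL_CORPUS_CHARACTERIZATION = 2
    PROBE_EXPERIMENTS = 3
    PREREGISTERED_BLIND_EVALUATION = 4


@dataclass(frozen=True)
class Event:
    stream: Stream
    timestamp: float  # assumption: all streams share one clock, in seconds
    actor: str
    content: str


@dataclass
class TriadicRecord:
    """Hypothetical container for one deliverable's synchronized events."""
    deliverable_id: str
    events: list[Event] = field(default_factory=list)

    def timeline(self) -> list[Event]:
        # Interleave all streams on the shared clock.
        return sorted(self.events, key=lambda e: e.timestamp)

    def missing_streams(self) -> set[Stream]:
        return set(Stream) - {e.stream for e in self.events}

    def is_triadic(self) -> bool:
        # Triadic only if every one of the three streams is present.
        return not self.missing_streams()


def missing_tiers(passed: set[EvidenceTier]) -> list[EvidenceTier]:
    """Return the tiers a corpus has not yet satisfied, in tier order."""
    return [tier for tier in EvidenceTier if tier not in passed]


record = TriadicRecord("checkout-redesign")
record.events.append(Event(Stream.HUMAN_AI, 120.0, "engineer_a", "asks agent for a refactor"))
record.events.append(Event(Stream.HUMAN_HUMAN, 30.0, "pm", "clarifies the requirement"))
```

In this sketch the record above is not yet triadic, because no cross-functional event has been captured; `record.missing_streams()` reports which stream is absent. The single merged timeline reflects the "synchronized capture" wording, under the assumption that synchronization means a common timestamp base.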

