SE AI Mar 20

Orchestrating Human-AI Software Delivery: A Retrospective Longitudinal Field Study of Three Software Modernization Programs

arXiv:2603.20028 39.21 citations h-index: 1

AI Analysis

The paper addresses inefficient software delivery for teams in industrial settings, reporting that orchestrated AI workflows yield larger gains than isolated AI tools. The contribution is incremental, since it builds on existing coordination concepts.

The study addresses the scarcity of evidence on team-level AI integration in software engineering by evaluating Chiron, a platform that orchestrates human-AI collaboration across software delivery stages, in three real modernization programs. Under baseline staffing assumptions, the reported results are: portfolio totals fell from 36.0 to 9.3 summed project-weeks, modeled raw effort fell from 1080.0 to 232.5 person-days, validation-stage issue load dropped from 8.03 to 2.09 issues per 100 tasks, and first-release coverage rose from 77.0% to 90.5%.

Evidence on AI in software engineering still leans heavily toward individual task completion, while evidence on team-level delivery remains scarce. We report a retrospective longitudinal field study of Chiron, an industrial platform that coordinates humans and AI agents across four delivery stages: analysis, planning, implementation, and validation. The study covers three real software modernization programs -- a COBOL banking migration (~30k LOC), a large accounting modernization (~400k LOC), and a .NET/Angular mortgage modernization (~30k LOC) -- observed across five delivery configurations: a traditional baseline and four successive platform versions (V1--V4). The benchmark separates observed outcomes (stage durations, task volumes, validation-stage issues, first-release coverage) from modeled outcomes (person-days and senior-equivalent effort under explicit staffing scenarios). Under baseline staffing assumptions, portfolio totals move from 36.0 to 9.3 summed project-weeks; modeled raw effort falls from 1080.0 to 232.5 person-days; modeled senior-equivalent effort falls from 1080.0 to 139.5 SEE-days; validation-stage issue load falls from 8.03 to 2.09 issues per 100 tasks; and first-release coverage rises from 77.0% to 90.5%. V3 and V4 add acceptance-criteria validation, repository-native review, and hybrid human-agent execution, simultaneously improving speed, coverage, and issue load. The evidence supports a central thesis: the largest gains appear when AI is embedded in an orchestrated workflow rather than deployed as an isolated coding assistant.
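To put the headline figures on a common scale, the relative changes can be derived directly from the before/after values quoted in the abstract. The following is a minimal sketch, not part of the paper: the names `relative_reduction`, `REPORTED`, and `summarize` are illustrative, and the only inputs are the numbers stated above. Coverage is handled separately because it is an increase, expressed in percentage points.

```python
# Illustrative arithmetic on the figures quoted in the abstract.
# All function and variable names here are hypothetical, not from the paper.

def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a metric fell, e.g. 0.25 means a 25% reduction."""
    if before == 0:
        raise ValueError("baseline value must be non-zero")
    return (before - after) / before


# (before, after) pairs exactly as reported in the abstract.
REPORTED = {
    "summed project-weeks": (36.0, 9.3),
    "modeled raw effort (person-days)": (1080.0, 232.5),
    "modeled senior-equivalent effort (SEE-days)": (1080.0, 139.5),
    "validation issues per 100 tasks": (8.03, 2.09),
}

# First-release coverage in percent; this metric rises, so report the
# difference in percentage points instead of a relative reduction.
COVERAGE = (77.0, 90.5)


def summarize() -> dict[str, float]:
    """Percentage reduction for each reported metric, rounded to one decimal."""
    return {
        name: round(100 * relative_reduction(before, after), 1)
        for name, (before, after) in REPORTED.items()
    }


coverage_gain_points = COVERAGE[1] - COVERAGE[0]

if __name__ == "__main__":
    for name, pct in summarize().items():
        print(f"{name}: {pct}% lower")
    print(f"first-release coverage: +{coverage_gain_points} percentage points")
```

On these inputs the reductions come out at roughly 74% for project-weeks, 78% for raw effort, 87% for senior-equivalent effort, and 74% for validation-stage issue load, with coverage up 13.5 percentage points. The effort figures are modeled under the paper's staffing scenarios, so their percentages inherit those assumptions.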
