SE · HC · Jun 1

Trust-Calibrated Code Review: A Participatory Design Study of Review Workflows for LLM-Generated Multi-File Changes

Lo Gullstrand Heander, Agnia Sergeyuk, Ilya Zakharov, Emma Söderberg, Nikita Mukhortov

arXiv:2606.01969 · 52.0

Predicted impact: top 58% in SE · last 90 days
Originality: Incremental advance

AI Analysis

For developers and tool designers, this work provides a conceptual framework for building code review tools tailored to LLM-generated multi-file changes, a scenario for which no validated end-to-end workflow or IDE tooling design previously existed.

The study identifies trust-calibration as the central challenge in reviewing LLM-generated multi-file changes and proposes a three-level review workflow with seven design constructs. In a validation survey, all workflow levels scored above the neutral midpoint (means 3.50–3.91 on a 5-point scale), with 63% of respondents expecting reduced overall review effort and 52% expecting reduced trust-assessment effort relative to their current tools.

Background: Developers increasingly review multi-file code changes generated by LLM-based agents, yet no validated end-to-end workflow or IDE tooling design exists for this scenario. Aims: We investigate (RQ1) the challenges developers face when reviewing LLM-generated multi-file changes and (RQ2) how developers envision effective workflows for this task. Method: In collaboration with JetBrains, we conducted a participatory design study structured using the double-diamond design process with Discover, Define, Develop, and Deliver phases. Industry practitioners participated in the Discover phase (N=17); seven of these returned for the Develop phase. The Define phase was an author-led synthesis. The Deliver phase produced a conceptual design and a high-fidelity semi-interactive prototype evaluated through a follow-up survey with N=43 practitioners. Results: Participants identified trust-calibration as the central challenge. The study yielded a three-level review workflow (overview, file-analysis, code snippet review) supported by seven design constructs (chunk, risk-per-line, risk-per-file, judge, walk-through, zooming in/out, and security cage). In the validation survey, all three workflow levels scored above the neutral midpoint (means 3.50–3.91 on a five-point scale). Of the respondents, 63% expected reduced overall review effort, and 52% expected reduced trust-assessment effort, relative to their current tools. These findings suggest that the design constructs are a promising direction for future tool development. Conclusions: Reviewing LLM-generated multi-file changes is a trust-calibration problem rather than a diffing problem. The three-level workflow and the seven constructs we report give tool designers a conceptual framework for building AI-ready code review tools that surface risk and confidence signals at the granularity at which developers allocate attention.
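
To make the three-level workflow more concrete, the following is a minimal sketch of how the chunk, risk-per-line, and risk-per-file constructs might relate to the overview, file-analysis, and code snippet review levels. It is not the authors' prototype: the class and function names, the 0-to-1 risk scale, the 0.5 threshold, the file paths, and the rule that a file or chunk is as risky as its riskiest line are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A contiguous group of changed lines in one file (hypothetical shape)."""
    start_line: int
    line_risks: list[float]  # assumed: one risk score in [0, 1] per changed line

    def risk(self) -> float:
        # Assumption: a chunk is as risky as its riskiest line.
        return max(self.line_risks, default=0.0)


@dataclass
class FileChange:
    """All chunks changed in a single file."""
    path: str
    chunks: list[Chunk] = field(default_factory=list)

    def risk(self) -> float:
        # Assumption: risk-per-file is the maximum chunk risk in the file.
        return max((c.risk() for c in self.chunks), default=0.0)


def overview(files: list[FileChange]) -> list[FileChange]:
    """Level 1 (overview): files ordered from highest to lowest risk."""
    return sorted(files, key=lambda f: f.risk(), reverse=True)


def file_analysis(file: FileChange) -> list[Chunk]:
    """Level 2 (file-analysis): chunks of one file ordered by risk."""
    return sorted(file.chunks, key=lambda c: c.risk(), reverse=True)


def snippet_review(chunk: Chunk, threshold: float = 0.5) -> list[int]:
    """Level 3 (code snippet review): line numbers at or above the threshold."""
    return [
        chunk.start_line + offset
        for offset, risk in enumerate(chunk.line_risks)
        if risk >= threshold
    ]


# Hypothetical multi-file change; paths and scores are made up.
change = [
    FileChange("README.md", [Chunk(1, [0.05])]),
    FileChange("auth/session.py", [Chunk(40, [0.3]), Chunk(10, [0.1, 0.9, 0.2])]),
]

ranked_files = overview(change)                # zoom out: where to look first
ranked_chunks = file_analysis(ranked_files[0])  # zoom in: riskiest file's chunks
flagged_lines = snippet_review(ranked_chunks[0])  # lines needing close reading
```

Ordering by risk at each level reflects the abstract's point that risk and confidence signals should be surfaced at the granularity at which developers allocate attention. A real tool would need a defined source for the per-line scores, which this sketch leaves open.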
