SE · AI · May 18

ProcBench: Evaluating Process-Level Defects and Control Preservation in LLM Coding Agents

Jiawei He, Jie Jia, Chenbo Liu, Chaoyi Xue, Yapeng Song, Xikai Yang, Dong Sun

arXiv:2605.2025177.7

Predicted impact: top 18% in SE · last 90 days
Originality: Incremental advance

AI Analysis

For developers of LLM coding agents, ProcBench provides a structured way to detect recurrent process-level failures, addressing a gap in existing benchmarks, which focus mainly on final outcomes such as task completion, compilation success, and test pass rates.

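ProcBench introduces a framework for evaluating process-level defects and control preservation in LLM coding agents, moving beyond final outcomes. It is instantiated on an annotated set of 200 trajectories and applied across three benchmarks (AndroidBench, TerminalBench, and SWE-bench-Verified). The authors report that calibration makes defect findings easier to interpret than direct thresholding, and that process-aware scorecards surface diagnostic distinctions that outcome-based evaluation misses.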

Existing benchmarks for LLM coding agents mainly evaluate final outcomes, such as task completion, compilation success, and test pass rates. While these metrics are useful for measuring end-task capability, they provide limited visibility into how an execution unfolds and often miss recurrent process-level failures that arise during multi-step operation. We present ProcBench, a benchmark-oriented framework for evaluating coding-agent trajectories through process defects and control preservation. ProcBench organizes execution failures into a reusable ontology, standardizes heterogeneous logs into a unified trajectory representation, and reports calibrated risk-based scorecards instead of relying only on final outcomes. We instantiate ProcBench on an annotated set of 200 trajectories and apply it across three coding-agent benchmarks: AndroidBench, TerminalBench, and SWE-bench-Verified. Our results suggest that ProcBench can be instantiated with useful reliability, that calibration improves the empirical interpretability of defect findings relative to direct thresholding, and that process-aware scorecards provide diagnostic distinctions beyond conventional outcome-based evaluation. We also discuss limitations, including annotation dependence, partial observability for some defect classes, and the need for broader external validation.
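
The pipeline the abstract describes (normalize heterogeneous logs into one trajectory representation, score process defects, calibrate the scores, report a scorecard next to the final outcome) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two defect classes, the raw log fields (`events`, `cmd`, `tool`, `exit_code`, `passed`), the heuristic scores, and the use of histogram binning as the calibration step. None of these is taken from the paper, which may define its ontology and calibration quite differently.

```python
from dataclasses import dataclass, field
from enum import Enum


class Defect(Enum):
    # Hypothetical defect classes; not the paper's ontology.
    REPEATED_ACTION = "repeated_action"
    FAILED_STEPS = "failed_steps"


@dataclass
class Step:
    action: str       # e.g. a shell command or tool call
    observation: str  # what the environment returned
    ok: bool          # whether the step reported success


@dataclass
class Trajectory:
    task_id: str
    steps: list[Step] = field(default_factory=list)
    passed: bool = False  # conventional final outcome


def normalize(raw: dict) -> Trajectory:
    """Map one raw log record (assumed field names) onto the unified form."""
    steps = [
        Step(action=e.get("cmd") or e.get("tool", ""),
             observation=e.get("output", ""),
             ok=e.get("exit_code", 0) == 0)
        for e in raw.get("events", [])
    ]
    return Trajectory(raw["id"], steps, raw.get("passed", False))


def raw_scores(traj: Trajectory) -> dict[Defect, float]:
    """Uncalibrated heuristic scores in [0, 1], one per defect class."""
    n = len(traj.steps)
    if n == 0:
        return {d: 0.0 for d in Defect}
    pairs = list(zip(traj.steps, traj.steps[1:]))
    repeats = sum(1 for a, b in pairs if a.action == b.action)
    failed = sum(1 for s in traj.steps if not s.ok)
    return {
        Defect.REPEATED_ACTION: repeats / max(n - 1, 1),
        Defect.FAILED_STEPS: failed / n,
    }


def fit_binning(labeled: list[tuple[float, bool]], bins: int = 4) -> list[float]:
    """Histogram binning: empirical defect rate per raw-score bin."""
    hits, totals = [0] * bins, [0] * bins
    for score, is_defect in labeled:
        i = min(int(score * bins), bins - 1)
        totals[i] += 1
        hits[i] += int(is_defect)
    return [h / t if t else 0.0 for h, t in zip(hits, totals)]


def calibrated(score: float, table: list[float]) -> float:
    """Look up the calibrated risk for a raw score."""
    return table[min(int(score * len(table)), len(table) - 1)]


def scorecard(traj: Trajectory, tables: dict[Defect, list[float]]) -> dict:
    """Report calibrated per-defect risk alongside the final outcome."""
    risk = {d.value: calibrated(s, tables[d]) for d, s in raw_scores(traj).items()}
    return {"task_id": traj.task_id, "passed": traj.passed, "risk": risk}
```

In this sketch, the calibration tables would be fitted on annotated trajectories (pairs of raw score and human defect label), so that a reported risk reads as an observed defect rate rather than an arbitrary threshold crossing. A trajectory can pass its tests and still carry a high risk for a given defect class, which is the kind of distinction an outcome-only metric cannot express.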
