SE · AI · May 18

ProcBench: Evaluating Process-Level Defects and Control Preservation in LLM Coding Agents

arXiv:2605.2025177.7
Predicted impact top 18% in SE · last 90 days · Originality: Incremental advance
AI Analysis

For developers of LLM coding agents, ProcBench provides a structured way to detect recurrent process failures, addressing a gap in existing benchmarks that focus only on final task completion.

ProcBench introduces a framework for evaluating process-level defects and control preservation in LLM coding agents, moving beyond final outcomes. Instantiated on 200 annotated trajectories and applied across three benchmarks, its results suggest that calibration makes defect findings more interpretable than direct thresholding, and that process-aware scorecards draw diagnostic distinctions that outcome-based evaluation misses.

Existing benchmarks for LLM coding agents mainly evaluate final outcomes, such as task completion, compilation success, and test pass rates. While these metrics are useful for measuring end-task capability, they provide limited visibility into how an execution unfolds and often miss recurrent process-level failures that arise during multi-step operation. We present ProcBench, a benchmark-oriented framework for evaluating coding-agent trajectories through process defects and control preservation. ProcBench organizes execution failures into a reusable ontology, standardizes heterogeneous logs into a unified trajectory representation, and reports calibrated risk-based scorecards instead of relying only on final outcomes. We instantiate ProcBench on an annotated set of 200 trajectories and apply it across three coding-agent benchmarks: AndroidBench, TerminalBench, and SWE-bench-Verified. Our results suggest that ProcBench can be instantiated with useful reliability, that calibration improves the empirical interpretability of defect findings relative to direct thresholding, and that process-aware scorecards provide diagnostic distinctions beyond conventional outcome-based evaluation. We also discuss limitations, including annotation dependence, partial observability for some defect classes, and the need for broader external validation.
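
To make the idea of a unified trajectory representation and a process-aware scorecard concrete, here is a minimal sketch in Python. It is not ProcBench's actual schema or scoring method: the field names (`tool`, `action`, `output`, `observation`, `defects`, `events`), the defect label `ignored_error`, and the functions `normalize` and `scorecard` are all hypothetical, and the calibration step the abstract mentions is omitted entirely. The sketch only shows how heterogeneous log entries might be mapped onto one step type, and how per-defect rates could be reported next to the pass rate.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Step:
    index: int
    action: str                      # e.g. "shell", "edit" (hypothetical values)
    observation: str
    defects: list[str] = field(default_factory=list)  # annotated labels (hypothetical)


@dataclass
class Trajectory:
    task_id: str
    benchmark: str
    passed: bool                     # conventional final outcome
    steps: list[Step]


def normalize(raw: dict) -> Trajectory:
    """Map one raw log onto the unified schema.

    Assumes logs name the same thing differently ("tool" vs "action",
    "output" vs "observation"); these key names are illustrative only.
    """
    steps = [
        Step(
            index=i,
            action=event.get("tool", event.get("action", "unknown")),
            observation=event.get("output", event.get("observation", "")),
            defects=list(event.get("defects", [])),
        )
        for i, event in enumerate(raw["events"])
    ]
    return Trajectory(raw["task_id"], raw["benchmark"], bool(raw["passed"]), steps)


def scorecard(trajectories: list[Trajectory]) -> dict:
    """Report the pass rate alongside the share of trajectories showing each defect."""
    if not trajectories:
        raise ValueError("need at least one trajectory")
    n = len(trajectories)
    counts: Counter[str] = Counter()
    for traj in trajectories:
        # Count each defect label at most once per trajectory.
        for label in {d for step in traj.steps for d in step.defects}:
            counts[label] += 1
    return {
        "pass_rate": sum(t.passed for t in trajectories) / n,
        "defect_rate": {label: c / n for label, c in sorted(counts.items())},
    }
```

In a layout like this, a trajectory that passes its tests can still contribute to a defect rate, which is the kind of distinction an outcome-only metric cannot express.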

