AI · LG · Mar 4

A Rubric-Supervised Critic from Sparse Real-World Outcomes

CMU
arXiv:2603.03800v1
h-index: 14
Originality: Incremental advance
AI Analysis

This work addresses the gap between academic benchmarks and the real-world coding agents that developers use. The advance is incremental, since it builds on existing methods for learning from sparse feedback.

The paper tackles the problem of training coding agents from sparse, noisy real-world feedback by learning a critic model from interaction traces. The critic improves best-of-N reranking on SWE-bench (Best@8 +15.9 over Random@8) and enables early stopping (+17.7 with 83% fewer attempts).
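To make the two inference-time uses concrete, here is a minimal sketch of critic-based best-of-N reranking and early stopping. It is an illustration under stated assumptions, not the paper's implementation: `critic_score`, `generate`, `threshold`, and the string-valued trajectories are all hypothetical stand-ins.

```python
from typing import Callable, List, Optional, Tuple


def best_of_n(trajectories: List[str],
              critic_score: Callable[[str], float]) -> str:
    """Rerank: return the candidate the (hypothetical) critic scores highest."""
    if not trajectories:
        raise ValueError("need at least one trajectory")
    return max(trajectories, key=critic_score)


def sample_with_early_stop(generate: Callable[[], str],
                           critic_score: Callable[[str], float],
                           max_attempts: int = 8,
                           threshold: float = 0.8) -> Tuple[str, int]:
    """Sample up to max_attempts candidates, stopping as soon as one
    reaches the score threshold. Returns (best candidate, attempts used).

    The threshold value is an assumption chosen for illustration.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    best: Optional[str] = None
    best_score = float("-inf")
    attempts = 0
    for attempts in range(1, max_attempts + 1):
        candidate = generate()
        score = critic_score(candidate)
        if score > best_score:
            best, best_score = candidate, score
        if best_score >= threshold:
            break  # good enough: skip the remaining attempts
    assert best is not None
    return best, attempts


# Toy usage with made-up scores standing in for a trained critic.
toy_scores = {"a": 0.2, "b": 0.9, "c": 0.5}
print(best_of_n(["a", "b", "c"], toy_scores.__getitem__))
```

Early stopping trades a small risk of missing a better later candidate for fewer attempts, which is the kind of saving the summary's "fewer attempts" figure refers to.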

Academic benchmarks for coding agents tend to reward autonomous task completion, measured by verifiable rewards such as unit-test success. In contrast, real-world coding agents operate with humans in the loop, where success signals are typically noisy, delayed, and sparse. How can we bridge this gap? In this paper, we propose a process to learn a "critic" model from sparse and noisy interaction data, which can then be used as a reward model for either RL-based training or inference-time scaling. Specifically, we introduce Critic Rubrics, a rubric-based supervision framework with 24 behavioral features that can be derived from human-agent interaction traces alone. Using a semi-supervised objective, we can then jointly predict these rubrics and sparse human feedback (when present). In experiments, we demonstrate that, despite being trained primarily from trace-observable rubrics and sparse real-world outcome proxies, these critics improve best-of-N reranking on SWE-bench (Best@8 +15.9 over Random@8 on the rerankable subset of trajectories), enable early stopping (+17.7 with 83% fewer attempts), and support training-time data curation via critic-selected trajectories.
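The abstract does not spell out the semi-supervised objective, so the following is only one plausible reading: a masked multi-task loss in which rubric predictions are supervised on every trace and the outcome prediction is supervised only where human feedback exists. Treating rubrics as binary, using binary cross-entropy, and the `outcome_weight` parameter are all assumptions made for this sketch.

```python
import math
from typing import List, Optional, Sequence


def bce(p: float, y: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one predicted probability p and label y."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def joint_loss(rubric_probs: Sequence[Sequence[float]],
               rubric_labels: Sequence[Sequence[float]],
               outcome_probs: Sequence[float],
               outcome_labels: Sequence[Optional[float]],
               outcome_weight: float = 1.0) -> float:
    """Mean rubric loss over all traces, plus a weighted mean outcome loss
    over only the traces that carry an outcome label (None means missing).
    """
    rubric_terms: List[float] = [
        bce(p, y)
        for probs, labels in zip(rubric_probs, rubric_labels)
        for p, y in zip(probs, labels)
    ]
    if not rubric_terms:
        raise ValueError("need at least one rubric prediction")
    rubric_loss = sum(rubric_terms) / len(rubric_terms)

    # Mask: unlabeled traces contribute nothing to the outcome term.
    outcome_terms = [
        bce(p, y) for p, y in zip(outcome_probs, outcome_labels) if y is not None
    ]
    outcome_loss = sum(outcome_terms) / len(outcome_terms) if outcome_terms else 0.0
    return rubric_loss + outcome_weight * outcome_loss


# Toy batch: two traces, two rubric features each, one outcome label missing.
loss = joint_loss(
    rubric_probs=[[0.5, 0.5], [0.5, 0.5]],
    rubric_labels=[[1.0, 0.0], [0.0, 1.0]],
    outcome_probs=[0.5, 0.5],
    outcome_labels=[None, 1.0],
)
print(loss)
```

The masking is what lets dense, trace-derived rubric labels carry most of the training signal while the sparse outcome labels are used wherever they happen to be available.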

