Score the Steps, Not Just the Goal: VLM-Based Subgoal Evaluation for Robotic Manipulation
This work addresses the need for more detailed and transparent evaluation in robotic manipulation research. It is incremental, however: it builds on existing VLM capabilities without introducing new benchmarks or methods.
The paper tackles the problem of binary success rate reporting in robot learning by proposing a framework for subgoal-level evaluation, which reveals partial competence in multi-step manipulation tasks. The result is a blueprint for StepEval, a cost-aware, VLM-based evaluation framework designed to be scalable and community-driven.
Robot learning papers typically report a single binary success rate (SR), which obscures where a policy succeeds or fails along a multi-step manipulation task. We argue that subgoal-level reporting should become routine: for each trajectory, a vector of per-subgoal SRs that makes partial competence visible (e.g., grasp vs. pour). We propose a blueprint for StepEval, a cost-aware plug-in evaluation framework that uses vision-language models (VLMs) as automated judges of subgoal outcomes from recorded images or videos. Rather than proposing new benchmarks or APIs, our contribution is to outline design principles for a scalable, community-driven open-source project. In StepEval, the primary artifact for policy evaluation is the per-subgoal SR vector. Other quantities (e.g., latency or cost estimates) serve as framework-optimization diagnostics that help the community tune evaluation efficiency and accuracy when ground-truth subgoal success labels are available. We discuss how such a framework can remain model-agnostic, support single- or multi-view inputs, and be lightweight enough to adopt across labs. The intended contribution is a shared direction: a minimal, extensible seed that invites open-source contributions, so that scoring the steps, not just the final goal, becomes a standard and reproducible practice.
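To make the per-subgoal SR vector concrete, the following minimal Python sketch aggregates per-trajectory subgoal verdicts into one success rate per subgoal and contrasts it with a single binary SR. It is an illustration, not the StepEval implementation: the function names, the dictionary layout of the verdicts, and the assumption that a task counts as successful only when every subgoal succeeds are all hypothetical. The verdicts are hard-coded here; in practice they would come from a VLM judge.

```python
from typing import Dict, List, Sequence


def subgoal_success_rates(
    verdicts: Sequence[Dict[str, bool]], subgoals: Sequence[str]
) -> Dict[str, float]:
    """Return the fraction of trajectories in which each subgoal succeeded.

    `verdicts` holds one dict per trajectory, mapping subgoal name to a
    boolean outcome (hypothetical layout). A missing subgoal counts as failed.
    """
    if not verdicts:
        raise ValueError("need at least one trajectory")
    n = len(verdicts)
    return {
        name: sum(bool(traj.get(name, False)) for traj in verdicts) / n
        for name in subgoals
    }


def binary_success_rate(
    verdicts: Sequence[Dict[str, bool]], subgoals: Sequence[str]
) -> float:
    """Single-number SR, assuming task success means all subgoals succeeded."""
    if not verdicts:
        raise ValueError("need at least one trajectory")
    wins = sum(all(traj.get(name, False) for name in subgoals) for traj in verdicts)
    return wins / len(verdicts)


# Four illustrative trajectories of a grasp-then-pour task (made-up verdicts).
SUBGOALS: List[str] = ["grasp", "pour"]
VERDICTS: List[Dict[str, bool]] = [
    {"grasp": True, "pour": True},
    {"grasp": True, "pour": False},
    {"grasp": True, "pour": False},
    {"grasp": False, "pour": False},
]

if __name__ == "__main__":
    print(subgoal_success_rates(VERDICTS, SUBGOALS))  # {'grasp': 0.75, 'pour': 0.25}
    print(binary_success_rate(VERDICTS, SUBGOALS))    # 0.25
```

In this toy data the binary SR of 0.25 hides the fact that grasping works in three of four trajectories, which is the kind of partial competence the per-subgoal vector is meant to expose.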