RO | AP | Mar 13

Beyond Binary Success: Sample-Efficient and Statistically Rigorous Robot Policy Comparison

arXiv:2603.13616 | 87.6 | h-index: 15
AI Analysis

This work addresses the need for reliable, efficient policy comparison in robotics, particularly for generalist manipulation policies. It provides a unified approach that handles metrics beyond binary success. The contribution is incremental, since it builds on existing sequential inference methods.

The paper tackles efficient, rigorous evaluation of robot manipulation policies under resource constraints. It introduces a sample-efficient, statistically rigorous framework based on safe, anytime-valid inference (SAVI), which reduces evaluation burden by up to 70% compared to standard batch methods and by up to 50% compared to state-of-the-art sequential procedures designed for binary outcomes.

Generalist robot manipulation policies are becoming increasingly capable, but are limited in evaluation to a small number of hardware rollouts. This strong resource constraint in real-world testing necessitates both more informative performance measures and reliable and efficient evaluation procedures to properly assess model capabilities and benchmark progress in the field. This work presents a novel framework for robot policy comparison that is sample-efficient, statistically rigorous, and applicable to a broad set of evaluation metrics used in practice. Based on safe, anytime-valid inference (SAVI), our test procedure is sequential, allowing the evaluator to stop early when sufficient statistical evidence has accumulated to reach a decision at a pre-specified level of confidence. Unlike previous work developed for binary success, our unified approach addresses a wide range of informative metrics: from discrete partial credit task progress to continuous measures of episodic reward or trajectory smoothness, spanning both parametric and nonparametric comparison problems. Through extensive validation on simulated and real-world evaluation data, we demonstrate up to 70% reduction in evaluation burden compared to standard batch methods and up to 50% reduction compared to state-of-the-art sequential procedures designed for binary outcomes, with no loss of statistical rigor. Notably, our empirical results show that competing policies can be separated more quickly when using fine-grained task progress than binary success metrics.
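To make the idea of a sequential, anytime-valid comparison concrete, here is a minimal sketch of a generic "testing by betting" procedure. It is not the paper's method. It assumes paired rollouts whose per-trial score difference d = score_A - score_B lies in [-1, 1] (for example, task progress in [0, 1]), and it tests the one-sided null hypothesis that policy A is no better than policy B on average. The function name `sequential_betting_test`, the fixed bet size, and the simulated scores are all hypothetical choices made for illustration.

```python
import random


def sequential_betting_test(diffs, alpha=0.05, bet=0.5):
    """One-sided sequential test of H0: E[d] <= 0 for differences d in [-1, 1].

    Illustrative sketch only (not the paper's procedure). The wealth
    W_t = prod(1 + bet * d_i) is a nonnegative supermartingale under H0,
    so by Ville's inequality it exceeds 1/alpha with probability at most
    alpha, no matter when the evaluator decides to stop.
    Returns (rejected, trials_used, final_wealth).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    if not 0.0 <= bet < 1.0:
        # bet < 1 keeps every factor 1 + bet * d strictly positive.
        raise ValueError("bet must be in [0, 1)")

    wealth = 1.0
    threshold = 1.0 / alpha
    trials = 0
    for trials, d in enumerate(diffs, start=1):
        if not -1.0 <= d <= 1.0:
            raise ValueError("each difference must lie in [-1, 1]")
        wealth *= 1.0 + bet * d
        if wealth >= threshold:
            # Enough evidence accumulated: stop early and reject H0.
            return True, trials, wealth
    return False, trials, wealth


if __name__ == "__main__":
    # Hypothetical simulated task-progress scores for two policies.
    rng = random.Random(0)
    diffs = (rng.uniform(0.4, 1.0) - rng.uniform(0.0, 0.8) for _ in range(200))
    print(sequential_betting_test(diffs, alpha=0.05))
```

A fixed bet keeps the sketch short. Practical procedures typically adapt the bet to past observations, which preserves validity as long as each bet is chosen before the next outcome is seen. Because graded scores carry more information per rollout than a 0/1 outcome, wealth can grow faster on fine-grained metrics, which is consistent with the abstract's observation about task progress versus binary success.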

