Reliable and Scalable Robot Policy Evaluation with Imperfect Simulators
This work addresses the problem of costly and statistically unreliable robot policy evaluation faced by robotics researchers, and represents an incremental improvement over existing evaluation methods.
The paper tackles the challenge of rigorous robot policy evaluation by presenting SureSim, a framework that combines small-scale real-world testing with large-scale simulation to provide reliable inferences about real-world performance. The approach saves 20-25% of hardware evaluation effort while achieving similar bounds on policy performance.
Rapid progress in imitation learning, foundation models, and large-scale datasets has led to robot manipulation policies that generalize to a wide range of tasks and environments. However, rigorous evaluation of these policies remains a challenge. In practice, robot policies are typically evaluated on a small number of hardware trials without any statistical assurances. We present SureSim, a framework to augment large-scale simulation with relatively small-scale real-world testing to provide reliable inferences on the real-world performance of a policy. Our key idea is to formalize the problem of combining real and simulation evaluations as a prediction-powered inference problem, in which a small number of paired real and simulation evaluations are used to rectify bias in large-scale simulation. We then leverage non-asymptotic mean estimation algorithms to provide confidence intervals on mean policy performance. Using physics-based simulation, we evaluate both diffusion policy and multi-task fine-tuned \(\pi_0\) on a joint distribution of objects and initial conditions, and find that our approach saves over \(20\text{-}25\%\) of hardware evaluation effort to achieve similar bounds on policy performance.
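To make the prediction-powered inference idea concrete, the following is a minimal sketch, not the SureSim implementation. It assumes binary or bounded outcomes in [0, 1], a set of paired real and simulation trials, and a separate, larger set of simulation-only trials. The function name `ppi_mean_interval` and its parameters are hypothetical. The confidence interval uses Hoeffding's inequality with a union bound, chosen here only because it is simple and non-asymptotic; it is a stand-in for the non-asymptotic mean estimation algorithms mentioned above. Because Hoeffding's inequality ignores variance, this sketch illustrates the bias-rectification structure but is not expected to reproduce the reported savings.

```python
import math


def ppi_mean_interval(real_paired, sim_paired, sim_extra, alpha=0.1):
    """Sketch of a prediction-powered estimate of mean real-world performance.

    real_paired, sim_paired: outcomes in [0, 1] from the same n conditions,
        evaluated on hardware and in simulation respectively (assumed paired).
    sim_extra: outcomes in [0, 1] from N additional simulation-only conditions.
    alpha: miscoverage level; the interval holds with probability >= 1 - alpha
        under the stated boundedness and i.i.d. sampling assumptions.
    """
    n, big_n = len(real_paired), len(sim_extra)
    if n == 0 or big_n == 0 or len(sim_paired) != n:
        raise ValueError("need n > 0 paired trials and N > 0 simulation trials")

    # Large-scale simulation estimate (possibly biased w.r.t. the real world).
    sim_mean = sum(sim_extra) / big_n

    # Rectifier: average real-minus-sim gap measured on the paired trials.
    rectifier = sum(r - s for r, s in zip(real_paired, sim_paired)) / n

    estimate = sim_mean + rectifier

    # Hoeffding half-widths, each at level alpha / 2 (union bound).
    # Simulation outcomes lie in [0, 1] (range 1); gaps lie in [-1, 1] (range 2).
    log_term = math.log(4.0 / alpha)
    half_width = math.sqrt(log_term / (2 * big_n)) + 2 * math.sqrt(log_term / (2 * n))

    # Mean success rate is known to lie in [0, 1], so clip the interval.
    lower = max(0.0, estimate - half_width)
    upper = min(1.0, estimate + half_width)
    return estimate, lower, upper


# Illustrative toy data (made up for this sketch, not results from the paper).
real = [1, 0, 1, 1]
sim = [1, 1, 1, 1]
extra_sim = [1] * 8 + [0] * 2
est, lo, hi = ppi_mean_interval(real, sim, extra_sim, alpha=0.1)
```

The estimate is the simulation mean shifted by the measured real-minus-simulation gap, so a simulator that is biased but well correlated with reality contributes information without contaminating the result. A variance-adaptive bound would tighten the interval when the paired gaps are small, which is where the benefit of pairing comes from.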