RO · AI · May 12

Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation

Hao Wang, Joshua Bowden, Colton Crosby, Somil Bansal

arXiv:2605.11479 · 53.1

Predicted impact: top 41% in RO · last 90 days

Originality: Incremental advance

AI Analysis

For roboticists evaluating manipulation policies with sparse rewards, this method provides a more robust evaluation tool that mitigates finite-horizon truncation bias.

The paper proposes a liveness-based Bellman operator for offline policy evaluation in robotic manipulation. The operator reduces truncation bias from finite-length rollouts and reflects task progress more accurately, outperforming TD(0) and Monte Carlo policy evaluation on two simulated manipulation tasks and a cloth-folding task with human demonstrations.

Policy evaluation is a fundamental component of the development and deployment pipeline for robotic policies. In modern manipulation systems, this problem is particularly challenging: rewards are often sparse, task progression in evaluation rollouts is often non-monotonic because policies exhibit recovery behaviors, and evaluation rollouts are necessarily of finite length. This finite length introduces truncation bias, which violates the infinite-horizon assumptions underlying standard methods that rely on the Bellman equation and the principle of optimality. In this work, we propose a framework for offline policy evaluation from sparse rewards based on a liveness-based Bellman operator. Our formulation interprets policy evaluation as a task-completion problem and yields a conservative fixed-point value function that is robust to finite-horizon truncation. We analyze the theoretical properties of the proposed operator, including contraction guarantees, and show how it encodes task progression while mitigating truncation bias. We evaluate our method on two simulated manipulation tasks, using both a Vision-Language-Action model and a diffusion policy, and on a cloth-folding task using human demonstrations. Empirical results demonstrate that our approach reflects task progress more accurately and substantially reduces truncation bias, outperforming classical baselines such as TD(0) and Monte Carlo policy evaluation.
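The truncation bias mentioned in the abstract can be illustrated with a small sketch of the Monte Carlo baseline. This is not the paper's liveness-based operator, whose definition is not given here. It only assumes a sparse reward of 1 at the step where the task completes, and the discount factor, completion steps, and horizon are hypothetical values chosen for illustration.

```python
from statistics import mean


def discounted_return(rewards, gamma):
    """Discounted sum of a reward sequence, accumulated backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def sparse_rollout(completion_step, length):
    """Reward 1.0 at the completion step, 0.0 elsewhere (hypothetical task)."""
    return [1.0 if t == completion_step else 0.0 for t in range(length)]


# Hypothetical settings, not taken from the paper.
gamma = 0.99
completion_steps = [20, 60, 150]  # step at which each rollout would succeed
horizon = 100                     # evaluation rollouts are cut off here

# Returns if every rollout were allowed to run until completion.
full_returns = [
    discounted_return(sparse_rollout(k, k + 1), gamma) for k in completion_steps
]

# Returns when every rollout is truncated at the horizon: the rollout that
# would have succeeded at step 150 now contributes zero reward.
truncated_returns = [
    discounted_return(sparse_rollout(k, horizon), gamma) for k in completion_steps
]

mc_full = mean(full_returns)
mc_truncated = mean(truncated_returns)
truncation_bias = mc_full - mc_truncated

print(f"untruncated Monte Carlo estimate: {mc_full:.4f}")
print(f"truncated Monte Carlo estimate:   {mc_truncated:.4f}")
print(f"truncation bias:                  {truncation_bias:.4f}")
```

In this toy setting the truncated estimate treats a slow but eventually successful rollout the same as a failure, so it underestimates the value by the discounted reward that falls beyond the horizon.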
