RO May 21

Beyond Pixels: Learning Invariant Rewards for Real-World Robotics From a Few Demonstrations

Tengye Xu, Yangting Sun, Ziju Shen, Guanqi Chen, Zhen Fu, Chen Yizhou, Hua Chen, Jia Pan

arXiv:2605.22123 76.7

AI Analysis

For roboticists needing reward functions that work outside lab conditions, this work provides a practical method to learn generalizable rewards from very few demonstrations, addressing a key bottleneck in deploying RL in open-world settings.

The paper tackles the problem of reward function generalization in real-world robotics, proposing a framework that learns invariant symbolic rewards from as few as five demonstrations. The method achieves stronger process alignment and policy rollout ranking than baselines on eight simulated and three real-world tasks, and generalizes zero-shot to variations in position, viewpoint, and object.

Designing reward functions that generalize beyond controlled laboratory settings remains a fundamental challenge in reinforcement learning for robotics. In open-world manipulation problems, a single task can appear in numerous variants through different object instances, positions, and camera viewpoints. Recent vision-based reward models tend to memorize specific pixel distributions and fail to generalize beyond their training conditions. To address this, we propose a framework that learns invariant symbolic reward functions from as few as five demonstrations. The key insight is to shift from visual feature-fitting to the discovery of behavioral invariants: task-level properties that remain constant across diverse visual instantiations. The framework has two coupled components: a structural reward formulation that encodes task-level strategies and physical constraints while preserving optimal policy invariance, and a hybrid symbolic-numerical procedure that distills these invariants from demonstrations without online interaction. Experiments on eight Meta-World tasks and three Franka manipulation tasks demonstrate that our method achieves stronger process alignment and policy rollout ranking than the baselines, which accelerates downstream policy learning. Three real-world out-of-distribution experiments further show that the same learned reward generalizes zero-shot to position, viewpoint, and object variations, enabling a single reward representation to be reused across diverse task variants in practice.
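The abstract does not spell out the structural reward formulation, so the sketch below is only an illustration of the two properties it names. It assumes potential-based reward shaping, a standard technique known to preserve optimal policies, and a hypothetical symbolic potential (negative gripper-to-object distance) that depends on relative geometry and not on pixels. The state layout, the names `potential`, `shaped_reward`, and `shift`, and the discount value are all assumptions for illustration, not the authors' method.

```python
import math

GAMMA = 0.99  # discount factor (assumed value, not from the paper)


def potential(state):
    """Hypothetical symbolic potential: negative gripper-to-object distance.

    It depends only on relative geometry, so translating the whole scene
    leaves it unchanged.
    """
    return -math.dist(state["gripper"], state["object"])


def shaped_reward(base_reward, state, next_state, gamma=GAMMA):
    """Potential-based shaping: r + gamma * phi(s') - phi(s)."""
    return base_reward + gamma * potential(next_state) - potential(state)


def shift(state, offset):
    """Translate every 3D point in the state by the same offset."""
    return {k: tuple(p + o for p, o in zip(v, offset)) for k, v in state.items()}


# One transition in which the gripper moves closer to the object.
s0 = {"gripper": (0.0, 0.0, 0.3), "object": (0.4, 0.0, 0.0)}
s1 = {"gripper": (0.2, 0.0, 0.15), "object": (0.4, 0.0, 0.0)}

# The same transition with the whole scene moved elsewhere on the table.
offset = (1.0, -0.5, 0.0)
r_here = shaped_reward(0.0, s0, s1)
r_moved = shaped_reward(0.0, shift(s0, offset), shift(s1, offset))
```

In this toy setup the shaped reward is positive when the gripper approaches the object and is the same for both scene placements, which is the kind of position invariance the abstract describes. A reward fit directly to pixels has no such guarantee.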
