RO | CV | Dec 3, 2025

RoboScape-R: Unified Reward-Observation World Models for Generalizable Robotics Training via RL

arXiv:2512.03556v1 | 2 citations | h-index: 6
Originality: Incremental advance
AI Analysis

This paper addresses limited generalization in robotics training, a problem relevant to both researchers and practitioners. It appears incremental, however, because it builds on existing world model concepts.

The paper tackles the challenge of training generalizable embodied policies by proposing RoboScape-R, a framework that uses world models to generate endogenous rewards for reinforcement learning, resulting in a 37.5% average performance improvement over baselines in out-of-domain scenarios.

Achieving generalizable embodied policies remains a key challenge. Traditional policy learning paradigms, including both Imitation Learning (IL) and Reinforcement Learning (RL), struggle to cultivate generalizability across diverse scenarios. While IL policies often overfit to specific expert trajectories, RL suffers from the inherent lack of a unified and general reward signal necessary for effective multi-scene generalization. We posit that the world model is uniquely capable of serving as a universal environment proxy to address this limitation. However, current world models primarily focus on their ability to predict observations and still rely on task-specific, handcrafted reward functions, thereby failing to provide a truly general training environment. To address this problem, we propose RoboScape-R, a framework leveraging the world model to serve as a versatile, general-purpose proxy for the embodied environment within the RL paradigm. We introduce a novel world model-based general reward mechanism that generates "endogenous" rewards derived from the model's intrinsic understanding of real-world state transition dynamics. Extensive experiments demonstrate that RoboScape-R effectively addresses the limitations of traditional RL methods by providing an efficient and general training environment that substantially enhances the generalization capability of embodied policies. Our approach offers critical insights into utilizing the world model as an online training strategy and achieves an average 37.5% performance improvement over baselines under out-of-domain scenarios.
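
The idea of a world model acting as the environment proxy, supplying both the next observation and the reward, can be illustrated with a toy sketch. Everything below is hypothetical and is not taken from the paper: `ToyWorldModel`, `WorldModelEnv`, `rollout`, the 1-D dynamics, and the progress-based reward are illustrative assumptions standing in for a learned model. The only point is the interface: the policy never touches a handcrafted, environment-side reward function.

```python
from dataclasses import dataclass


@dataclass
class ToyWorldModel:
    """Hypothetical stand-in for a learned world model on a 1-D reaching task."""
    goal: float = 1.0

    def predict_observation(self, obs: float, action: float) -> float:
        # Toy dynamics (assumption): the action shifts the state directly.
        return obs + action

    def predict_reward(self, obs: float, next_obs: float) -> float:
        # "Endogenous" reward (assumption): computed by the model from the
        # transition alone, here as progress made toward the goal.
        return abs(self.goal - obs) - abs(self.goal - next_obs)


class WorldModelEnv:
    """Gym-like wrapper in which the world model yields observation and reward."""

    def __init__(self, model: ToyWorldModel, horizon: int = 10):
        self.model = model
        self.horizon = horizon
        self.obs = 0.0
        self.t = 0

    def reset(self) -> float:
        self.obs, self.t = 0.0, 0
        return self.obs

    def step(self, action: float):
        next_obs = self.model.predict_observation(self.obs, action)
        reward = self.model.predict_reward(self.obs, next_obs)
        self.obs = next_obs
        self.t += 1
        done = self.t >= self.horizon
        return next_obs, reward, done


def rollout(env: WorldModelEnv, policy) -> float:
    """Run one episode inside the world model and return the summed reward."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, done = env.step(policy(obs))
        total += reward
    return total


if __name__ == "__main__":
    env = WorldModelEnv(ToyWorldModel(goal=1.0), horizon=10)
    # A policy that moves halfway to the goal at every step.
    print(rollout(env, lambda obs: 0.5 * (1.0 - obs)))
```

In this sketch an RL algorithm would optimize the policy against `rollout` returns. Swapping the toy model for a learned one would leave the training loop unchanged, which is the sense in which the world model serves as a general-purpose proxy.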

