Feat2Go: Visual Feature-Grounded Value Estimation for Embodied Reinforcement Learning
This work is relevant to researchers and practitioners in embodied reinforcement learning: it offers a way to train vision-language-action (VLA) models more effectively by mitigating sparse supervision and removing the need for manual reward engineering, two common bottlenecks in long-horizon manipulation tasks.
This paper addresses the challenges of sparse supervision and reward design in embodied reinforcement learning for vision-language-action (VLA) models. The authors introduce Feat2Go, a fine-grained value estimation framework that derives continuous progress targets from a pretrained visual world model, trains a value model to predict them, and uses the predicted value to reshape terminal rewards. In the reported experiments, Feat2Go raises OpenVLA-OFT's average out-of-distribution success on ManiSkill3 from 17.5% to 82.9% and reaches an average success rate of 88.8% on RoboTwin 2.0 in domain-randomized task settings.
Reinforcement learning is a promising approach for improving the capabilities of vision-language-action (VLA) models while avoiding the heavy data requirements of imitation learning. However, its effectiveness for VLA models is often constrained by sparse supervision and the difficulty of designing informative reward signals for long-horizon manipulation. In this work, we present Feat2Go, a fine-grained value estimation framework for embodied reinforcement learning. Specifically, Feat2Go first derives a continuous progress target from a pretrained visual world model by measuring patch-level similarity to subgoal states and partitioning episodes into semantic stages with trend-based clustering. We then train an embodied value model to predict this structural progress from the current observation and task instruction, and use the predicted value to reshape terminal rewards during policy optimization. The proposed framework is compatible with existing VLA policy reinforcement learning pipelines, including PPO and GRPO, and does not rely on manual reward engineering. Extensive experiments on ManiSkill3 and RoboTwin 2.0 demonstrate that Feat2Go consistently improves the performance of existing VLA models under both single-arm and bimanual manipulation settings. More specifically, on ManiSkill3, Feat2Go improves OpenVLA-OFT from 17.5% to 82.9% average out-of-distribution success while retaining 96.9% in-distribution performance. On RoboTwin 2.0, Feat2Go achieves an average success rate of 88.8% in domain-randomized task settings, outperforming prior reinforcement learning methods.
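To make the two core ideas concrete, the following is a minimal sketch of how a progress target might be computed from patch-level similarity to subgoal states, and how a predicted value might reshape a terminal reward. It is not the authors' implementation. Everything in it is an assumption made for illustration: patch features are taken as given arrays rather than produced by a visual world model, stages are advanced by a simple similarity threshold (`reach_thresh`) rather than trend-based clustering, and the function names and the partial-credit rule (`partial_scale`) are hypothetical.

```python
import numpy as np

def patch_similarity(feat, goal):
    # feat, goal: (P, D) patch features; returns the mean per-patch cosine similarity.
    f = feat / (np.linalg.norm(feat, axis=1, keepdims=True) + 1e-8)
    g = goal / (np.linalg.norm(goal, axis=1, keepdims=True) + 1e-8)
    return float(np.mean(np.sum(f * g, axis=1)))

def progress_targets(episode_feats, subgoal_feats, reach_thresh=0.95):
    # Assumption: subgoals are visited in order, and a stage counts as completed
    # once similarity to its subgoal reaches reach_thresh (a stand-in for the
    # trend-based clustering described in the abstract).
    num_stages = len(subgoal_feats)
    stage, sim_start = 0, None
    targets = np.zeros(len(episode_feats))
    for t, feat in enumerate(episode_feats):
        within = 0.0
        if stage < num_stages:
            sim = patch_similarity(feat, subgoal_feats[stage])
            if sim >= reach_thresh:
                stage, sim_start = stage + 1, None
            else:
                if sim_start is None:
                    sim_start = sim  # similarity when the stage began
                # Fraction of the remaining similarity gap closed in this stage.
                within = float(np.clip((sim - sim_start) / (1.0 - sim_start + 1e-8), 0.0, 1.0))
        targets[t] = min((stage + within) / num_stages, 1.0)
    # Running maximum keeps the target non-decreasing over the episode.
    return np.maximum.accumulate(targets)

def reshape_terminal_reward(success, predicted_values, partial_scale=0.5):
    # Hypothetical rule: full reward on success; otherwise partial credit
    # proportional to the best progress the value model predicted.
    return 1.0 if success else partial_scale * float(np.max(predicted_values))

def make_demo_episode(num_patches=4):
    # Synthetic features that move from a start state through two subgoals.
    start, g1, g2 = np.eye(3)
    tile = lambda v: np.tile(v, (num_patches, 1))
    alphas = np.linspace(0.0, 1.0, 5)
    frames = [tile((1 - a) * start + a * g1) for a in alphas]
    frames += [tile((1 - a) * g1 + a * g2) for a in alphas[1:]]
    return np.stack(frames), np.stack([tile(g1), tile(g2)])

if __name__ == "__main__":
    episode, subgoals = make_demo_episode()
    targets = progress_targets(episode, subgoals)
    print(np.round(targets, 3))
    print(reshape_terminal_reward(False, targets[:4]))
```

In a full pipeline, the targets would supervise a value model conditioned on the observation and task instruction, and that model's predictions (not the targets themselves) would be passed to the reward-reshaping step during PPO or GRPO updates.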