See, Plan, Rewind: Progress-Aware Vision-Language-Action Models for Robust Robotic Manipulation
This paper addresses failure recovery in robotic manipulation for tasks with unseen instructions and initial states. The contribution appears incremental, since it builds on existing vision-language-action models.
The paper tackles robust robotic manipulation by introducing See, Plan, Rewind (SPR), a progress-aware vision-language-action framework that dynamically grounds language instructions into spatial subgoals and enables error correction through progress monitoring and rewinding. SPR reports a 5% improvement over MolmoAct on the LIBERO benchmark and state-of-the-art robustness on LIBERO-Plus.
Measurement of task progress through explicit, actionable milestones is critical for robust robotic manipulation. This progress awareness enables a model to ground its current task status, anticipate verifiable intermediate states, and detect and recover from failures when progress stalls. To embody this capability, we introduce See, Plan, Rewind (SPR), a progress-aware vision-language-action framework that dynamically grounds language instructions into a sequence of spatial subgoals. SPR operates through a continuous core cycle: Seeing the current state and upcoming milestone, Planning a trajectory towards the next 2D waypoint, and Rewinding to a recoverable state upon failure by monitoring progress against the expected sequence. This closed-loop approach enables robust error correction without requiring additional training data or auxiliary models. Extensive experiments demonstrate the framework's effectiveness, generalization, and robustness: SPR outperforms the MolmoAct baseline by 5% on the LIBERO benchmark. On the challenging LIBERO-Plus benchmark with unseen instructions and initial states, SPR achieves state-of-the-art robustness with the smallest performance drop, surpassing OpenVLA-OFT and UniVLA and demonstrating superior out-of-distribution robustness.
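The See, Plan, Rewind cycle described above can be illustrated with a toy control loop. The following is only a minimal sketch under strong assumptions, not the authors' implementation: the "robot" is a 2D point, subgoals are given as a fixed list of 2D waypoints, a stall is declared when the distance to the current subgoal has not improved for `patience` consecutive checks, and rewinding simply restores the state saved at the last reached milestone. All names (`see_plan_rewind`, `RunLog`, `execute`, `flaky`, `patience`, `stride`, `tol`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def dist(a: Point, b: Point) -> float:
    """Euclidean distance between two 2D points."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5


@dataclass
class RunLog:
    reached: int = 0   # milestones (subgoals) completed
    rewinds: int = 0   # times the loop returned to a checkpoint
    steps: int = 0     # actions sent to the executor


def see_plan_rewind(
    start: Point,
    subgoals: List[Point],
    execute: Callable[[Point, Point, int], Point],
    tol: float = 0.05,
    stride: float = 0.25,
    patience: int = 3,
    max_steps: int = 200,
) -> RunLog:
    """Toy closed loop: see the next milestone, plan a waypoint, rewind on stall."""
    state, checkpoint = start, start
    idx, stalled, best = 0, 0, float("inf")
    log = RunLog()
    while idx < len(subgoals) and log.steps < max_steps:
        # See: current state and upcoming milestone.
        goal = subgoals[idx]
        d = dist(state, goal)
        if d <= tol:
            # Milestone reached: advance and save a recoverable state.
            idx, checkpoint = idx + 1, state
            log.reached += 1
            best, stalled = float("inf"), 0
            continue
        # Monitor progress against the best distance seen for this milestone.
        if d < best - 1e-9:
            best, stalled = d, 0
        else:
            stalled += 1
        if stalled >= patience:
            # Rewind (assumption: the saved state can be restored directly;
            # a real robot would have to execute actions to get back there).
            state = checkpoint
            log.rewinds += 1
            best, stalled = float("inf"), 0
            continue
        # Plan: next 2D waypoint, at most `stride` along the line to the goal.
        s = min(stride, d)
        waypoint = (state[0] + s * (goal[0] - state[0]) / d,
                    state[1] + s * (goal[1] - state[1]) / d)
        state = execute(state, waypoint, log.steps)
        log.steps += 1
    return log


def flaky(state: Point, waypoint: Point, t: int) -> Point:
    """Hypothetical executor that fails to move on steps 2 through 5."""
    return state if 2 <= t <= 5 else waypoint


if __name__ == "__main__":
    print(see_plan_rewind((0.0, 0.0), [(1.0, 0.0), (1.0, 1.0)], flaky))
```

With an executor that always reaches the planned waypoint, the loop never rewinds. With `flaky`, progress stalls, the loop returns to the last checkpoint, and then completes both milestones. The sketch uses no extra data or auxiliary model for recovery, which loosely mirrors the closed-loop claim in the abstract, though the stall criterion here is an illustrative choice.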