RO · CV · Mar 2

$π$-StepNFT: Wider Space Needs Finer Steps in Online RL for Flow-based VLAs

Siting Wang, Xiaofeng Wang, Zheng Zhu, Minnan Pei, Xinyu Cui, Cheng Deng, Jian Zhao, Guan Huang, Haifeng Zhang, Jun Wang

arXiv:2603.02083v1 · 2.2 · h-index: 18

Originality: Incremental advance

AI Analysis

This work addresses a scalability bottleneck for flow-based VLAs in complex real-world applications, though as a fine-tuning framework it appears to be an incremental advance.

The paper tackles the problem of intractable likelihoods during multi-step sampling in flow-based vision-language-action models for online reinforcement learning, proposing π-StepNFT, which eliminates auxiliary value networks and requires only a single forward pass per step. It achieves competitive few-shot robustness on LIBERO and superior generalization on ManiSkill, outperforming value-based baselines in out-of-distribution scenarios.

Flow-based vision-language-action (VLA) models excel in embodied control but suffer from intractable likelihoods during multi-step sampling, hindering online reinforcement learning. We propose $π$-StepNFT (Step-wise Negative-aware Fine-Tuning), a critic-and-likelihood-free framework that requires only a single forward pass per optimization step and eliminates auxiliary value networks. We identify that wider exploration spaces necessitate finer-grained, step-wise guidance for alignment. Empirically, $π$-StepNFT unlocks latent potential on LIBERO with competitive few-shot robustness. Moreover, it achieves superior generalization on ManiSkill, outperforming value-based baselines in OOD scenarios by preventing overfitting to multimodal features. This property offers a scalable solution promising for complex real-world applications.
