LG RO Oct 11, 2025

Reinforcement Fine-Tuning of Flow-Matching Policies for Vision-Language-Action Models

arXiv:2510.09976v1 | 4 citations | h-index: 13
Originality: Incremental advance
AI Analysis

This work addresses a computational bottleneck faced by researchers and practitioners in robotics and AI, enabling more efficient online reinforcement fine-tuning of large-scale models. The advance is incremental because it builds on existing flow-matching methods.

The paper tackles reinforcement fine-tuning of Vision-Language-Action models by proposing the Flow Policy Optimization (FPO) algorithm, which avoids the intractable policy-ratio computation in flow-matching policies and achieves stable improvements over imitation and other baselines on the LIBERO benchmark and the ALOHA simulation task.

Vision-Language-Action (VLA) models such as OpenVLA, Octo, and $π_0$ have shown strong generalization by leveraging large-scale demonstrations, yet their performance is still fundamentally constrained by the quality and coverage of supervised data. Reinforcement learning (RL) provides a promising path for improving and fine-tuning VLAs through online interaction. However, conventional policy gradient methods are computationally infeasible for flow-matching-based models because importance sampling requires explicit computation of policy ratios, which is intractable for these models. To overcome this limitation, we propose the Flow Policy Optimization (FPO) algorithm, which reformulates importance sampling by leveraging per-sample changes in the conditional flow-matching objective. Furthermore, FPO achieves stable and scalable online reinforcement fine-tuning of the $π_0$ model by integrating structure-aware credit assignment to enhance gradient efficiency, clipped surrogate objectives to stabilize optimization, multi-step latent exploration to encourage diverse policy updates, and a Q-ensemble mechanism to provide robust value estimation. We evaluate FPO on the LIBERO benchmark and the ALOHA simulation task against supervised, preference-aligned, diffusion-based, autoregressive online RL, and $π_0$-FAST baselines, observing consistent improvements over the imitation prior and strong alternatives, with stable learning under sparse rewards. In addition, ablation studies and analyses of the latent space dynamics highlight the contributions of individual components within FPO, validating the effectiveness of the proposed computational modules and the stable convergence of the conditional flow-matching objective during online RL.
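
To make the central idea concrete, here is a minimal sketch of a likelihood-ratio proxy built from per-sample changes in the conditional flow-matching (CFM) loss, combined with a PPO-style clipped surrogate. The abstract does not give the exact formula, so the form `exp(loss_old - loss_new)`, the linear interpolation path, and every function name and the `clip_eps` default below are illustrative assumptions rather than the paper's implementation.

```python
import math


def cfm_loss(pred_velocity, noise, action):
    """Per-sample CFM loss (mean squared error on the velocity).

    Assumption: a linear path x_t = (1 - t) * noise + t * action,
    so the regression target is the constant velocity (action - noise).
    """
    target = [a - n for a, n in zip(action, noise)]
    return sum((p - v) ** 2 for p, v in zip(pred_velocity, target)) / len(target)


def ratio_proxy(loss_old, loss_new):
    """Hypothetical stand-in for the policy ratio pi_new / pi_old.

    A drop in the per-sample CFM loss is read as the sample becoming
    more likely under the updated policy, giving a ratio above 1.
    """
    return math.exp(loss_old - loss_new)


def clipped_surrogate(loss_old, loss_new, advantage, clip_eps=0.2):
    """PPO-style clipped objective (to be maximized) for one sample."""
    r = ratio_proxy(loss_old, loss_new)
    r_clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, r))
    return min(r * advantage, r_clipped * advantage)
```

In this sketch no explicit action likelihood is ever evaluated: only two CFM losses for the same (noise, action) pair, one under the old parameters and one under the new, are needed, which is what makes the ratio cheap to estimate for a flow-matching policy. The clipping caps how much a single sample can push the update, in the spirit of the clipped surrogate objectives the abstract mentions.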

