LG · AI · Jun 15, 2025

Flow-Based Policy for Online Reinforcement Learning

arXiv:2506.12811v1 · 22 citations · h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses policy expressiveness in online RL, a problem relevant to both researchers and practitioners. The contribution is incremental: it builds on existing flow-based methods and adds a new optimization approach.

The paper tackles the challenge of applying flow-based generative models to online reinforcement learning. It addresses the objective mismatch between static data imitation and dynamic value-based policy optimization, and the resulting framework achieves competitive performance on benchmarks such as DMControl and HumanoidBench.

We present FlowRL, a novel framework for online reinforcement learning that integrates flow-based policy representation with Wasserstein-2-regularized optimization. We argue that in addition to training signals, enhancing the expressiveness of the policy class is crucial for performance gains in RL. Flow-based generative models offer such potential, excelling at capturing complex, multimodal action distributions. However, their direct application in online RL is challenging due to a fundamental objective mismatch: standard flow training optimizes for static data imitation, while RL requires value-based policy optimization through a dynamic buffer, leading to difficult optimization landscapes. FlowRL first models policies via a state-dependent velocity field, generating actions through deterministic ODE integration from noise. We derive a constrained policy search objective that jointly maximizes Q through the flow policy while bounding the Wasserstein-2 distance to a behavior-optimal policy implicitly derived from the replay buffer. This formulation effectively aligns the flow optimization with the RL objective, enabling efficient and value-aware policy learning despite the complexity of the policy class. Empirical evaluations on DMControl and HumanoidBench demonstrate that FlowRL achieves competitive performance in online reinforcement learning benchmarks.
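The action-generation step described in the abstract (a state-dependent velocity field integrated deterministically from noise) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `toy_velocity` is a hypothetical hand-written stand-in for the learned velocity network, and fixed-step Euler integration is used because the abstract does not say which ODE solver or step count the paper uses.

```python
import numpy as np


def toy_velocity(state, action, t):
    """Hypothetical velocity field (not the paper's network).

    Pulls the current action toward a state-dependent target value.
    The time argument t is accepted but unused in this toy field.
    """
    target = np.tanh(state.mean()) * np.ones_like(action)
    return target - action


def sample_action(velocity_fn, state, action_dim, n_steps=10, rng=None):
    """Generate one action by Euler-integrating da/dt = v(state, a, t) on [0, 1].

    The only randomness is the initial noise a_0 ~ N(0, I); the integration
    itself is deterministic.
    """
    rng = np.random.default_rng() if rng is None else rng
    a = rng.standard_normal(action_dim)  # initial noise sample
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        a = a + dt * velocity_fn(state, a, t)  # one explicit Euler step
    return a


if __name__ == "__main__":
    state = np.array([0.5, -0.2, 0.1])  # made-up observation
    action = sample_action(toy_velocity, state, action_dim=2,
                           rng=np.random.default_rng(0))
    print(action)
```

Because the integration is deterministic, two calls with the same seed return the same action. In the framework the abstract describes, the velocity field would instead be trained to maximize Q while keeping the Wasserstein-2 distance to a buffer-derived policy bounded; that training step is not shown here.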

