LG · Dec 2, 2025

GoRL: An Algorithm-Agnostic Framework for Online Reinforcement Learning with Generative Policies

Chubin Zhang, Zhenglin Wan, Feng Chen, Xingrui Yu, Ivor Tsang, Bo An

arXiv:2512.02581v1 · 9.44 citations · h-index: 2

Originality: Highly original

AI Analysis

This paper addresses the unstable training of multimodal generative policies in online RL and offers a practical solution for complex control tasks.

The paper tackles the trade-off between stability and expressiveness in online reinforcement learning by introducing GoRL, a framework that decouples optimization from generation: a tractable latent policy is optimized, while a conditional generative decoder synthesizes the actions. On the HopperStand task, GoRL reaches a normalized return above 870, more than three times that of the strongest baseline.

Reinforcement learning (RL) faces a persistent tension: policies that are stable to optimize are often too simple to represent the multimodal action distributions needed for complex control. Gaussian policies provide tractable likelihoods and smooth gradients, but their unimodal form limits expressiveness. Conversely, generative policies based on diffusion or flow matching can model rich multimodal behaviors; however, in online RL, they are frequently unstable due to intractable likelihoods and noisy gradients propagating through deep sampling chains. We address this tension with a key structural principle: decoupling optimization from generation. Building on this insight, we introduce GoRL (Generative Online Reinforcement Learning), a framework that optimizes a tractable latent policy while utilizing a conditional generative decoder to synthesize actions. A two-timescale update schedule enables the latent policy to learn stably while the decoder steadily increases expressiveness, without requiring tractable action likelihoods. Across a range of continuous-control tasks, GoRL consistently outperforms both Gaussian policies and recent generative-policy baselines. Notably, on the HopperStand task, it reaches a normalized return above 870, more than 3 times that of the strongest baseline. These results demonstrate that separating optimization from generation provides a practical path to policies that are both stable and highly expressive.
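The decoupling described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the authors' implementation: the class names `LatentGaussianPolicy` and `ToyDecoder`, the one-dimensional state and latent, the fixed `tanh` decoder standing in for a diffusion or flow-matching model, and the periodic schedule in `two_timescale_schedule` are all hypothetical. The point is only that the log-likelihood and its gradient are computed on the latent variable, so the decoder never needs a tractable action likelihood.

```python
import math
import random


class LatentGaussianPolicy:
    """Hypothetical tractable latent policy: z ~ N(weight * state, std^2)."""

    def __init__(self, weight=0.0, log_std=0.0):
        self.weight = weight
        self.log_std = log_std

    def mean(self, state):
        return self.weight * state

    def sample(self, state, rng):
        return rng.gauss(self.mean(state), math.exp(self.log_std))

    def log_prob(self, state, z):
        # Closed-form Gaussian log-density of the latent, not of the action.
        std = math.exp(self.log_std)
        normalized = (z - self.mean(state)) / std
        return -0.5 * normalized ** 2 - self.log_std - 0.5 * math.log(2.0 * math.pi)

    def grad_log_prob_weight(self, state, z):
        # d/dw log N(z; w * s, std^2) = (z - w * s) * s / std^2
        std = math.exp(self.log_std)
        return (z - self.mean(state)) * state / (std ** 2)


class ToyDecoder:
    """Placeholder for a conditional generative decoder (assumed, not from the paper)."""

    def __init__(self, scale=1.0):
        self.scale = scale

    def decode(self, state, z):
        # A fixed nonlinear map from (state, latent) to a bounded action.
        return math.tanh(self.scale * z) + 0.1 * state


def act(policy, decoder, state, rng):
    """Sample a latent, decode it into an action, and return the latent log-prob."""
    z = policy.sample(state, rng)
    action = decoder.decode(state, z)
    return z, action, policy.log_prob(state, z)


def two_timescale_schedule(num_steps, decoder_period):
    """One assumed schedule: the latent policy updates every step,
    the decoder only every `decoder_period` steps."""
    return [
        "latent+decoder" if (t + 1) % decoder_period == 0 else "latent"
        for t in range(num_steps)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    policy = LatentGaussianPolicy(weight=0.5)
    decoder = ToyDecoder()
    z, action, log_prob = act(policy, decoder, state=2.0, rng=rng)
    print(f"latent={z:.3f} action={action:.3f} log_prob={log_prob:.3f}")
    print(two_timescale_schedule(num_steps=6, decoder_period=3))
```

In this sketch a policy-gradient method would weight `grad_log_prob_weight` by an advantage estimate to update the latent policy, while the decoder would be trained separately on its own, slower schedule. How GoRL actually trains its decoder and sets the two timescales is specified in the paper, not here.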
