Latent Policy Steering through One-Step Flow Policies
This work provides a more robust, higher-fidelity method for offline reinforcement learning, which matters for training robots safely from pre-recorded data without risky exploration.
This paper addresses a core challenge in offline reinforcement learning (RL): policies often stray outside the dataset support. The authors propose Latent Policy Steering (LPS), which backpropagates original-action-space Q-gradients through a differentiable one-step MeanFlow policy to update a latent-action-space actor, achieving state-of-the-art performance on OGBench and real-world robotic tasks.
Offline reinforcement learning (RL) allows robots to learn from offline datasets without risky exploration. Yet, offline RL's performance often hinges on a brittle trade-off between (1) return maximization, which can push policies outside the dataset support, and (2) behavioral constraints, which typically require sensitive hyperparameter tuning. Latent steering offers a structural way to stay within the dataset support during RL, but existing offline adaptations commonly approximate action values using latent-space critics learned via indirect distillation, which can lose information and hinder convergence. We propose Latent Policy Steering (LPS), which enables high-fidelity latent policy improvement by backpropagating original-action-space Q-gradients through a differentiable one-step MeanFlow policy to update a latent-action-space actor. By eliminating proxy latent critics, LPS allows an original-action-space critic to guide end-to-end latent-space optimization, while the one-step MeanFlow policy serves as a behavior-constrained generative prior. This decoupling yields a robust method that works out-of-the-box with minimal tuning. Across OGBench and real-world robotic tasks, LPS achieves state-of-the-art performance and consistently outperforms behavioral cloning and strong latent steering baselines.
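To make the gradient path concrete, the following is a minimal toy sketch of steering a latent actor with action-space Q-gradients; it is not the authors' implementation. Every function and constant in it is a hypothetical stand-in: `decode` plays the role of a frozen one-step generative policy (a fixed tanh map here, not a trained MeanFlow network), `q_value` is a hand-written critic over original actions, and `latent_actor` is a two-parameter linear map. The chain rule is written out by hand so that the sketch needs only the standard library.

```python
import math

# All functions below are hypothetical toy stand-ins, not the paper's models.

def decode(state, z):
    """Frozen one-step generative policy: maps (state, latent) to an action in (-1, 1)."""
    return math.tanh(0.8 * z + 0.3 * state)

def decode_grad_z(state, z):
    """d(action)/d(latent) for the toy decoder above."""
    a = decode(state, z)
    return 0.8 * (1.0 - a * a)

def q_value(state, action):
    """Toy critic over original actions, peaked at action = 0.5 * state."""
    return -(action - 0.5 * state) ** 2

def q_grad_action(state, action):
    """dQ/d(action) for the toy critic above."""
    return -2.0 * (action - 0.5 * state)

def latent_actor(theta, state):
    """Latent actor: chooses a latent z for the given state."""
    return theta[0] + theta[1] * state

def mean_q(theta, states):
    """Average critic value of the actions produced by actor + decoder."""
    total = sum(q_value(s, decode(s, latent_actor(theta, s))) for s in states)
    return total / len(states)

def steer(theta, states, lr=0.1, steps=200):
    """Gradient ascent on Q(s, decode(s, actor(s))); only actor parameters change."""
    theta = list(theta)
    n = len(states)
    for _ in range(steps):
        g0 = g1 = 0.0
        for s in states:
            z = latent_actor(theta, s)
            a = decode(s, z)
            # Chain rule: dQ/dtheta = dQ/da * da/dz * dz/dtheta.
            dq_dz = q_grad_action(s, a) * decode_grad_z(s, z)
            g0 += dq_dz        # dz/dtheta[0] = 1
            g1 += dq_dz * s    # dz/dtheta[1] = s
        theta[0] += lr * g0 / n
        theta[1] += lr * g1 / n
    return theta

states = [-1.0, -0.5, 0.0, 0.5, 1.0]
theta_init = [0.0, 0.0]
theta_final = steer(theta_init, states)
print("mean Q before:", mean_q(theta_init, states))
print("mean Q after: ", mean_q(theta_final, states))
```

In this toy, the decoder and critic stay fixed and only the latent actor is updated, so every action the actor can reach is one the decoder could already produce. That mirrors the structural constraint described above; with real networks, an automatic differentiation library would replace the hand-written derivatives.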