LG ROSep 2, 2024

Enhancing Sample Efficiency and Exploration in Reinforcement Learning through the Integration of Diffusion Models and Proximal Policy Optimization

Tianci Gao, Konstantin A. Neusypin, Dmitry D. Dmitriev, Bo Yang, Shengren Rao

arXiv:2409.01427v62.62 citationsh-index: 3Has Code

Originality Incremental advance

AI Analysis

This work addresses sample efficiency for reinforcement learning in continuous control, though it is incremental as it builds on existing PPO and diffusion methods.

The paper tackles the sample inefficiency of Proximal Policy Optimization (PPO) in continuous control tasks by integrating a diffusion action prior, resulting in improved early learning efficiency and matching or exceeding baselines in final return on 6 out of 8 tasks with modest overhead.

Proximal Policy Optimization (PPO) is widely used in continuous control due to its robustness and stable training, yet it remains sample-inefficient in tasks with expensive interactions and high-dimensional action spaces. This paper proposes PPO-DAP (PPO with Diffusion Action Prior), a strictly on-policy framework that improves exploration quality and learning efficiency without modifying the PPO objective. PPO-DAP follows a two-stage protocol. Offline, we pretrain a conditional diffusion action prior on logged trajectories to cover the action distribution supported by the behavior policy. Online, PPO updates the actor-critic only using newly collected on-policy rollouts, while the diffusion prior is adapted around the on-policy state distribution via parameter-efficient tuning (Adapter/LoRA) over a small parameter subset. For each on-policy state, the prior generates multiple action proposals and concentrates them toward high-value regions using critic-based energy reweighting and in-denoising gradient guidance. These proposals affect the actor only through a low-weight imitation loss and an optional soft KL regularizer to the prior; importantly, PPO gradients are never backpropagated through offline logs or purely synthetic trajectories. We further analyze the method from a dual-proximal perspective and derive a one-step performance lower bound. Across eight MuJoCo continuous-control tasks under a unified online budget of 1.0M environment steps, PPO-DAP consistently improves early learning efficiency (area under the learning curve over the first 40 epochs, ALC@40) and matches or exceeds the strongest on-policy baselines in final return on 6/8 tasks, with modest overhead (1.18+/-0.04x wall-clock time and 1.05+/-0.02x peak GPU memory relative to PPO).

View on arXiv PDF Code

Similar