Konstantin A. Neusypin

1 Paper

LG · Sep 2, 2024
Enhancing Sample Efficiency and Exploration in Reinforcement Learning through the Integration of Diffusion Models and Proximal Policy Optimization

Tianci Gao, Konstantin A. Neusypin, Dmitry D. Dmitriev et al.

Proximal Policy Optimization (PPO) is widely used in continuous control due to its robustness and stable training, yet it remains sample-inefficient in tasks with expensive interactions and high-dimensional action spaces. This paper proposes PPO-DAP (PPO with Diffusion Action Prior), a strictly on-policy framework that improves exploration quality and learning efficiency without modifying the PPO objective. PPO-DAP follows a two-stage protocol. Offline, we pretrain a conditional diffusion action prior on logged trajectories to cover the action distribution supported by the behavior policy. Online, PPO updates the actor-critic using only newly collected on-policy rollouts, while the diffusion prior is adapted around the on-policy state distribution via parameter-efficient tuning (Adapter/LoRA) over a small parameter subset. For each on-policy state, the prior generates multiple action proposals and concentrates them toward high-value regions using critic-based energy reweighting and in-denoising gradient guidance. These proposals affect the actor only through a low-weight imitation loss and an optional soft KL regularizer to the prior; importantly, PPO gradients are never backpropagated through offline logs or purely synthetic trajectories. We further analyze the method from a dual-proximal perspective and derive a one-step performance lower bound. Across eight MuJoCo continuous-control tasks under a unified online budget of 1.0M environment steps, PPO-DAP consistently improves early learning efficiency (area under the learning curve over the first 40 epochs, ALC@40) and matches or exceeds the strongest on-policy baselines in final return on 6/8 tasks, with modest overhead (1.18 ± 0.04× wall-clock time and 1.05 ± 0.02× peak GPU memory relative to PPO).
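The critic-based energy reweighting of action proposals mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the energy, so this assumes Boltzmann (softmax) weights over critic values with a temperature parameter; the function names `energy_reweight` and `weighted_imitation_target`, the temperature, and the toy numbers are all hypothetical and are not taken from the paper.

```python
import math


def energy_reweight(q_values, temperature=1.0):
    """Return softmax weights over K proposals, w_i proportional to exp(Q_i / T).

    Assumption: the critic value Q_i acts as a negative energy, so
    proposals with higher value receive more weight.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    q_max = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_imitation_target(proposals, weights):
    """Weighted mean of the proposals, computed per action dimension.

    Assumption: a low-weight imitation loss could regress the actor
    toward a target like this one; the paper may use a different loss.
    """
    dim = len(proposals[0])
    return [
        sum(w * action[d] for w, action in zip(weights, proposals))
        for d in range(dim)
    ]


# Toy example: three 2-D action proposals for one state, with critic values.
proposals = [[0.0, 0.0], [0.5, -0.5], [1.0, 1.0]]
q_values = [0.1, 0.2, 2.0]

weights = energy_reweight(q_values, temperature=0.5)
target = weighted_imitation_target(proposals, weights)
```

A lower temperature concentrates the weight on the highest-value proposal, while a higher one keeps the weights closer to uniform and so preserves more of the prior's diversity.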