LG · Oct 11, 2023

Score Regularized Policy Optimization through Diffusion Behavior

arXiv:2310.07297v3 · 63 citations · h-index: 10
Originality: Incremental advance
AI Analysis

This paper addresses a computational bottleneck for researchers and practitioners who use diffusion models in offline RL, though it is an incremental improvement over existing methods.

The paper tackles the slow sampling problem in diffusion-based offline reinforcement learning by extracting a deterministic inference policy from critic models and pretrained diffusion behavior models, achieving over 25x faster action sampling while maintaining state-of-the-art performance on D4RL locomotion tasks.

Recent developments in offline reinforcement learning have uncovered the immense potential of diffusion modeling, which excels at representing heterogeneous behavior policies. However, sampling from diffusion policies is considerably slow because it necessitates tens to hundreds of iterative inference steps for one action. To address this issue, we propose to extract an efficient deterministic inference policy from critic models and pretrained diffusion behavior models, leveraging the latter to directly regularize the policy gradient with the behavior distribution's score function during optimization. Our method enjoys powerful generative capabilities of diffusion modeling while completely circumventing the computationally intensive and time-consuming diffusion sampling scheme, both during training and evaluation. Extensive results on D4RL tasks show that our method boosts action sampling speed by more than 25 times compared with various leading diffusion-based methods in locomotion tasks, while still maintaining state-of-the-art performance.
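The core idea in the abstract, adding the behavior distribution's score to the critic's action gradient, can be illustrated with a toy sketch. This is not the authors' implementation. It assumes an analytic Gaussian score in place of a pretrained diffusion behavior model, a quadratic critic, a state-free action in place of a policy network, and a hypothetical temperature `beta` that weights the regularizer. All function names are made up for illustration.

```python
import numpy as np

def q_grad(a, q_star):
    # Action gradient of a toy critic Q(a) = -||a - q_star||^2 (assumed form).
    return -2.0 * (a - q_star)

def behavior_score(a, mean, std):
    # Score grad_a log mu(a) of a Gaussian behavior N(mean, std^2 I).
    # Stand-in for the score a pretrained diffusion model would supply.
    return -(a - mean) / std**2

def extract_action(q_star, mean, std, beta, lr=0.01, steps=5000):
    # Gradient ascent on Q(a) + (1/beta) * log mu(a); no sampling from mu needed.
    # Stable for this toy problem only when lr * (2 + 1/(beta*std^2)) < 2.
    a = np.array(mean, dtype=float)  # start from the behavior mean
    for _ in range(steps):
        a += lr * (q_grad(a, q_star) + behavior_score(a, mean, std) / beta)
    return a

def closed_form(q_star, mean, std, beta):
    # Stationary point of the regularized objective, used to check the loop.
    k = 1.0 / (beta * std**2)
    return (2.0 * q_star + k * mean) / (2.0 + k)

if __name__ == "__main__":
    q_star = np.array([1.0, -1.0])   # action the critic prefers
    mean = np.array([0.0, 0.0])      # center of the behavior data
    print(extract_action(q_star, mean, std=0.5, beta=1.0))
    print(closed_form(q_star, mean, std=0.5, beta=1.0))
```

In this toy setting a smaller `beta` pulls the extracted action toward the behavior mean, while a larger `beta` lets it follow the critic. The result is a single deterministic action obtained without any iterative diffusion sampling.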

Code Implementations: 1 repo
