LGMay 19, 2025

Modular Diffusion Policy Training: Decoupling and Recombining Guidance and Diffusion for Offline RL

arXiv:2506.03154v12 citationsh-index: 1
Originality Highly original
AI Analysis

This work addresses inefficiencies in offline RL training pipelines for researchers and practitioners, offering a modular approach that is incremental but improves upon existing methods.

The paper tackles the suboptimal joint training of guidance and diffusion models in offline reinforcement learning by proposing modular training methods that decouple and recombine them, resulting in reduced memory usage, improved computational efficiency, and enhanced performance, such as an 86% reduction in normalized score variance.

Classifier free guidance has shown strong potential in diffusion-based reinforcement learning. However, existing methods rely on joint training of the guidance module and the diffusion model, which can be suboptimal during the early stages when the guidance is inaccurate and provides noisy learning signals. In offline RL, guidance depends solely on offline data: observations, actions, and rewards, and is independent of the policy module's behavior, suggesting that joint training is not required. This paper proposes modular training methods that decouple the guidance module from the diffusion model, based on three key findings: Guidance Necessity: We explore how the effectiveness of guidance varies with the training stage and algorithm choice, uncovering the roles of guidance and diffusion. A lack of good guidance in the early stage presents an opportunity for optimization. Guidance-First Diffusion Training: We introduce a method where the guidance module is first trained independently as a value estimator, then frozen to guide the diffusion model using classifier-free reward guidance. This modularization reduces memory usage, improves computational efficiency, and enhances both sample efficiency and final performance. Cross-Module Transferability: Applying two independently trained guidance models, one during training and the other during inference, can significantly reduce normalized score variance (e.g., reducing IQR by 86%). We show that guidance modules trained with one algorithm (e.g., IDQL) can be directly reused with another (e.g., DQL), with no additional training required, demonstrating baseline-level performance as well as strong modularity and transferability. We provide theoretical justification and empirical validation on bullet D4RL benchmarks. Our findings suggest a new paradigm for offline RL: modular, reusable, and composable training pipelines.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes