LG · AI · May 4

A decoupled diffusion planner that adapts to changing cost limits by using cost-conditioned generation for safety and reward gradients for performance

Rufeng Chen, Zhaofan Zhang, Zhejiang Yang, Hechang Chen, Sihong Xie

arXiv:2605.02777 · 46.7

Predicted impact top 54% in LG · last 90 days · Originality: Incremental advance

AI Analysis

For offline safe reinforcement learning, this work addresses the challenge of adapting to changing safety budgets at deployment time, offering a method that reliably satisfies constraints while maximizing reward.

The paper proposes Safe Decoupled Guidance Diffusion (SDGD), a diffusion-based planner for offline safe RL that adapts to varying cost limits by using cost-conditioned generation for safety and reward gradients for performance. SDGD achieves the strongest safety compliance among baselines, satisfying constraints on 94.7% of tasks (36/38) and obtaining the highest reward among safe methods on 21 tasks.

Offline safe reinforcement learning often requires policies to adapt at deployment time to safety budgets that vary across episodes or change within a single episode. While diffusion-based planners enable flexible trajectory generation, existing guidance schemes often treat reward improvement and constraint satisfaction as competing gradient objectives, which can lead to unreliable safety compliance under cost limits. We reinterpret adaptive safe trajectory generation as sampling from a constrained trajectory distribution, where the budget restricts the trajectory region, and reward shapes preferences within that region. This perspective motivates Safe Decoupled Guidance Diffusion (SDGD), which conditions classifier-free guidance on the cost limit to bias sampling toward trajectories satisfying the specified limit, while using reward-gradient guidance to refine trajectories for higher return. Because direct reward guidance can increase return while also steering samples toward trajectories with higher cumulative cost, we introduce Feasible Trajectory Relabeling (FTR) to reshape reward targets and discourage such directions. We further provide a first-order sampling-time analysis showing that FTR suppresses reward-induced cost drift under a prefix-restorative alignment condition. Extensive evaluations on the DSRL benchmark show that SDGD achieves the strongest safety compliance among baselines, satisfying the constraint on 94.7% of tasks (36/38), while obtaining the highest reward among safe methods on 21 tasks.
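The decoupling described in the abstract can be illustrated with a minimal sketch of a single guided denoising step. This is not the authors' implementation: the function name `guided_noise`, the weights `w_cfg` and `w_reward`, and the noise scale `sigma` are hypothetical, and the combination rule simply pairs the standard classifier-free guidance formula (conditioned here on the cost limit) with a classifier-guidance-style reward gradient term. Feasible Trajectory Relabeling is not modeled.

```python
import numpy as np


def guided_noise(eps_cond, eps_uncond, reward_grad,
                 w_cfg=2.0, w_reward=0.1, sigma=1.0):
    """Combine cost-conditioned CFG with a reward-gradient correction.

    eps_cond    : noise predicted with the cost limit as condition (assumed input)
    eps_uncond  : noise predicted with the condition dropped (assumed input)
    reward_grad : gradient of a learned return estimate w.r.t. the noisy
                  trajectory (assumed input; the estimator is not shown here)
    """
    # Safety channel: classifier-free guidance extrapolates from the
    # unconditional prediction toward the cost-conditioned one.
    eps_cfg = eps_uncond + w_cfg * (eps_cond - eps_uncond)
    # Performance channel: shifting the noise estimate against the reward
    # gradient moves the denoised sample toward higher predicted return.
    return eps_cfg - w_reward * sigma * reward_grad


# Toy usage on a trajectory of 4 steps with 3 state-action dimensions.
rng = np.random.default_rng(0)
eps_c = rng.normal(size=(4, 3))
eps_u = rng.normal(size=(4, 3))
grad_r = rng.normal(size=(4, 3))
eps_hat = guided_noise(eps_c, eps_u, grad_r)
```

Setting `w_reward` to zero leaves only the cost-conditioned channel, which is one way to see that the two signals enter through separate terms rather than as a single blended gradient objective.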
