LG | AI | Mar 22

Reward Sharpness-Aware Fine-Tuning for Diffusion Models

arXiv:2603.21175 | 63.9 | h-index: 7
AI Analysis

The paper addresses a specific vulnerability in aligning diffusion models with human preferences, namely reward hacking, and offers an incremental improvement to the reliability of RDRL applications.

The paper tackles reward hacking in reward-centric diffusion reinforcement learning (RDRL), where reward scores rise without any improvement in perceptual quality. It introduces RSA-FT, a framework that mitigates the problem by using gradients from a robustified reward model, and it reports empirical gains in robustness and reliability.

Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models with human preferences, inspiring the development of reward-centric diffusion reinforcement learning (RDRL) to achieve similar alignment and controllability. While diffusion models can generate high-quality outputs, RDRL remains susceptible to reward hacking, where the reward score increases without corresponding improvements in perceptual quality. We demonstrate that this vulnerability arises from the non-robustness of reward model gradients, particularly when the reward landscape with respect to the input image is sharp. To mitigate this issue, we introduce methods that exploit gradients from a robustified reward model without requiring its retraining. Specifically, we employ gradients from a flattened reward model, obtained through parameter perturbations of the diffusion model and perturbations of its generated samples. Empirically, each method independently alleviates reward hacking and improves robustness, while their joint use amplifies these benefits. Our resulting framework, RSA-FT (Reward Sharpness-Aware Fine-Tuning), is simple, broadly compatible, and consistently enhances the reliability of RDRL.
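The idea of taking gradients from a flattened reward landscape can be illustrated with a toy, one-dimensional sketch. This is not the paper's implementation: the reward function, the Gaussian perturbation, and the names `reward_grad`, `flattened_reward_grad`, `sigma`, and `n_samples` are all assumptions made for illustration, and only the sample-perturbation half of the method is shown (perturbing diffusion-model parameters is omitted). The toy reward has a smooth optimum at x = 1 plus a small high-frequency ripple that makes the landscape sharp.

```python
import math
import random


def reward(x):
    """Hypothetical toy reward: smooth bump at x = 1 plus a sharp ripple."""
    return -0.5 * (x - 1.0) ** 2 + 0.01 * math.sin(200.0 * x)


def reward_grad(x):
    """Analytic gradient of the toy reward with respect to the sample x."""
    return -(x - 1.0) + 2.0 * math.cos(200.0 * x)


def flattened_reward_grad(x, sigma=0.05, n_samples=2000, seed=0):
    """Average the reward gradient over Gaussian perturbations of the sample.

    Assumption: random Gaussian noise stands in for whatever sample
    perturbation the paper actually uses.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += reward_grad(x + rng.gauss(0.0, sigma))
    return total / n_samples


if __name__ == "__main__":
    # At this point the ripple dominates the raw gradient.
    x = math.pi / 200.0
    print("raw gradient:      ", reward_grad(x))
    print("flattened gradient:", flattened_reward_grad(x))
```

At `x = math.pi / 200` the raw gradient is negative, so following it would move the sample away from the smooth optimum at x = 1 while chasing the ripple. The averaged gradient is positive there, because the ripple's contribution largely cancels across perturbed copies of the sample.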

