ML · AI · CL · LG · Feb 3, 2025

Fine-Tuning Discrete Diffusion Models with Policy Gradient Methods

arXiv:2502.01384v2 · 35 citations · h-index: 8 · Has Code
Originality: Incremental advance
AI Analysis

This work is aimed at researchers and practitioners who work with discrete diffusion models and reinforcement learning. It is an incremental advance in fine-tuning techniques for these models.

The paper tackles the challenge of fine-tuning discrete diffusion models using policy gradient methods for non-differentiable rewards, proposing Score Entropy Policy Optimization (SEPO) and demonstrating its scalability and efficiency across multiple discrete generative tasks.

Discrete diffusion models have recently gained significant attention due to their ability to process complex discrete structures for language modeling. However, fine-tuning these models with policy gradient methods, as is commonly done in Reinforcement Learning from Human Feedback (RLHF), remains a challenging task. We propose an efficient, broadly applicable, and theoretically justified policy gradient algorithm, called Score Entropy Policy Optimization (SEPO), for fine-tuning discrete diffusion models over non-differentiable rewards. Our numerical experiments across several discrete generative tasks demonstrate the scalability and efficiency of our method. Our code is available at https://github.com/ozekri/SEPO.
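To make "policy gradient over a non-differentiable reward" concrete, here is a minimal sketch of plain REINFORCE on a toy discrete sequence model. It is not SEPO and is not taken from the authors' repository: the factorized categorical policy, the token-counting reward, and every name in it (`reward`, `reinforce_step`, `train`) are assumptions chosen for illustration. It shows only the shared idea, which is that the reward is evaluated on sampled sequences and never differentiated.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward(seq):
    """Hypothetical black-box reward: how many tokens equal 2 (non-differentiable)."""
    return float(sum(1 for tok in seq if tok == 2))


def sample_sequence(theta, rng):
    """Sample one token per position from independent categorical distributions."""
    return [rng.choices(range(len(row)), weights=softmax(row))[0] for row in theta]


def reinforce_step(theta, rng, batch_size=64, lr=0.5):
    """One REINFORCE update with a batch-mean baseline; returns the mean reward."""
    seqs = [sample_sequence(theta, rng) for _ in range(batch_size)]
    rewards = [reward(s) for s in seqs]
    baseline = sum(rewards) / batch_size
    # Probabilities of the sampling policy, frozen before any update is applied.
    probs = [softmax(row) for row in theta]
    for seq, r in zip(seqs, rewards):
        advantage = r - baseline
        for pos, tok in enumerate(seq):
            for v in range(len(theta[pos])):
                # d/d theta[pos][v] of log p(tok) for a softmax policy.
                grad_logp = (1.0 if v == tok else 0.0) - probs[pos][v]
                theta[pos][v] += lr * advantage * grad_logp / batch_size
    return baseline


def train(steps=200, seq_len=4, vocab=3, seed=0):
    """Run REINFORCE from uniform logits; returns final logits and reward history."""
    rng = random.Random(seed)
    theta = [[0.0] * vocab for _ in range(seq_len)]
    history = [reinforce_step(theta, rng) for _ in range(steps)]
    return theta, history
```

Calling `theta, history = train()` returns the learned logits and the per-step mean reward, which should rise as probability mass shifts toward the rewarded token. A discrete diffusion model replaces the factorized policy here with a multi-step denoising process, which is the setting the paper's algorithm addresses.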

Code Implementations: 1 repo