Preference-Based Alignment of Discrete Diffusion Models
This work addresses a practical problem for researchers and practitioners who use discrete diffusion models in domains such as language modeling and protein sequence generation, though the contribution appears incremental because it adapts an existing method to a new model type.
The paper tackles the challenge of aligning discrete diffusion models with task-specific preferences when explicit reward functions are unavailable. It introduces Discrete Diffusion DPO (D2-DPO), which adapts Direct Preference Optimization to discrete diffusion models, and demonstrates effective alignment on a structured binary sequence generation task.
Diffusion models have achieved state-of-the-art performance across multiple domains, with recent advancements extending their applicability to discrete data. However, aligning discrete diffusion models with task-specific preferences remains challenging, particularly in scenarios where explicit reward functions are unavailable. In this work, we introduce Discrete Diffusion DPO (D2-DPO), the first adaptation of Direct Preference Optimization (DPO) to discrete diffusion models formulated as continuous-time Markov chains. Our approach derives a novel loss function that directly fine-tunes the generative process using preference data while preserving fidelity to a reference distribution. We validate D2-DPO on a structured binary sequence generation task, demonstrating that the method effectively aligns model outputs with preferences while maintaining structural validity. Our results highlight that D2-DPO enables controlled fine-tuning without requiring explicit reward models, making it a practical alternative to reinforcement learning-based approaches. Future research will explore extending D2-DPO to more complex generative tasks, including language modeling and protein sequence generation, as well as investigating alternative noise schedules, such as uniform noising, to enhance flexibility across different applications.
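For orientation, the standard DPO objective that D2-DPO builds on can be sketched for a single preference pair as below. This is the familiar sequence-level DPO loss, not the D2-DPO loss itself: the abstract does not state the continuous-time Markov chain formulation, so it is not reproduced here. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch only).

    logp_w, logp_l: log-likelihoods of the preferred and dispreferred
        samples under the model being fine-tuned.
    ref_logp_w, ref_logp_l: the same quantities under the frozen
        reference model.
    beta: strength of the implicit penalty for drifting from the
        reference (assumed value, not taken from the paper).
    """
    # Scaled difference of log-ratios between the two samples.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) equals softplus(-margin); computed stably.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# When the fine-tuned model equals the reference, the margin is zero
# and the loss is log(2).
baseline = dpo_loss(-1.0, -2.0, -1.0, -2.0)
# Raising the likelihood of the preferred sample lowers the loss.
improved = dpo_loss(-0.5, -2.0, -1.0, -2.0)
```

The reference log-likelihoods are what tie the fine-tuned model to the reference distribution: the loss depends only on how far the model's log-ratios move relative to the reference, which is the role the abstract describes as preserving fidelity to a reference distribution.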