Conditional Diffusion Models with Classifier-Free Gibbs-like Guidance
This work addresses a key limitation of conditional diffusion models: it offers a method that improves sample quality without sacrificing diversity. The contribution is incremental in that it builds on existing classifier-free guidance (CFG) techniques.
The paper tackles the trade-off between quality and diversity that arises when conditional diffusion models are sampled with Classifier-Free Guidance (CFG). It identifies a missing Rényi divergence term that corrects CFG so that it is consistent with a proper denoising diffusion model, and it proposes a Gibbs-like sampling method that draws from the desired distribution. The authors report substantial improvements over CFG on image and text-to-audio generation tasks.
Classifier-Free Guidance (CFG) is a widely used technique for improving conditional diffusion models by linearly combining the outputs of conditional and unconditional denoisers. While CFG enhances visual quality and improves alignment with prompts, it often reduces sample diversity, leading to a challenging trade-off between quality and diversity. To address this issue, we make two key contributions. First, CFG generally does not correspond to a well-defined denoising diffusion model (DDM). In particular, contrary to common intuition, CFG does not yield samples from the target distribution associated with the limiting CFG score as the noise level approaches zero -- where the data distribution is tilted by a power $w > 1$ of the conditional distribution. We identify the missing component: a Rényi divergence term that acts as a repulsive force and is required to correct CFG and render it consistent with a proper DDM. Our analysis shows that this correction term vanishes in the low-noise limit. Second, motivated by this insight, we propose a Gibbs-like sampling procedure to draw samples from the desired tilted distribution. This method starts with an initial sample from the conditional diffusion model without CFG and iteratively refines it, preserving diversity while progressively enhancing sample quality. We evaluate our approach on both image and text-to-audio generation tasks, demonstrating substantial improvements over CFG across all considered metrics. The code is available at https://github.com/yazidjanati/cfgig
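As a rough illustration of the linear combination of conditional and unconditional denoisers described above, the following minimal sketch applies CFG to a one-dimensional Gaussian toy problem. It is not taken from the authors' repository. The function names, the Gaussian toy denoiser, and the convention that w = 1 recovers the conditional denoiser (so that w > 1 extrapolates away from the unconditional one) are all assumptions made for illustration; other papers parameterize the guidance scale differently.

```python
import numpy as np


def gaussian_denoiser(x, sigma, mean, std):
    """Toy denoiser (hypothetical helper): the posterior mean E[x0 | x_sigma = x]
    when x0 ~ N(mean, std^2) and x_sigma = x0 + sigma * standard normal noise."""
    return (std**2 * x + sigma**2 * mean) / (std**2 + sigma**2)


def cfg_denoiser(x, sigma, w, cond_denoiser, uncond_denoiser):
    """Linear CFG combination under the assumed convention:
    w = 0 gives the unconditional output, w = 1 the conditional output,
    and w > 1 extrapolates past the conditional output."""
    d_cond = cond_denoiser(x, sigma)
    d_uncond = uncond_denoiser(x, sigma)
    return d_uncond + w * (d_cond - d_uncond)


# Assumed toy setup: conditional data N(2, 1^2), unconditional data N(0, 2^2).
def cond(x, sigma):
    return gaussian_denoiser(x, sigma, mean=2.0, std=1.0)


def uncond(x, sigma):
    return gaussian_denoiser(x, sigma, mean=0.0, std=2.0)


x_noisy = np.array([1.0])
guided = cfg_denoiser(x_noisy, sigma=1.0, w=2.0,
                      cond_denoiser=cond, uncond_denoiser=uncond)
```

With w = 2 the guided output lies beyond the conditional denoiser's output, on the side away from the unconditional one. This is the extrapolation that sharpens samples and, according to the abstract, also reduces diversity.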