CV · AI · Nov 23, 2024

Gradient-Free Classifier Guidance for Diffusion Model Sampling

arXiv:2411.15393v1 · 9 citations · h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses the efficiency and fidelity of guided sampling in diffusion models for image generation. The proposed method is complementary to existing guidance techniques and achieves state-of-the-art results on ImageNet 512×512, although the advance is incremental.

The paper tackles the trade-off between computational cost and class alignment in guided sampling for diffusion models by proposing Gradient-free Classifier Guidance (GFCG). GFCG improves class prediction accuracy and, combined with Autoguidance (ATG), achieves a record FD_DINOv2 of 23.09 on ImageNet 512×512 with 94.3% classification precision.

Image generation using diffusion models has demonstrated outstanding learning capabilities, effectively capturing the full distribution of the training dataset. Such models are known to generate wide variations in sampled images, albeit with a trade-off in image fidelity. Guided sampling methods, such as classifier guidance (CG) and classifier-free guidance (CFG), focus sampling on well-learned high-probability regions to generate high-fidelity images, but each has its limitations. CG is computationally expensive because it relies on back-propagation through the classifier to obtain gradients, while CFG, being gradient-free, is more efficient but compromises class label alignment compared to CG. In this work, we propose an efficient guidance method that fully utilizes a pre-trained classifier without using gradient descent. By using the classifier solely in inference mode, a time-adaptive reference class label and a corresponding guidance scale are determined at each time step for guided sampling. Experiments on both class-conditioned and text-to-image diffusion models demonstrate that the proposed Gradient-free Classifier Guidance (GFCG) method consistently improves class prediction accuracy. We also show GFCG to be complementary to other guided sampling methods like CFG. When combined with the state-of-the-art Autoguidance (ATG), without additional computational overhead, it enhances image fidelity while preserving diversity. For ImageNet 512×512, we achieve a record FD_DINOv2 of 23.09, while simultaneously attaining a higher classification Precision (94.3%) than ATG (90.2%).
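The abstract describes guidance only in words, so a small sketch may help fix the ideas. It shows the standard classifier-free guidance extrapolation, plus a hypothetical gradient-free step in the same spirit as GFCG: a classifier's logits on the current denoised estimate (computed in inference mode, with no back-propagation) pick a reference class label and a guidance scale at each time step. The specific rules used here (reference class = most probable non-target class; scale shrinking as the target's probability margin over the reference grows) and the function names `pick_reference_class`, `adaptive_scale`, and `reference_guided_eps` are illustrative assumptions, not the paper's published formulas.

```python
import numpy as np


def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the class-conditional one with scale w."""
    return eps_uncond + w * (eps_cond - eps_uncond)


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()


def pick_reference_class(logits, target):
    """HYPOTHETICAL rule: use the most probable class other than the
    target as the reference label. Only a forward pass of the
    classifier is needed, so no gradients are computed."""
    probs = softmax(np.asarray(logits, dtype=float))
    masked = probs.copy()
    masked[target] = -np.inf  # exclude the target class itself
    return int(np.argmax(masked)), probs


def adaptive_scale(probs, target, ref, s_max=1.0):
    """HYPOTHETICAL rule: full scale s_max while the target is not ahead
    of the reference class, decreasing linearly to 0 as the probability
    margin (target minus reference) approaches 1."""
    margin = probs[target] - probs[ref]  # lies in [-1, 1]
    return s_max * float(np.clip(1.0 - margin, 0.0, 1.0))


def reference_guided_eps(eps_target, eps_ref, scale):
    """Push the noise prediction away from the reference class and
    toward the target class (same extrapolation form as CFG)."""
    return eps_target + scale * (eps_target - eps_ref)


# Usage with made-up numbers: logits stand in for a classifier's output
# on the current denoised estimate; the eps arrays stand in for the
# denoiser's predictions conditioned on the target and reference labels.
logits = np.array([2.0, 1.5, -1.0])
ref, probs = pick_reference_class(logits, target=0)
scale = adaptive_scale(probs, target=0, ref=ref, s_max=2.0)
eps = reference_guided_eps(np.ones(4), np.zeros(4), scale)
```

In this sketch the classifier only influences *which* label is used as the "negative" condition and *how strongly* to extrapolate, which is why its cost is one classifier forward pass per step rather than a backward pass as in CG. How the paper actually combines this term with CFG or ATG is not specified in this summary.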
