Prompt-aware classifier-free guidance for diffusion models
This work addresses a specific bottleneck in diffusion models for image and audio generation: the choice of classifier-free guidance scale. The contribution is incremental but practical, since it improves output quality without retraining the pretrained backbone.
The paper tackles the failure of a fixed guidance scale to generalize across prompts, which leads to oversaturation or weak alignment. It introduces a prompt-aware framework that predicts scale-dependent quality and selects a guidance scale for each prompt at inference. The authors report consistent improvements in fidelity, alignment, and perceptual preference over vanilla CFG on MSCOCO 2014 and AudioCaps.
Diffusion models have achieved remarkable progress in image and audio generation, largely due to Classifier-Free Guidance (CFG). However, the choice of guidance scale remains underexplored: a fixed scale often fails to generalize across prompts of varying complexity, leading to oversaturation or weak alignment. We address this gap by introducing a prompt-aware framework that predicts scale-dependent quality and selects the optimal guidance scale at inference. Specifically, we construct a large synthetic dataset by generating samples under multiple scales and scoring them with reliable evaluation metrics. A lightweight predictor, conditioned on semantic embeddings and linguistic complexity, estimates multi-metric quality curves and determines the best scale via a utility function with regularization. Experiments on MSCOCO 2014 and AudioCaps show consistent improvements over vanilla CFG, enhancing fidelity, alignment, and perceptual preference. This work demonstrates that prompt-aware scale selection provides an effective, training-free enhancement for pretrained diffusion backbones.
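The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the form of the utility function or the regularizer, so the weighted sum of metric curves and the quadratic penalty toward a default scale used here are assumptions, as are the names `cfg_combine`, `select_guidance_scale`, `default_scale`, and `reg_strength`. The quality curves are hard-coded placeholder numbers standing in for the output of the learned predictor, not results from the paper.

```python
import numpy as np


def cfg_combine(eps_uncond, eps_cond, scale):
    """Common CFG formulation: move from the unconditional noise
    prediction toward the conditional one by a factor of `scale`."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


def select_guidance_scale(scales, quality_curves, weights,
                          default_scale=6.0, reg_strength=0.05):
    """Pick the candidate scale with the highest regularized utility.

    quality_curves has shape (n_metrics, n_scales): one predicted quality
    curve per metric. The utility form below is an assumption, not the
    paper's definition.
    """
    scales = np.asarray(scales, dtype=float)
    curves = np.asarray(quality_curves, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Weighted sum of the per-metric curves at each candidate scale.
    weighted_quality = weights @ curves
    # Assumed regularizer: quadratic penalty for straying from a default scale.
    penalty = reg_strength * (scales - default_scale) ** 2
    utility = weighted_quality - penalty
    best = int(np.argmax(utility))
    return float(scales[best]), utility


# Placeholder predictions for one prompt (hypothetical values).
candidate_scales = [2.0, 4.0, 6.0, 8.0]
predicted_curves = [
    [0.2, 0.6, 0.7, 0.5],  # e.g. a fidelity-style metric
    [0.3, 0.5, 0.8, 1.1],  # e.g. an alignment-style metric
]
metric_weights = [0.5, 0.5]

chosen_scale, utility = select_guidance_scale(
    candidate_scales, predicted_curves, metric_weights
)
```

In this sketch the penalty acts as a prior toward a conventional scale: with `reg_strength=0.0` the highest candidate wins on raw predicted quality, while the default penalty pulls the choice back to the middle of the range. The chosen scale would then be passed to `cfg_combine` at each denoising step of the unchanged pretrained backbone.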