CV · LG · Jun 3

Can We Predict The Human Preference For Text-to-Image Content Prior To Generation And Is It Even Useful To Do So?

arXiv:2606.05478 · 49.0
Predicted impact top 72% in CV · last 90 days
Originality Synthesis-oriented
AI Analysis

For practitioners deploying diffusion models locally, this work offers a method to pre-select noise seeds and improve generation quality with negligible hardware overhead.

The paper investigates whether human preference scores for text-to-image generation can be predicted before generation, and whether such predictions can improve output quality. The authors find that both are possible with negligible hardware overhead.

Diffusion Models (DMs) have revolutionized text-driven generation by enabling the synthesis of high-quality, photorealistic visual content from user prompts. Whereas prior advances in visual generation such as VAEs and GANs were primarily evaluated on perceptual or visual similarity metrics such as FID and PSNR, DM advances have fostered the development of more advanced Human Preference Metrics (HPMs) that model and quantify human judgment as scalar values. However, DMs synthesize content using an inherently stochastic process in which random noise seeds generation. The initial random noise directly affects the quality of generated outputs, both qualitatively and quantitatively. This influence is pronounced in smaller models used in local deployment scenarios. Given this phenomenon, we first investigate to what extent we can predict scalar HPM scores prior to committing compute resources for generation. We then investigate to what extent we can leverage such prediction to improve the quality of generated images, and also study which HPMs are best suited for this task. Our investigation reveals that not only is this possible, but that it is achievable with negligible hardware overhead.
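The seed pre-selection idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not specify the predictor's architecture or inputs, so `predict_score` below is a hypothetical linear stand-in over crude noise statistics, and `sample_noise`, `select_seed`, the latent shape, the prompt embedding, and the weights are all illustrative assumptions. The sketch only shows the control flow: score several candidate seeds cheaply before generation, then commit compute to the highest-scoring one.

```python
import numpy as np

def sample_noise(seed, shape=(4, 64, 64)):
    """Draw the initial Gaussian latent for an integer seed (shape is an assumption)."""
    return np.random.default_rng(seed).standard_normal(shape).astype(np.float32)

def predict_score(noise, prompt_embedding, weights):
    """Hypothetical stand-in for a learned preference-score predictor.

    A real predictor would be trained on (noise, prompt, HPM score) data;
    here we use a linear score over four crude summary features.
    """
    features = np.array([
        noise.mean(),
        noise.std(),
        np.abs(noise).max(),
        prompt_embedding.mean(),
    ])
    return float(features @ weights)

def select_seed(candidate_seeds, prompt_embedding, weights):
    """Score every candidate seed before generation and return the best one."""
    scores = {
        seed: predict_score(sample_noise(seed), prompt_embedding, weights)
        for seed in candidate_seeds
    }
    best = max(scores, key=scores.get)
    return best, scores

# Placeholder inputs: a real system would use a text encoder output
# and trained predictor parameters instead of random values.
rng = np.random.default_rng(0)
prompt_embedding = rng.standard_normal(16).astype(np.float32)
weights = rng.standard_normal(4)

best_seed, scores = select_seed(range(8), prompt_embedding, weights)
print(best_seed, scores[best_seed])
```

Only the selected seed would then be passed to the diffusion sampler, so the extra cost is one predictor call per candidate rather than one full generation per candidate.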

