CV | AI | LG | Mar 27, 2023

Text-to-Image Diffusion Models are Zero-Shot Classifiers

Stanford
arXiv:2303.15233v2 | 171 citations | h-index: 16
Originality: Incremental advance
AI Analysis

This work addresses the limited exploration of text-to-image diffusion models on downstream tasks. It offers a new evaluation method that could influence vision-language model design, though the advance is incremental because it applies existing models to classification.

The authors set out to understand what knowledge text-to-image diffusion models capture. They propose evaluating the models as zero-shot classifiers, using denoising ability as a proxy for label likelihood. They find that the models perform competitively with CLIP on zero-shot classification datasets, achieve state-of-the-art results on shape/texture bias tests, and successfully perform attribute binding.

The excellent generative capabilities of text-to-image diffusion models suggest they learn informative representations of image-text data. However, what knowledge their representations capture is not fully understood, and they have not been thoroughly explored on downstream tasks. We investigate diffusion models by proposing a method for evaluating them as zero-shot classifiers. The key idea is using a diffusion model's ability to denoise a noised image given a text description of a label as a proxy for that label's likelihood. We apply our method to Stable Diffusion and Imagen, using it to probe fine-grained aspects of the models' knowledge and comparing them with CLIP's zero-shot abilities. They perform competitively with CLIP on a wide range of zero-shot image classification datasets. Additionally, they achieve state-of-the-art results on shape/texture bias tests and can successfully perform attribute binding while CLIP cannot. Although generative pre-training is prevalent in NLP, visual foundation models often use other methods such as contrastive learning. Based on our findings, we argue that generative pre-training should be explored as a compelling alternative for vision-language tasks.
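The key idea in the abstract, scoring each label by how well the model denoises a noised image when conditioned on that label's text, can be illustrated with a minimal sketch. This is not the authors' implementation: `classify_by_denoising`, `make_toy_denoiser`, the prompt strings, the noise-level range, and the unweighted mean squared error are all assumptions made for illustration, and the toy denoiser merely stands in for a real text-conditioned diffusion model such as Stable Diffusion or Imagen.

```python
import numpy as np

def classify_by_denoising(image, prompts, denoise_fn, num_trials=32, seed=0):
    """Return (index of best prompt, mean reconstruction error per prompt).

    denoise_fn(noised, prompt, t) is a hypothetical stand-in for a
    text-conditioned diffusion model that predicts the clean image.
    """
    rng = np.random.default_rng(seed)
    errors = np.zeros(len(prompts))
    for _ in range(num_trials):
        # Sample one noise level and one noise tensor, then reuse them for
        # every prompt so the prompts are compared on the same noised input
        # (a design choice of this sketch).
        t = rng.uniform(0.1, 0.9)
        noise = rng.standard_normal(image.shape)
        noised = np.sqrt(1.0 - t) * image + np.sqrt(t) * noise
        for k, prompt in enumerate(prompts):
            recon = denoise_fn(noised, prompt, t)
            errors[k] += np.mean((recon - image) ** 2)
    errors /= num_trials
    # Lower denoising error is treated as a proxy for higher label likelihood.
    return int(np.argmin(errors)), errors

def make_toy_denoiser(prototypes):
    """Toy denoiser (assumption): pulls the noised input toward a stored
    per-prompt prototype, more strongly at higher noise levels."""
    def denoise_fn(noised, prompt, t):
        return (1.0 - t) * noised + t * prototypes[prompt]
    return denoise_fn

# Hypothetical two-class setup with 8x8 "images".
prototypes = {
    "a photo of a cat": np.ones((8, 8)),
    "a photo of a dog": -np.ones((8, 8)),
}
prompts = list(prototypes)
image = prototypes["a photo of a cat"]
pred, errors = classify_by_denoising(image, prompts, make_toy_denoiser(prototypes))
print(prompts[pred], errors)
```

With a real diffusion model, `denoise_fn` would wrap a forward pass of the denoising network conditioned on the prompt, and the number of trials would trade accuracy against compute, since each trial costs one model call per candidate label.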

Code Implementations: 1 repo
