CV Aug 18, 2023

DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability

Runhui Huang, Jianhua Han, Guansong Lu, Xiaodan Liang, Yihan Zeng, Wei Zhang, Hang Xu

arXiv:2308.09306v1 6.8 10 citations h-index: 30

Originality: Incremental advance

AI Analysis

This work addresses the challenge of integrating generation and discrimination in multimodal AI, offering a unified approach that could benefit applications like content creation and retrieval, though it is incremental as it builds on existing diffusion and cross-modal models.

The paper tackles the problem of unifying generative and discriminative cross-modal learning by proposing DiffDis, a diffusion-based framework that jointly models image generation and image-text discrimination, resulting in a 1.65% improvement in zero-shot classification accuracy and a 2.42 improvement in FID for image synthesis.

Recently, large-scale diffusion models, e.g., Stable Diffusion and DALL-E 2, have shown remarkable results on image synthesis. On the other hand, large-scale cross-modal pre-trained models (e.g., CLIP, ALIGN, and FILIP) are competent for various downstream tasks by learning to align vision and language embeddings. In this paper, we explore the possibility of jointly modeling generation and discrimination. Specifically, we propose DiffDis to unify cross-modal generative and discriminative pretraining in a single framework under the diffusion process. DiffDis first formulates the image-text discriminative problem as a generative diffusion process of the text embedding from the text encoder, conditioned on the image. Then, we propose a novel dual-stream network architecture, which fuses the noisy text embedding with the knowledge of latent images from different scales for image-text discriminative learning. Moreover, the generative and discriminative tasks can efficiently share the image-branch network structure in the multi-modality model. Benefiting from diffusion-based unified training, DiffDis achieves both better generation ability and cross-modal semantic alignment in one architecture. Experimental results show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks, e.g., a 1.65% improvement in average zero-shot classification accuracy over 12 datasets and a 2.42 improvement in FID for zero-shot image synthesis.
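
The abstract frames discrimination as a diffusion process over the text embedding. As a rough illustration of what "noising a text embedding" means, the snippet below applies the standard DDPM forward step, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, to a toy vector. This is a minimal sketch under stated assumptions and not the authors' implementation: the linear noise schedule, the 1000-step horizon, the 4-dimensional embedding, and every function name here are hypothetical choices made for the example.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (a common DDPM default, assumed here)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Return alpha_bar_t, the running product of (1 - beta_s) for s <= t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_text_embedding(embedding, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) for a text embedding x_0 at step t."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in embedding]
    noisy = [signal * x + sigma * e for x, e in zip(embedding, noise)]
    return noisy, noise


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))

# Toy stand-in for a text-encoder output (assumption: real embeddings are much larger).
text_embedding = [0.5, -0.25, 0.75, 0.1]
noisy_embedding, target_noise = noise_text_embedding(text_embedding, 500, alpha_bars, rng)
```

In a setup like the one the abstract describes, a denoising network would receive `noisy_embedding`, the step index, and image features as conditioning, and would be trained to recover the clean embedding or the added noise. Which target the paper uses is not stated in the abstract, so the sketch stops at the forward step.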
