CV | Oct 27, 2022

SSD: Towards Better Text-Image Consistency Metric in Text-to-Image Generation

Zhaorui Tan, Xi Yang, Zihan Ye, Qiufeng Wang, Yuyao Yan, Anh Nguyen, Kaizhu Huang

arXiv:2210.15235v33.73 citations | h-index: 40 | Has Code

Originality: Incremental advance

AI Analysis

The work addresses a key bottleneck in text-to-image generation for applications that require reliable semantic alignment, though the advance is incremental because it builds on existing CLIP and GAN frameworks.

The paper tackles the problem of inaccurate text-image consistency metrics in GAN-based text-to-image generation by proposing a novel CLIP-based metric called Semantic Similarity Distance (SSD) and a new model, PDF-GAN, which improves consistency while maintaining image quality on CUB and COCO datasets.

Generating consistent and high-quality images from given texts is essential for visual-language understanding. Although impressive results have been achieved in generating high-quality images, text-image consistency is still a major concern in existing GAN-based methods. In particular, the most popular metric, $R$-precision, may not accurately reflect text-image consistency, often resulting in very misleading semantics in the generated images. Despite its significance, how to design a better text-image consistency metric surprisingly remains under-explored in the community. In this paper, we take a further step forward and develop a novel CLIP-based metric termed Semantic Similarity Distance ($SSD$), which is both theoretically grounded from a distributional viewpoint and empirically verified on benchmark datasets. Benefiting from the proposed metric, we further design the Parallel Deep Fusion Generative Adversarial Networks (PDF-GAN), which aims at improving text-image consistency by fusing semantic information at different granularities and capturing accurate semantics. Equipped with two novel plug-and-play components, Hard-Negative Sentence Constructor and Semantic Projection, the proposed PDF-GAN can mitigate inconsistent semantics and bridge the text-image semantic gap. A series of experiments shows that, compared with current state-of-the-art methods, our PDF-GAN leads to significantly better text-image consistency while maintaining decent image quality on the CUB and COCO datasets.
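For readers unfamiliar with the $R$-precision metric that the abstract criticizes, the following is a minimal sketch of how a retrieval-style consistency score is commonly computed (top-1 retrieval of the true caption among mismatched ones). It is not the paper's $SSD$ metric, whose formula is not given here. The function names `cosine` and `r_precision` are hypothetical, and the two-dimensional vectors are toy stand-ins for embeddings that a real image and text encoder would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def r_precision(image_embs, caption_embs, distractor_embs):
    """Fraction of images whose true caption outranks every distractor.

    image_embs[i] is paired with caption_embs[i]; distractor_embs[i] is a
    list of embeddings of mismatched captions for image i (assumed layout).
    """
    hits = 0
    for img, true_cap, distractors in zip(image_embs, caption_embs, distractor_embs):
        true_score = cosine(img, true_cap)
        # Count a hit only if the true caption is ranked first (R = 1).
        if all(true_score > cosine(img, d) for d in distractors):
            hits += 1
    return hits / len(image_embs)


# Toy data: the first image matches its caption, the second does not.
images = [[1.0, 0.0], [0.0, 1.0]]
captions = [[0.9, 0.1], [1.0, 0.2]]
distractors = [[[0.0, 1.0]], [[0.1, 1.0]]]

score = r_precision(images, captions, distractors)
print(score)
```

Because the score only records whether the true caption wins the ranking, it says nothing about how close the image is to the text in absolute terms, which is one way such a metric can be misleading.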

