CV · Feb 17, 2023

Fine-grained Cross-modal Fusion based Refinement for Text-to-Image Synthesis

arXiv:2302.08706v2 · 13 citations · h-index: 23 · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses semantic inconsistency in text-to-image generation, which matters for applications like content creation, though it is an incremental advance over existing refinement-based methods.

The paper tackles the problem of generating text-mismatched images in text-to-image synthesis, especially with complex descriptions, by proposing FF-GAN, which improves semantic consistency and detail refinement, achieving state-of-the-art results on CUB-200 and COCO datasets.

Text-to-image synthesis refers to generating visually realistic and semantically consistent images from given textual descriptions. Previous approaches generate an initial low-resolution image and then refine it to high resolution. Despite remarkable progress, these methods do not fully utilize the given texts and can generate text-mismatched images, especially when the text description is complex. We propose a novel Fine-grained text-image Fusion based Generative Adversarial Networks, dubbed FF-GAN, which consists of two modules: Fine-grained text-image Fusion Block (FF-Block) and Global Semantic Refinement (GSR). The proposed FF-Block integrates an attention block and several convolution layers to effectively fuse the fine-grained word-context features into the corresponding visual features, so that the text information is fully used to refine the initial image with more details. The GSR is proposed to improve the global semantic consistency between linguistic and visual features during the refinement process. Extensive experiments on the CUB-200 and COCO datasets demonstrate the superiority of FF-GAN over other state-of-the-art approaches in generating images that are semantically consistent with the given texts. Code is available at https://github.com/haoranhfut/FF-GAN.
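
To make the idea of fusing word-context features into visual features concrete, here is a minimal sketch of word-level attention in plain Python. It is not the authors' FF-Block: the function names (`word_context`, `fuse`), the dot-product scoring, and the concatenation step are all assumptions chosen for illustration, and the convolution layers mentioned in the abstract are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_context(region, words):
    """Attention-weighted sum of word vectors for one image region.

    region: list of d floats (visual feature at one spatial location)
    words:  list of word feature vectors, each with d floats
    Scoring by dot product is an assumption, not taken from the paper.
    """
    scores = [sum(r * w for r, w in zip(region, word)) for word in words]
    weights = softmax(scores)
    d = len(region)
    context = [
        sum(weights[t] * words[t][k] for t in range(len(words)))
        for k in range(d)
    ]
    return context, weights


def fuse(regions, words):
    """Hypothetical fusion: concatenate each region with its word context.

    A real model would pass the concatenated channels through
    convolution layers; that step is left out of this sketch.
    """
    fused = []
    for region in regions:
        context, _ = word_context(region, words)
        fused.append(region + context)  # list concatenation, length 2*d
    return fused


# Toy inputs: two image regions and two word features, dimension 2.
regions = [[1.0, 0.0], [0.0, 1.0]]
words = [[4.0, 0.0], [0.0, 4.0]]
fused = fuse(regions, words)
_, weights = word_context(regions[0], words)
```

In this toy setup each region attends mostly to the word whose feature points in the same direction, so the fused vector carries both the original visual feature and the text information most relevant to that location.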

Code Implementations: 1 repo