CV · AI · Mar 2, 2023

X&Fuse: Fusing Visual Information in Text-to-Image Generation

arXiv:2303.01000v1 · 6 citations · h-index: 53
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of improving text-to-image generation quality with visual cues, with applications in AI and creative tools. It is a strong incremental advance with a practical speed benefit: subject-driven generation runs more than 100x faster than textual inversion.

The paper tackles the problem of incorporating visual information into text-to-image generation by introducing X&Fuse, which improves results in three scenarios: conditioning on an image retrieved from an image bank, subject-driven generation from cropped-object images, and generation with oracle access to the image scene. On MS-COCO in zero-shot settings, it reports FID scores of 6.65 with retrieval (state of the art) and 5.03 with oracle scene access.

We introduce X&Fuse, a general approach for conditioning on visual information when generating images from text. We demonstrate the potential of X&Fuse in three different text-to-image generation scenarios. (i) When a bank of images is available, we retrieve and condition on a related image (Retrieve&Fuse), resulting in significant improvements on the MS-COCO benchmark and a state-of-the-art FID score of 6.65 in zero-shot settings. (ii) When cropped-object images are at hand, we utilize them to perform subject-driven generation (Crop&Fuse), outperforming the textual inversion method while being more than 100x faster. (iii) Having oracle access to the image scene (Scene&Fuse) allows us to achieve an FID score of 5.03 on MS-COCO in zero-shot settings. Our experiments indicate that X&Fuse is an effective, easy-to-adapt, simple, and general approach for scenarios in which the model may benefit from additional visual information.
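
The retrieval step in Retrieve&Fuse can be illustrated with a toy example. The abstract does not say how the related image is selected, so the sketch below assumes that the text prompt and the bank images are already embedded in a shared vector space and that the nearest image by cosine similarity is chosen. The function names, the embeddings, and the file names are hypothetical, and the fusion of the retrieved image into the generator is not shown.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_related_image(query_embedding, image_bank):
    """Return (image_id, score) for the bank image most similar to the query.

    image_bank maps an image id to its embedding. Assumption: the query
    (text prompt) and the images share one embedding space.
    """
    if not image_bank:
        raise ValueError("image bank is empty")
    best_id, best_score = None, -math.inf
    for image_id, embedding in image_bank.items():
        score = cosine_similarity(query_embedding, embedding)
        if score > best_score:
            best_id, best_score = image_id, score
    return best_id, best_score


# Toy 3-dimensional embeddings (hypothetical values, for illustration only).
bank = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "car.jpg": [0.0, 0.2, 0.9],
    "cat.jpg": [0.7, 0.6, 0.1],
}
query = [1.0, 0.2, 0.0]  # stands in for the embedded text prompt

best_id, score = retrieve_related_image(query, bank)
print(best_id, round(score, 3))
```

In a real pipeline the retrieved image would then be passed to the generator as an additional conditioning input alongside the text; a linear scan like this one is only practical for small banks.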
