X&Fuse: Fusing Visual Information in Text-to-Image Generation
This addresses the challenge of enhancing image generation quality with visual cues for applications in AI and creative tools, representing a strong incremental advance with practical speed improvements.
The paper tackles the problem of incorporating visual information into text-to-image generation by introducing X&Fuse, which improves performance in scenarios like image retrieval, subject-driven generation, and scene access, achieving state-of-the-art FID scores such as 6.65 and 5.03 on MS-COCO in zero-shot settings.
We introduce X&Fuse, a general approach for conditioning on visual information when generating images from text. We demonstrate the potential of X&Fuse in three different text-to-image generation scenarios. (i) When a bank of images is available, we retrieve and condition on a related image (Retrieve&Fuse), resulting in significant improvements on the MS-COCO benchmark, gaining a state-of-the-art FID score of 6.65 in zero-shot settings. (ii) When cropped-object images are at hand, we utilize them and perform subject-driven generation (Crop&Fuse), outperforming the textual inversion method while being more than x100 faster. (iii) Having oracle access to the image scene (Scene&Fuse), allows us to achieve an FID score of 5.03 on MS-COCO in zero-shot settings. Our experiments indicate that X&Fuse is an effective, easy-to-adapt, simple, and general approach for scenarios in which the model may benefit from additional visual information.