CV | AI | Jun 26, 2024

MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data

arXiv:2406.18790v2 | 5 citations
Originality: Incremental advance
AI Analysis

This work addresses multimodal image generation for applications such as style transfer and character consistency. It is an incremental advance that bootstraps its training data from existing text-to-image data.

The paper tackles image generation from multimodal prompts that interleave text and images. It bootstraps a dataset from text-image data and trains a model that composes inputs from different images into a coherent output, for example rendering a realistic person in a cartoon style or placing a standing subject on a scooter.

We train a model to generate images from multimodal prompts of interleaved text and images such as "a <picture of a man> man and his <picture of a dog> dog in an <picture of a cartoon> animated style." We bootstrap a multimodal dataset by extracting semantically meaningful image crops corresponding to words in the image captions of synthetically generated and publicly available text-image data. Our model, MUMU, is composed of a vision-language model encoder with a diffusion decoder and is trained on a single 8xH100 GPU node. Despite being only trained on crops from the same image, MUMU learns to compose inputs from different images into a coherent output. For example, an input of a realistic person and a cartoon will output the same person in the cartoon style, and an input of a standing subject and a scooter will output the subject riding the scooter. As a result, our model generalizes to tasks such as style transfer and character consistency. Our results show the promise of using multimodal models as general purpose controllers for image generation.
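The interleaved prompt format described in the abstract can be illustrated with a minimal sketch. It assumes that some upstream step has already matched caption words to bounding boxes in the image; the abstract does not say how that matching is done, so the `boxes` mapping, the `CropRef` placeholder, and the `interleave` and `render` helpers below are all hypothetical names introduced for illustration, not the authors' pipeline.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

# Hypothetical box format: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


@dataclass(frozen=True)
class CropRef:
    """Stand-in for an image crop; a real pipeline would carry pixel data."""
    word: str
    box: Box


Segment = Union[str, CropRef]


def interleave(caption: str, boxes: Dict[str, Box]) -> List[Segment]:
    """Place a crop placeholder before the first occurrence of each matched word.

    Assumption: `boxes` maps lowercase caption words to regions of the
    captioned image, produced by some detector that is not modeled here.
    """
    segments: List[Segment] = []
    used = set()
    for token in caption.split():
        key = token.strip(".,!?").lower()
        if key in boxes and key not in used:
            segments.append(CropRef(key, boxes[key]))
            used.add(key)
        segments.append(token)
    return segments


def render(segments: List[Segment]) -> str:
    """Render the sequence as text, mimicking the notation in the abstract."""
    return " ".join(
        f"<picture of a {s.word}>" if isinstance(s, CropRef) else s
        for s in segments
    )


if __name__ == "__main__":
    caption = "a man and his dog in a park"
    boxes = {"man": (10, 20, 110, 220), "dog": (120, 150, 200, 230)}
    print(render(interleave(caption, boxes)))
```

In this sketch every crop comes from the same captioned image, which mirrors the training setup the abstract describes. At inference time the placeholders could instead refer to crops taken from different images.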

