CV | AI | LG | May 24, 2023

MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation

arXiv:2305.15296v3 | 30 citations
Originality: Incremental advance
AI Analysis

This paper addresses the problem of generating images from nuanced concepts for users of text-to-image models, but the advance is incremental because it builds on existing pre-trained models.

The paper tackles the difficulty of expressing complex ideas in text alone by proposing MultiFusion, which fuses pre-trained models to accept multilingual, interleaved multimodal inputs. As a result, the image generation module can handle such inputs even though it was trained only on monomodal, single-language data.

The recent popularity of text-to-image diffusion models (DM) can largely be attributed to the intuitive interface they provide to users. The intended generation can be expressed in natural language, with the model producing faithful interpretations of text prompts. However, expressing complex or nuanced ideas in text alone can be difficult. To ease image generation, we propose MultiFusion that allows one to express complex and nuanced concepts with arbitrarily interleaved inputs of multiple modalities and languages. MultiFusion leverages pre-trained models and aligns them for integration into a cohesive system, thereby avoiding the need for extensive training from scratch. Our experimental results demonstrate the efficient transfer of capabilities from individual modules to the downstream model. Specifically, the fusion of all independent components allows the image generation module to utilize multilingual, interleaved multimodal inputs despite being trained solely on monomodal data in a single language.

Code Implementations: 1 repo
