MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost
MuLan provides a cost-effective solution for multilingual image generation, making it accessible for applications in diverse linguistic contexts. The contribution is incremental, however, since it builds on existing diffusion models and text encoders.
The paper tackles the high cost of multilingual image generation by introducing MuLan, a lightweight adapter that enables text-to-image generation in over 110 languages with minimal training cost, achieving a CLIP similarity score of 39.61 for other languages compared to 39.57 for English.
In this work, we explore a cost-effective framework for multilingual image generation. We find that, compared with tuning models on high-quality images with multilingual annotations, leveraging text encoders pre-trained on widely available, noisy Internet image-text pairs significantly enhances data efficiency in text-to-image (T2I) generation across multiple languages. Based on this insight, we introduce MuLan (Multi-Language adapter), a lightweight language adapter with fewer than 20M parameters, trained alongside a frozen text encoder and a frozen image diffusion model. Compared to previous multilingual T2I models, this framework offers: (1) Cost efficiency. Using readily accessible English data and off-the-shelf multilingual text encoders minimizes the training cost. (2) High performance. It achieves comparable generation capabilities in over 110 languages, with CLIP similarity scores nearly matching those in English (39.57 for English vs. 39.61 for other languages). (3) Broad applicability. It integrates seamlessly with compatible community tools such as LoRA, LCM, ControlNet, and IP-Adapter, expanding its potential use cases.
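To make the "fewer than 20M parameters" budget concrete, the following is a minimal sketch that counts the trainable weights of a small adapter placed between a frozen multilingual text encoder and a frozen diffusion model. The abstract does not describe the adapter's architecture, so the two-layer shape and the dimensions (`ENCODER_DIM`, `HIDDEN_DIM`, `CONDITION_DIM`) are illustrative assumptions, not MuLan's actual design; only the 20M budget comes from the text.

```python
from dataclasses import dataclass

# Budget stated in the abstract: the adapter has fewer than 20M parameters.
PARAM_BUDGET = 20_000_000

# Hypothetical dimensions, chosen only for illustration (not from the paper).
ENCODER_DIM = 1024    # assumed output width of the frozen multilingual text encoder
HIDDEN_DIM = 2048     # assumed hidden width of the adapter
CONDITION_DIM = 768   # assumed conditioning width expected by the diffusion model


@dataclass(frozen=True)
class LinearLayer:
    """Shape of one fully connected layer; holds no weights, only sizes."""
    in_dim: int
    out_dim: int
    bias: bool = True

    def num_params(self) -> int:
        # Weight matrix plus an optional bias vector.
        return self.in_dim * self.out_dim + (self.out_dim if self.bias else 0)


def adapter_param_count(layers: list[LinearLayer]) -> int:
    """Total trainable parameters: only the adapter trains, the rest is frozen."""
    return sum(layer.num_params() for layer in layers)


# Assumed two-layer adapter: encoder width -> hidden width -> conditioning width.
adapter = [
    LinearLayer(ENCODER_DIM, HIDDEN_DIM),
    LinearLayer(HIDDEN_DIM, CONDITION_DIM),
]

total = adapter_param_count(adapter)
print(f"adapter parameters: {total:,}")
print(f"within 20M budget: {total < PARAM_BUDGET}")
```

Because the text encoder and the diffusion model stay frozen, the adapter's own weights are the only ones that receive gradient updates, which is why the training cost scales with this count rather than with the size of the full pipeline.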