CVCL | May 19, 2023

Efficient Cross-Lingual Transfer for Chinese Stable Diffusion with Images as Pivots

arXiv:2305.11540v1 | 4 citations
Originality: Incremental advance
AI Analysis

This paper addresses the problem of making large-scale text-to-image synthesis accessible to Chinese users without extensive resources. The contribution is incremental: it adapts an existing model rather than creating a new paradigm.

The paper tackles the high computational cost and data requirements of training text-to-image diffusion models in non-English languages. It proposes IAP, a method that transfers English Stable Diffusion to Chinese and is reported to outperform strong Chinese diffusion models while using only 5% to 10% of the training data.

Diffusion models have made impressive progress in text-to-image synthesis. However, training such large-scale models (e.g., Stable Diffusion) from scratch requires high computational costs and massive numbers of high-quality text-image pairs, which becomes unaffordable in other languages. To handle this challenge, we propose IAP, a simple but effective method to transfer English Stable Diffusion into Chinese. IAP optimizes only a separate Chinese text encoder, with all other parameters fixed, to align the Chinese semantic space with the English one in CLIP. To achieve this, we treat images as pivots and minimize the distance between the attentive features produced by cross-attention between images and each language respectively. In this way, IAP efficiently establishes connections among Chinese, English, and visual semantics in CLIP's embedding space, advancing the quality of images generated from direct Chinese prompts. Experimental results show that our method outperforms several strong Chinese diffusion models with only 5% to 10% of the training data.
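The image-as-pivot objective described in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation: the abstract does not specify the attention direction, any learned projections, or the distance function, so the version below assumes image patch features act as queries over text token features, uses unprojected scaled dot-product attention, and takes mean squared error as the distance. All function names (`cross_attend`, `pivot_loss`) and the random stand-in features are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(image_feats, text_feats):
    """Assumed form: image patches (queries) attend over text tokens (keys and values)."""
    d = image_feats.shape[-1]
    scores = image_feats @ text_feats.T / np.sqrt(d)  # shape: (patches, tokens)
    return softmax(scores, axis=-1) @ text_feats      # shape: (patches, d)


def pivot_loss(image_feats, en_feats, zh_feats):
    """Distance between image-conditioned English and Chinese features (MSE is an assumption)."""
    attn_en = cross_attend(image_feats, en_feats)
    attn_zh = cross_attend(image_feats, zh_feats)
    return float(np.mean((attn_en - attn_zh) ** 2))


# Random stand-ins for encoder outputs; real features would come from CLIP.
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(16, 32))  # 16 image patches, feature width 32
en_feats = rng.normal(size=(7, 32))      # 7 English tokens (frozen encoder)
zh_feats = rng.normal(size=(5, 32))      # 5 Chinese tokens (trainable encoder)

loss_random = pivot_loss(image_feats, en_feats, zh_feats)
loss_aligned = pivot_loss(image_feats, en_feats, en_feats.copy())
```

Because both languages are pooled through the same image queries, the two attentive feature maps have the same shape even when the English and Chinese captions have different token counts. In a training setup along the lines the abstract describes, only the Chinese text encoder producing `zh_feats` would receive gradients from this loss.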
