CV | Jan 3, 2024

aMUSEd: An Open MUSE Reproduction

Suraj Patil, William Berman, Robin Rombach, Patrick von Platen

arXiv:2401.01808v121.830 citations | h-index: 15 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work targets efficient, interpretable text-to-image generation. Its contribution is incremental: it reproduces and adapts the existing MUSE approach rather than proposing a new one.

The authors develop aMUSEd, an open-source masked image model (MIM) for text-to-image generation based on MUSE. It uses only 10% of MUSE's parameters, is built for fast image generation, and is released as two checkpoints that directly produce images at 256x256 and 512x512 resolutions.

We present aMUSEd, an open-source, lightweight masked image model (MIM) for text-to-image generation based on MUSE. With 10 percent of MUSE's parameters, aMUSEd is focused on fast image generation. We believe MIM is under-explored compared to latent diffusion, the prevailing approach for text-to-image generation. Compared to latent diffusion, MIM requires fewer inference steps and is more interpretable. Additionally, MIM can be fine-tuned to learn additional styles with only a single image. We hope to encourage further exploration of MIM by demonstrating its effectiveness on large-scale text-to-image generation and releasing reproducible training code. We also release checkpoints for two models which directly produce images at 256x256 and 512x512 resolutions.
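The claim that MIM needs fewer inference steps can be illustrated with a toy version of iterative parallel decoding: every image token starts masked, and each step commits the most confident predictions while the rest stay masked. The sketch below is an illustration under stated assumptions, not the released aMUSEd code. The cosine schedule is the one commonly associated with MaskGIT-style decoding and is assumed here rather than taken from the abstract. The token count (256), step count (12), and vocabulary size (8192) are likewise illustrative, and `toy_predictor` is a hypothetical random stand-in for the text-conditioned transformer.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a still-masked token position


def cosine_mask_counts(num_tokens: int, num_steps: int) -> list[int]:
    """Number of tokens left masked after each step (assumed cosine schedule)."""
    return [
        math.floor(num_tokens * math.cos(math.pi / 2 * (step + 1) / num_steps))
        for step in range(num_steps)
    ]


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for the model: a random (token, confidence) pair per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def parallel_decode(num_tokens=256, num_steps=12, vocab_size=8192, seed=0):
    """Fill in all tokens over a fixed, small number of parallel steps."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    for keep_masked in cosine_mask_counts(num_tokens, num_steps):
        preds = toy_predictor(tokens, vocab_size, rng)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Commit the most confident predictions; the rest stay masked.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[: len(masked) - keep_masked]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    print(cosine_mask_counts(256, 12))  # masked tokens remaining per step
    print(parallel_decode().count(MASK))  # masked tokens left at the end
```

The schedule commits only a few tokens in the early steps and many in the later ones, so all positions are filled after the fixed number of steps regardless of how many tokens the image has.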
