CV AI LG · Mar 11, 2025

Aligning Text to Image in Diffusion Models is Easier Than You Think

Jaa-Yeon Lee, Byunghee Cha, Jeongsol Kim, Jong Chul Ye

arXiv:2503.08250v422.815 citations · h-index: 13 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses text-image alignment issues in generative models and offers an efficient solution for researchers and practitioners, though it is incremental in that it builds on existing representation alignment methods.

The paper tackles residual misalignment between text and image representations in diffusion models by proposing SoftREPA, a lightweight contrastive fine-tuning strategy that improves semantic consistency with fewer than 1M trainable parameters.

While recent advancements in generative modeling have significantly improved text-image alignment, some residual misalignment between text and image representations remains. Some approaches address this issue by fine-tuning models with preference optimization and similar techniques, which require tailored datasets. Orthogonal to these methods, we revisit the challenge from the perspective of representation alignment, an approach that has gained popularity with the success of REPresentation Alignment (REPA). We first argue that conventional text-to-image (T2I) diffusion models, typically trained on paired image and text data (i.e., positive pairs) by minimizing score matching or flow matching losses, are suboptimal from the standpoint of representation alignment. Instead, better alignment can be achieved through contrastive learning that draws both positive and negative pairs from the existing dataset. To enable efficient alignment with pretrained models, we propose SoftREPA, a lightweight contrastive fine-tuning strategy that leverages soft text tokens for representation alignment. This approach improves alignment with minimal computational overhead by adding fewer than 1M trainable parameters to the pretrained model. Our theoretical analysis demonstrates that our method explicitly increases the mutual information between text and image representations, leading to enhanced semantic consistency. Experimental results across text-to-image generation and text-guided image editing tasks validate the effectiveness of our approach in improving the semantic consistency of T2I generative models.
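The contrastive idea in the abstract, treating matched image-text pairs in a batch as positives and every other pairing as negatives, can be illustrated with a generic InfoNCE-style loss. This is only a sketch under stated assumptions: the abstract does not give SoftREPA's exact objective or say how its similarity scores are computed, so the `scores` matrix and the function name `info_nce_loss` below are hypothetical placeholders. For InfoNCE in general, log N minus the loss is a known lower bound on mutual information, which is the kind of quantity the abstract refers to.

```python
import math


def info_nce_loss(scores):
    """Batch-wise InfoNCE-style contrastive loss (illustrative only).

    scores[i][j] is a hypothetical similarity between image i and text j.
    Diagonal entries are the positive (matched) pairs; off-diagonal entries
    act as negatives taken from the same batch.
    """
    n = len(scores)
    total = 0.0
    for i, row in enumerate(scores):
        # Numerically stable log-sum-exp over all candidate texts for image i.
        m = max(row)
        log_denom = m + math.log(sum(math.exp(s - m) for s in row))
        # Negative log-probability of picking the matched text.
        total += log_denom - row[i]
    return total / n


# Toy score matrices (made-up numbers, not results from the paper).
aligned = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]
uniform = [[1.0, 1.0, 1.0] for _ in range(3)]

loss_aligned = info_nce_loss(aligned)
loss_uniform = info_nce_loss(uniform)
```

When the scores cannot tell matched from mismatched pairs, the loss equals log N (here log 3); it falls toward zero as matched pairs score higher than the in-batch negatives.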
