CVJun 9, 2025

Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers

Zhengyao Lv, Tianlin Pan, Chenyang Si, Zhaoxi Chen, Wangmeng Zuo, Ziwei Liu, Kwan-Yee K. Wong

arXiv:2506.07986v320.415 citationsh-index: 17Has Code

Originality Incremental advance

AI Analysis

This addresses a key limitation in text-to-image generation models for users needing high semantic fidelity, though it is incremental as it builds on existing models like FLUX and SD3.5.

The paper tackled the problem of imprecise text-image alignment in Multimodal Diffusion Transformers by proposing Temperature-Adjusted Cross-modal Attention (TACA), which improved alignment on benchmarks like T2I-CompBench with minimal computational overhead.

Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle with achieving precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT, namely 1) the suppression of cross-modal attention due to token imbalance between visual and textual modalities and 2) the lack of timestep-aware attention weighting, which hinder the alignment. To address these issues, we propose \textbf{Temperature-Adjusted Cross-modal Attention (TACA)}, a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models like FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention in improving semantic fidelity in text-to-image diffusion models. Our codes are publicly available at \href{https://github.com/Vchitect/TACA}

View on arXiv PDF Code

Similar