CVJul 22, 2025

Scale Your Instructions: Enhance the Instruction-Following Fidelity of Unified Image Generation Model by Self-Adaptive Attention Scaling

arXiv:2507.16240v13 citationsh-index: 9Has Code
Originality Incremental advance
AI Analysis

This addresses a specific bottleneck in multimodal AI for image generation and editing, making models more reliable for users, but it is incremental as it builds on existing unified frameworks.

The paper tackles the problem of text instruction neglect in unified image generation models, especially with multiple sub-instructions, by proposing Self-Adaptive Attention Scaling (SaaS), which improves instruction-following fidelity without extra training or optimization, showing superior results in experiments.

Recent advancements in unified image generation models, such as OmniGen, have enabled the handling of diverse image generation and editing tasks within a single framework, accepting multimodal, interleaved texts and images in free form. This unified architecture eliminates the need for text encoders, greatly reducing model complexity and standardizing various image generation and editing tasks, making it more user-friendly. However, we found that it suffers from text instruction neglect, especially when the text instruction contains multiple sub-instructions. To explore this issue, we performed a perturbation analysis on the input to identify critical steps and layers. By examining the cross-attention maps of these key steps, we observed significant conflicts between neglected sub-instructions and the activations of the input image. In response, we propose Self-Adaptive Attention Scaling (SaaS), a method that leverages the consistency of cross-attention between adjacent timesteps to dynamically scale the attention activation for each sub-instruction. Our SaaS enhances instruction-following fidelity without requiring additional training or test-time optimization. Experimental results on instruction-based image editing and visual conditional image generation validate the effectiveness of our SaaS, showing superior instruction-following fidelity over existing methods. The code is available https://github.com/zhouchao-ops/SaaS.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes