CV · Mar 25

HAM: A Training-Free Style Transfer Approach via Heterogeneous Attention Modulation for Diffusion Models

arXiv:2603.24043 · 37.2 · h-index: 6
AI Analysis

This paper addresses the problem of balancing style and content in image generation for users of diffusion models, offering an incremental improvement over existing methods.

The paper tackles the style-content trade-off in diffusion-based style transfer by proposing HAM, a training-free method that uses heterogeneous attention modulation to preserve content identity while capturing complex style references, achieving state-of-the-art performance on multiple metrics.

Diffusion models have demonstrated remarkable performance in image generation, particularly within the domain of style transfer. Prevailing style transfer approaches typically leverage the robust feature extraction capabilities of pre-trained diffusion models alongside external modular control pathways to explicitly impose style guidance signals. However, these methods often fail to capture complex style references or to retain the identity of user-provided content images, and so fall into the trap of the style-content trade-off. We therefore propose a training-free style transfer approach via $\textbf{h}$eterogeneous $\textbf{a}$ttention $\textbf{m}$odulation ($\textbf{HAM}$) that protects identity information during image- or text-guided style reference transfer, thereby addressing the style-content trade-off challenge. Specifically, we first introduce style noise initialization to initialize the latent noise for diffusion. Then, during the diffusion process, we apply HAM to different attention mechanisms, including Global Attention Regulation (GAR) and Local Attention Transplantation (LAT), which better preserves the details of the content image while capturing complex style references. Our approach is validated through a series of qualitative and quantitative experiments, achieving state-of-the-art performance on multiple quantitative metrics.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
