CV AI Feb 20

Dual-Channel Attention Guidance for Training-Free Image Editing Control in Diffusion Transformers

arXiv:2602.18022v16.95 citations h-index: 6

Originality: Incremental advance

AI Analysis

This work offers a training-free method for improving editing precision in diffusion-based image editing models. The advance is incremental, since it builds on existing attention manipulation techniques.

The paper tackles training-free control over editing intensity in diffusion-based image editing models. It proposes Dual-Channel Attention Guidance (DCAG), which manipulates both the Key and Value channels in Diffusion Transformers. On the PIE-Bench benchmark, DCAG improves on Key-only guidance, with LPIPS reductions of 4.9% for object deletion and 3.2% for object addition.

Training-free control over editing intensity is a critical requirement for diffusion-based image editing models built on the Diffusion Transformer (DiT) architecture. Existing attention manipulation methods focus exclusively on the Key space to modulate attention routing, leaving the Value space -- which governs feature aggregation -- entirely unexploited. In this paper, we first reveal that both Key and Value projections in DiT's multi-modal attention layers exhibit a pronounced bias-delta structure, where token embeddings cluster tightly around a layer-specific bias vector. Building on this observation, we propose Dual-Channel Attention Guidance (DCAG), a training-free framework that simultaneously manipulates both the Key channel (controlling where to attend) and the Value channel (controlling what to aggregate). We provide a theoretical analysis showing that the Key channel operates through the nonlinear softmax function, acting as a coarse control knob, while the Value channel operates through linear weighted summation, serving as a fine-grained complement. Together, the two-dimensional parameter space $(δ_k, δ_v)$ enables more precise editing-fidelity trade-offs than any single-channel method. Extensive experiments on the PIE-Bench benchmark (700 images, 10 editing categories) demonstrate that DCAG consistently outperforms Key-only guidance across all fidelity metrics, with the most significant improvements observed in localized editing tasks such as object deletion (4.9% LPIPS reduction) and object addition (3.2% LPIPS reduction).
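The dual-channel idea in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the abstract does not give the exact update rule, so the code assumes the layer bias is the per-layer mean over tokens and that guidance rescales each selected token's delta as `bias + (1 + strength) * (token - bias)`. The function names, the `token_ids` argument, and the single-query, single-head setup are all hypothetical choices made for illustration.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors (assumed stand-in for the layer bias)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def scale_delta(rows, strength, token_ids):
    """Rewrite the selected rows as bias + (1 + strength) * (row - bias).

    ASSUMPTION: this scaling rule is illustrative; the abstract only states that
    tokens cluster around a bias vector and that both channels are manipulated.
    """
    bias = mean_vector(rows)
    out = [list(r) for r in rows]
    for i in token_ids:
        out[i] = [b + (1.0 + strength) * (x - b) for x, b in zip(rows[i], bias)]
    return out


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, keys, values):
    """Single-query scaled dot-product attention; returns (output, weights)."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    weights = softmax(logits)
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return out, weights


def dual_channel_attention(q, keys, values, delta_k, delta_v, token_ids):
    """Apply the assumed delta scaling to both channels, then attend.

    delta_k changes the logits, so it acts through the nonlinear softmax.
    delta_v changes only the vectors being averaged, so it acts linearly.
    """
    guided_keys = scale_delta(keys, delta_k, token_ids)
    guided_values = scale_delta(values, delta_v, token_ids)
    return attention(q, guided_keys, guided_values)


# Toy example: three tokens, guidance applied to tokens 0 and 1.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.9, 0.2], [1.1, -0.1]]
values = [[0.5, 1.0], [0.4, 1.2], [0.7, 0.9]]
out, weights = dual_channel_attention(q, keys, values, 0.5, 0.25, [0, 1])
```

Under these assumptions, setting `delta_k = 0` leaves the attention weights untouched and makes the output change in exact proportion to `delta_v`, which matches the abstract's description of the Value channel as the fine-grained, linear control. A nonzero `delta_k` shifts the weights themselves through the softmax, which is why it behaves as the coarser knob.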
