GR AI CV Aug 12, 2025

Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer

arXiv:2508.09131v2 | 3 citations | h-index: 32
Originality: Incremental advance
AI Analysis

The paper addresses a fundamental, unsolved problem in image and video editing: fine-grained color control without any training. The contribution appears incremental, since it builds on existing multi-modal diffusion transformers.

The paper tackles text-guided color editing in images and videos, which requires precise manipulation of color attributes while preserving physical consistency. It presents ColorCtrl, a training-free method that reports state-of-the-art edit quality and consistency among training-free approaches, and that surpasses commercial models such as FLUX.1 Kontext Max and GPT-4o Image Generation in consistency.

Text-guided color editing in images and videos is a fundamental yet unsolved problem, requiring fine-grained manipulation of color attributes, including albedo, light source color, and ambient lighting, while preserving physical consistency in geometry, material properties, and light-matter interactions. Existing training-free methods offer broad applicability across editing tasks but struggle with precise color control and often introduce visual inconsistency in both edited and non-edited regions. In this work, we present ColorCtrl, a training-free color editing method that leverages the attention mechanisms of modern Multi-Modal Diffusion Transformers (MM-DiT). By disentangling structure and color through targeted manipulation of attention maps and value tokens, our method enables accurate and consistent color editing, along with word-level control of attribute intensity. Our method modifies only the intended regions specified by the prompt, leaving unrelated areas untouched. Extensive experiments on both SD3 and FLUX.1-dev demonstrate that ColorCtrl outperforms existing training-free approaches and achieves state-of-the-art performance in both edit quality and consistency. Furthermore, our method surpasses strong commercial models such as FLUX.1 Kontext Max and GPT-4o Image Generation in terms of consistency. When extended to video models like CogVideoX, our approach shows even greater advantages, particularly in maintaining temporal coherence and editing stability. Finally, our method also generalizes to instruction-based editing diffusion models such as Step1X-Edit and FLUX.1 Kontext dev, further demonstrating its versatility.
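The abstract mentions manipulating attention maps and value tokens but gives no implementation details. The sketch below is one plausible reading of that idea on a toy single-head attention layer, not the authors' method: it reuses the source attention map to keep structure, takes target value tokens only inside an edit region to change color, and rescales the attention paid to one token as a stand-in for word-level intensity control. All function names, shapes, and the edit mask are hypothetical and chosen for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def attention_map(q, k):
    """Scaled dot-product attention weights, shape (n_query, n_key)."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))


def blend_values(v_src, v_tgt, edit_mask):
    """Take target values inside the edit region and source values elsewhere."""
    return np.where(edit_mask[:, None], v_tgt, v_src)


def edited_output(attn_src, v_src, v_tgt, edit_mask):
    """Reuse the source attention map (structure) with blended values (color)."""
    return attn_src @ blend_values(v_src, v_tgt, edit_mask)


def reweight_token(attn, token_idx, scale):
    """Scale the attention paid to one key token, then renormalize each row."""
    out = attn.copy()
    out[:, token_idx] *= scale
    return out / out.sum(axis=-1, keepdims=True)


# Toy data: random projections stand in for real MM-DiT activations (assumption).
rng = np.random.default_rng(0)
n_tokens, dim = 6, 4
q_src = rng.normal(size=(n_tokens, dim))
k_src = rng.normal(size=(n_tokens, dim))
v_src = rng.normal(size=(n_tokens, dim))
v_tgt = rng.normal(size=(n_tokens, dim))
edit_mask = np.array([False, False, True, True, False, False])  # hypothetical edit region

attn_src = attention_map(q_src, k_src)
out = edited_output(attn_src, v_src, v_tgt, edit_mask)
print(out.shape)
```

With an all-False mask the output reduces to the unedited source attention output, which is the sense in which unrelated regions stay untouched in this toy version. How the real method derives the mask and which layers or timesteps it modifies is not stated in the abstract.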

