CV · Apr 29, 2025

In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer

arXiv:2504.20690v3 · 161 citations · h-index: 26
Originality: Incremental advance
AI Analysis

This paper addresses the high computational cost and data requirements of precise image editing, offering a more efficient approach for computer vision and AI applications. It appears incremental, however, because it builds on existing diffusion transformer frameworks.

The paper tackles the precision-efficiency tradeoff in instruction-based image editing by proposing ICEdit, which achieves state-of-the-art performance with only 0.1% of the training data and 1% of the trainable parameters used by previous methods.

Instruction-based image editing enables precise modifications via natural language prompts, but existing methods face a precision-efficiency tradeoff: fine-tuning demands massive datasets (>10M) and computational resources, while training-free approaches suffer from weak instruction comprehension. We address this by proposing ICEdit, which leverages the inherent comprehension and generation abilities of large-scale Diffusion Transformers (DiTs) through three key innovations: (1) an in-context editing paradigm without architectural modifications; (2) minimal parameter-efficient fine-tuning for quality improvement; (3) Early Filter Inference-Time Scaling, which uses VLMs to select high-quality noise samples for efficiency. Experiments show that ICEdit achieves state-of-the-art editing performance with only 0.1% of the training data and 1% of the trainable parameters used by previous methods. Our approach establishes a new paradigm for balancing precision and efficiency in instructional image editing. Code and demos are available at https://river-zhang.github.io/ICEdit-gh-pages/.
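The third innovation, selecting noise samples with a VLM, can be illustrated with a rough sketch. The abstract does not describe the procedure in detail, so the code below rests on assumptions: that each candidate noise sample is denoised for only a few steps, that a scorer (standing in for the VLM) rates each preview, and that only the best candidate goes on to full generation. The function `early_filter_select` and the toy stand-ins `toy_noise`, `toy_partial_denoise`, and `toy_score` are hypothetical names for illustration and are not taken from the ICEdit code.

```python
import random
from typing import Callable, List, Tuple, TypeVar

Latent = TypeVar("Latent")


def early_filter_select(
    make_noise: Callable[[int], Latent],
    partial_denoise: Callable[[Latent], Latent],
    score: Callable[[Latent], float],
    seeds: List[int],
) -> Tuple[int, float]:
    """Return (best_seed, best_score) after a cheap preview of every seed.

    Hypothetical sketch: `partial_denoise` stands in for a few denoising
    steps of the diffusion model, and `score` stands in for a VLM judging
    how well the preview follows the edit instruction.
    """
    if not seeds:
        raise ValueError("need at least one candidate seed")
    best_seed, best_score = seeds[0], float("-inf")
    for seed in seeds:
        preview = partial_denoise(make_noise(seed))
        candidate_score = score(preview)
        if candidate_score > best_score:
            best_seed, best_score = seed, candidate_score
    return best_seed, best_score


# Toy stand-ins so the sketch runs without any model (assumptions only).
def toy_noise(seed: int) -> float:
    # Deterministic "initial noise": one Gaussian draw per seed.
    return random.Random(seed).gauss(0.0, 1.0)


def toy_partial_denoise(latent: float) -> float:
    # Pretend a few denoising steps shrink the latent toward zero.
    return 0.5 * latent


def toy_score(preview: float) -> float:
    # Pretend the scorer prefers previews close to a target value of 1.0.
    return -abs(preview - 1.0)


if __name__ == "__main__":
    seed, value = early_filter_select(
        toy_noise, toy_partial_denoise, toy_score, seeds=list(range(8))
    )
    print(f"selected seed {seed} with preview score {value:.3f}")
```

In a real pipeline the selected seed would then be denoised for the full schedule. Under these assumptions the saving comes from spending the full step count on one candidate instead of on all of them.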

