CVOct 31, 2025

Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing

Yijia Wang, Yiqing Shen, Weiming Chen, Zhihai He

arXiv:2510.27335v13.6Has Code

Originality Highly original

AI Analysis

This addresses the high computational cost of fine-tuning models for complex image editing, offering a more efficient solution for users in image processing.

The paper tackles the problem of complex image editing by converting user instructions into simple actions using LLM reasoning, eliminating the need for joint fine-tuning of LLMs and diffusion models, and achieves a 9.955 dB improvement in PSNR over previous methods.

Existing image editing methods can handle simple editing instructions very well. To deal with complex editing instructions, they often need to jointly fine-tune the large language models (LLMs) and diffusion models (DMs), which involves very high computational complexity and training cost. To address this issue, we propose a new method, called \textbf{C}omplex \textbf{I}mage \textbf{E}diting via \textbf{L}LM \textbf{R}easoning (CIELR), which converts a complex user instruction into a set of simple and explicit editing actions, eliminating the need for jointly fine-tuning the large language models and diffusion models. Specifically, we first construct a structured semantic representation of the input image using foundation models. Then, we introduce an iterative update mechanism that can progressively refine this representation, obtaining a fine-grained visual representation of the image scene. This allows us to perform complex and flexible image editing tasks. Extensive experiments on the SmartEdit Reasoning Scenario Set show that our method surpasses the previous state-of-the-art by 9.955 dB in PSNR, indicating its superior preservation of regions that should remain consistent. Due to the limited number of samples of public datasets of complex image editing with reasoning, we construct a benchmark named CIEBench, containing 86 image samples, together with a metric specifically for reasoning-based image editing. CIELR also outperforms previous methods on this benchmark. The code and dataset are available at \href{https://github.com/Jia-shao/Reasoning-Editing}{https://github.com/Jia-shao/Reasoning-Editing}.

View on arXiv PDF Code

Similar