CV
May 23, 2025

R-Genie: Reasoning-Guided Generative Image Editing

arXiv:2505.17768v2
4 citations
h-index: 4
Originality: Highly original
AI Analysis

This work addresses the need for image synthesis that understands implicit user intentions and contextual reasoning. It proposes a new editing paradigm rather than an incremental improvement.

The paper addresses a limitation of current image editing methods: they depend on explicit textual instructions and support only a limited set of operations. It introduces a reasoning-guided generative editing paradigm that synthesizes images from complex, multi-faceted textual queries. The resulting system, R-Genie, combines diffusion models with multimodal large language models and is validated through experiments on a dataset of over 1,000 image-instruction-edit triples.

While recent advances in image editing have enabled impressive visual synthesis capabilities, current methods remain constrained by explicit textual instructions and limited editing operations, and they lack deep comprehension of implicit user intentions and contextual reasoning. In this work, we introduce a new image editing paradigm: reasoning-guided generative editing, which synthesizes images based on complex, multi-faceted textual queries that require world knowledge and intention inference. To facilitate this task, we first construct a comprehensive dataset featuring over 1,000 image-instruction-edit triples that incorporate rich reasoning contexts and real-world knowledge. We then propose R-Genie, a reasoning-guided generative image editor that combines the generation power of diffusion models with the advanced reasoning capabilities of multimodal large language models. R-Genie incorporates a reasoning-attention mechanism to bridge linguistic understanding with visual synthesis, enabling it to handle intricate editing requests involving abstract user intentions and contextual reasoning relations. Extensive experimental results validate that R-Genie can equip diffusion models with advanced reasoning-based editing capabilities, unlocking new potential for intelligent image synthesis.
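The abstract does not say how the reasoning-attention mechanism is implemented. As a rough intuition for how language-side reasoning features could condition visual features, here is a minimal sketch of generic scaled dot-product cross-attention. It is not R-Genie's actual design. The names `visual_tokens`, `reasoning_tokens`, and `cross_attention` are hypothetical, and the learned query/key/value projections that a real model would use are omitted for brevity.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention(visual_tokens: List[Vector],
                    reasoning_tokens: List[Vector]) -> List[Vector]:
    """Generic cross-attention (illustrative assumption, not the paper's method).

    Each visual token acts as a query; the reasoning tokens serve as both
    keys and values. No learned projections are applied.
    """
    dim = len(reasoning_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in visual_tokens:
        # Attention weights of this visual token over all reasoning tokens.
        weights = softmax([dot(query, key) * scale for key in reasoning_tokens])
        # Weighted average of the reasoning tokens.
        mixed = [sum(w * value[i] for w, value in zip(weights, reasoning_tokens))
                 for i in range(dim)]
        outputs.append(mixed)
    return outputs


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0]]        # toy image-patch features
    reasoning = [[2.0, 0.0], [0.0, 2.0]]     # toy MLLM reasoning features
    for row in cross_attention(visual, reasoning):
        print(row)
```

In this toy setup each visual token's output is a weighted average of the reasoning tokens, with more weight on the tokens it aligns with. In a diffusion-based editor, a layer of this kind would typically sit inside the denoising network, but where R-Genie places it is not stated in the abstract.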

