CV | Nov 26, 2025

MIRA: Multimodal Iterative Reasoning Agent for Image Editing

arXiv:2511.21087v2 | 5 citations | Has Code
Originality: Highly original
AI Analysis

This paper addresses the challenge of accurately interpreting user instructions in image editing. It is a novel method for a known bottleneck rather than a foundational advance.

The paper tackles the difficulty that diffusion-based image editing models have with complex user instructions. It proposes MIRA, a lightweight multimodal reasoning agent that uses an iterative perception-reasoning-action loop to predict atomic edit instructions step by step. When paired with open-source editing models, MIRA achieves performance comparable to or exceeding that of proprietary systems.

Instruction-guided image editing offers an intuitive way for users to edit images with natural language. However, diffusion-based editing models often struggle to accurately interpret complex user instructions, especially those involving compositional relationships, contextual cues, or referring expressions, leading to edits that drift semantically or fail to reflect the intended changes. We tackle this problem by proposing MIRA (Multimodal Iterative Reasoning Agent), a lightweight, plug-and-play multimodal reasoning agent that performs editing through an iterative perception-reasoning-action loop, effectively simulating multi-turn human-model interaction processes. Instead of issuing a single prompt or static plan, MIRA predicts atomic edit instructions step by step, using visual feedback to make its decisions. Our 150K multimodal tool-use dataset, MIRA-Editing, combined with a two-stage SFT + GRPO training pipeline, enables MIRA to perform reasoning and editing over complex editing instructions. When paired with open-source image editing models such as Flux.1-Kontext, Step1X-Edit, and Qwen-Image-Edit, MIRA significantly improves both semantic consistency and perceptual quality, achieving performance comparable to or exceeding proprietary systems such as GPT-Image and Nano-Banana.
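The iterative loop described in the abstract can be illustrated with a minimal Python sketch. This is not MIRA's implementation: `run_edit_loop`, `toy_agent`, `toy_editor`, the `max_steps` cap, and the string stand-in for an image are all hypothetical names chosen for illustration. The sketch only shows the control flow, in which the agent looks at the current image, emits one atomic instruction, the editor applies it, and the result is fed back until the agent decides to stop.

```python
from typing import Callable, List, Optional, Tuple

# All names here are hypothetical stand-ins, not MIRA's actual interfaces.
Image = str  # placeholder: a real system would pass pixel data

# Agent: (goal, current image, history) -> next atomic instruction, or None to stop.
Agent = Callable[[str, Image, List[str]], Optional[str]]
# Editor: (image, atomic instruction) -> edited image.
Editor = Callable[[Image, str], Image]


def run_edit_loop(goal: str, image: Image, agent: Agent, editor: Editor,
                  max_steps: int = 8) -> Tuple[Image, List[str]]:
    """Apply atomic edits one at a time, re-inspecting the image after each."""
    history: List[str] = []
    for _ in range(max_steps):  # assumed safety cap on the number of steps
        step = agent(goal, image, history)  # perception + reasoning
        if step is None:  # agent judges the goal satisfied
            break
        image = editor(image, step)  # action
        history.append(step)  # feedback for the next iteration
    return image, history


def toy_agent(goal: str, image: Image, history: List[str]) -> Optional[str]:
    """Toy policy: pick the first sub-edit not yet visible in the image."""
    for part in (p.strip() for p in goal.split(";")):
        if part and f"[{part}]" not in image:
            return part
    return None


def toy_editor(image: Image, instruction: str) -> Image:
    """Toy editor: record the applied instruction in the image string."""
    return f"{image}[{instruction}]"


final_image, steps = run_edit_loop(
    "remove the cup; make the sky orange", "photo", toy_agent, toy_editor
)
print(steps)
print(final_image)
```

In a real setting the agent would be the multimodal reasoning model and the editor an image editing model such as those named in the abstract. The point of the sketch is that the next instruction is chosen from the edited image rather than from a plan fixed up front.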

