CV · Sep 29, 2023

Guiding Instruction-based Image Editing via Multimodal Large Language Models

Microsoft
arXiv:2309.17102v2 · 178 citations · h-index: 74
Originality: Highly original
AI Analysis

This work addresses a key limitation of image editing for users who want flexible control through natural commands. It proposes a novel method for a known bottleneck rather than an incremental improvement.

The paper tackles instruction-based image editing, where human instructions are often too brief for current methods to capture and follow. It introduces MLLM-Guided Image Editing (MGIE), which learns to derive expressive instructions and provide explicit guidance, yielding notable improvements in automatic metrics and human evaluation while maintaining competitive inference efficiency.

Instruction-based image editing improves the controllability and flexibility of image manipulation via natural commands without elaborate descriptions or regional masks. However, human instructions are sometimes too brief for current methods to capture and follow. Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation via LMs. We investigate how MLLMs facilitate edit instructions and present MLLM-Guided Image Editing (MGIE). MGIE learns to derive expressive instructions and provides explicit guidance. The editing model jointly captures this visual imagination and performs manipulation through end-to-end training. We evaluate various aspects of Photoshop-style modification, global photo optimization, and local editing. Extensive experimental results demonstrate that expressive instructions are crucial to instruction-based image editing, and our MGIE can lead to a notable improvement in automatic metrics and human evaluation while maintaining competitive inference efficiency.

Code Implementations: 2 repos
