CV, CL, LG · Nov 28, 2023

Optimisation-Based Multi-Modal Semantic Image Editing

arXiv:2311.16882v1 · h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses the need for more precise and flexible image editing tools for users, though it is incremental as it builds on existing editing frameworks by extending them to multi-modal inputs.

The paper tackles the problem of limited precision and accuracy in text-based image editing by proposing an inference-time optimization method that supports multiple instruction types like spatial layout, pose, and scribbles, achieving complex edits as demonstrated through qualitative and quantitative experiments.

Image editing affords increased control over the aesthetics and content of generated images. Pre-existing works focus predominantly on text-based instructions to achieve desired image modifications, which limits edit precision and accuracy. In this work, we propose an inference-time editing optimisation, designed to extend beyond textual edits to accommodate multiple editing instruction types (e.g. spatial layout-based instructions such as pose, scribbles and edge maps). We propose to disentangle the editing task into two competing subtasks: successful local image modifications and global content consistency preservation, where each subtask is guided by a dedicated loss function. By allowing the influence of each loss function to be adjusted, we build a flexible editing solution that can be tuned to user preferences. We evaluate our method using text, pose and scribble edit conditions, and highlight our ability to achieve complex edits, through both qualitative and quantitative experiments.
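
The trade-off between the two subtasks can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not specify the loss functions or what is being optimised, so the example below assumes a 1-D "image" of floats, a squared-error edit loss restricted to a mask, a squared-error preservation loss over the whole image, and plain gradient descent. All names (`edit_loss`, `preserve_loss`, `optimise`, `w_edit`, `w_preserve`) are hypothetical.

```python
# Toy illustration only: two competing losses balanced by user-set weights.
# The real method operates at inference time on a generative model; the
# squared-error losses and 1-D "image" here are assumptions for clarity.

def edit_loss(x, target, mask):
    """Local term: squared error to the edit target where mask is 1."""
    return sum(m * (xi - ti) ** 2 for xi, ti, m in zip(x, target, mask))

def preserve_loss(x, source):
    """Global term: squared error to the source image everywhere."""
    return sum((xi - si) ** 2 for xi, si in zip(x, source))

def optimise(source, target, mask, w_edit=1.0, w_preserve=1.0,
             lr=0.02, steps=500):
    """Minimise w_edit * edit_loss + w_preserve * preserve_loss."""
    x = list(source)  # start from the unedited image
    for _ in range(steps):
        # Analytic gradient of the weighted sum of the two losses.
        grad = [
            2 * w_edit * m * (xi - ti) + 2 * w_preserve * (xi - si)
            for xi, ti, si, m in zip(x, target, source, mask)
        ]
        x = [xi - lr * g for xi, g in zip(x, grad)]
    return x

source = [0.0, 0.0, 0.0, 0.0]
target = [1.0, 1.0, 1.0, 1.0]
mask = [0, 1, 1, 0]  # only the two middle "pixels" should be edited

balanced = optimise(source, target, mask, w_edit=1.0, w_preserve=1.0)
edit_heavy = optimise(source, target, mask, w_edit=9.0, w_preserve=1.0)
```

In this toy setting, pixels outside the mask stay at their source values, while masked pixels settle at a weighted average of source and target. Raising `w_edit` relative to `w_preserve` pulls the masked region closer to the target, which mirrors the user-adjustable balance the abstract describes.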
