CVAug 2, 2023

ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation

Yasheng Sun, Yifan Yang, Houwen Peng, Yifei Shen, Yuqing Yang, Han Hu, Lili Qiu, Hideki Koike

arXiv:2308.00906v123.862 citationsh-index: 32

Originality Incremental advance

AI Analysis

This addresses the problem of ambiguity in language-guided image editing for users, offering a more accessible and accurate alternative, though it builds incrementally on existing inpainting techniques.

The paper tackles the challenge of accurately reflecting human intentions in image manipulation by proposing ImageBrush, a method that uses visual instructions instead of language to achieve more precise editing, and demonstrates robust generalization across tasks like pose transfer and image translation.

While language-guided image manipulation has made remarkable progress, the challenge of how to instruct the manipulation process faithfully reflecting human intentions persists. An accurate and comprehensive description of a manipulation task using natural language is laborious and sometimes even impossible, primarily due to the inherent uncertainty and ambiguity present in linguistic expressions. Is it feasible to accomplish image manipulation without resorting to external cross-modal language information? If this possibility exists, the inherent modality gap would be effortlessly eliminated. In this paper, we propose a novel manipulation methodology, dubbed ImageBrush, that learns visual instructions for more accurate image editing. Our key idea is to employ a pair of transformation images as visual instructions, which not only precisely captures human intention but also facilitates accessibility in real-world scenarios. Capturing visual instructions is particularly challenging because it involves extracting the underlying intentions solely from visual demonstrations and then applying this operation to a new image. To address this challenge, we formulate visual instruction learning as a diffusion-based inpainting problem, where the contextual information is fully exploited through an iterative process of generation. A visual prompting encoder is carefully devised to enhance the model's capacity in uncovering human intent behind the visual instructions. Extensive experiments show that our method generates engaging manipulation results conforming to the transformations entailed in demonstrations. Moreover, our model exhibits robust generalization capabilities on various downstream tasks such as pose transfer, image translation and video inpainting.

View on arXiv PDF

Similar