Towards Efficient Exemplar Based Image Editing with Multimodal VLMs
This work addresses image edits that are ambiguous or cumbersome for users to specify through text by providing an efficient, optimization-free method, though the contribution is incremental because it builds on existing diffusion models and VLMs.
The paper tackles exemplar-based image editing by using pretrained text-to-image diffusion models and multimodal VLMs to transfer edits from exemplar pairs to content images, achieving better performance than baselines on multiple edit types while being about 4 times faster.
Text-to-image diffusion models have enabled a wide array of image editing applications. However, capturing all types of edits through text alone can be challenging and cumbersome. The ambiguous nature of certain image edits is better expressed through an exemplar pair, i.e., a pair of images depicting an image before and after an edit. In this work, we tackle exemplar-based image editing, the task of transferring an edit from an exemplar pair to one or more content images, by leveraging pretrained text-to-image diffusion models and multimodal VLMs. Even though our end-to-end pipeline is optimization-free, our experiments demonstrate that it still outperforms baselines on multiple types of edits while being ~4x faster.