Inst-Inpaint: Instructing to Remove Objects with Diffusion Models
This work addresses mask creation in image editing, a step that is time-consuming and error-prone for users. The contribution is incremental, however, since it builds on existing diffusion and GAN methods.
The authors tackle object removal in image inpainting with a method driven by natural language instructions instead of binary masks, which eliminates manual mask generation. They introduce a new dataset, GQA-Inpaint, and a framework, Inst-Inpaint, and report significant quantitative and qualitative improvements over GAN and diffusion-based baselines.
Image inpainting refers to erasing unwanted pixels from an image and filling them in a semantically consistent and realistic way. Traditionally, the pixels to be erased are defined with binary masks. From an application point of view, a user needs to generate masks for the objects they would like to remove, which can be time-consuming and error-prone. In this work, we are interested in an image inpainting algorithm that estimates which object to remove based on natural language input and simultaneously removes it. For this purpose, we first construct a dataset for this task, named GQA-Inpaint. Second, we present a novel inpainting framework, Inst-Inpaint, that can remove objects from images based on instructions given as text prompts. We set up various GAN and diffusion-based baselines and run experiments on synthetic and real image datasets. We compare the methods using evaluation metrics that measure the quality and accuracy of the models, and show significant quantitative and qualitative improvements.
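To make the traditional mask-based formulation concrete, the following is a minimal sketch of how a binary mask selects the pixels to be replaced. It is not the Inst-Inpaint method: the function name `composite_with_mask` is hypothetical, the image is a toy grayscale grid, and the constant `fill` is an assumed stand-in for whatever content an inpainting model would generate.

```python
def composite_with_mask(image, mask, fill):
    """Return a new image that takes `fill` where mask == 1 and keeps
    the original pixel where mask == 0. All inputs are equally sized
    2D lists (a toy grayscale image)."""
    return [
        [f if m else p for p, m, f in zip(img_row, mask_row, fill_row)]
        for img_row, mask_row, fill_row in zip(image, mask, fill)
    ]


# Toy 4x4 image of ones; the user marks the central 2x2 block for removal.
image = [[1.0] * 4 for _ in range(4)]
mask = [[1 if 1 <= r <= 2 and 1 <= c <= 2 else 0 for c in range(4)]
        for r in range(4)]

# Assumption: a constant fill stands in for model-generated content.
fill = [[0.0] * 4 for _ in range(4)]

result = composite_with_mask(image, mask, fill)
```

In this formulation the user must supply `mask` by hand. An instruction-based approach, as described in the abstract, would instead take the image and a text prompt and would have to infer the region to erase itself.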