Specify and Edit: Overcoming Ambiguity in Text-Based Image Editing
This work addresses ambiguous user instructions in text-based image editing with a zero-shot method that improves editing accuracy and output diversity. The contribution is incremental, since it builds on existing diffusion-based editing models.
The paper tackles the limited performance of text-based image editing diffusion models when user instructions are ambiguous. It proposes SANE, a zero-shot inference pipeline that uses an LLM to decompose an ambiguous instruction into specific ones and combines them with the original instruction through a novel denoising guidance strategy. Experiments with three baselines on two datasets show improved performance, together with better interpretability and greater output diversity.
Text-based editing diffusion models exhibit limited performance when the user's input instruction is ambiguous. To solve this problem, we propose $\textit{Specify ANd Edit}$ (SANE), a zero-shot inference pipeline for diffusion-based editing systems. We use a large language model (LLM) to decompose the input instruction into specific instructions, i.e., well-defined interventions to apply to the input image to satisfy the user's request. We benefit from the LLM-derived instructions alongside the original one, thanks to a novel denoising guidance strategy specifically designed for the task. Our experiments with three baselines on two datasets demonstrate the benefits of SANE in all setups. Moreover, our pipeline improves the interpretability of editing models and boosts the output diversity. We also demonstrate that our approach can be applied to any edit, whether ambiguous or not. Our code is public at https://github.com/fabvio/SANE.
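The two stages described above can be illustrated with a minimal sketch. It is not the SANE implementation: the abstract does not give the guidance formula, so the combination rule below is an assumed, generic classifier-free-guidance composition. The names `decompose_instruction`, `combine_guidance`, `w_original`, and `w_specific` are hypothetical, and the LLM is represented by any callable that maps a prompt string to a response string.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def decompose_instruction(instruction: str, llm: Callable[[str], str]) -> List[str]:
    """Ask an LLM for specific edits, expected one per line (hypothetical prompt)."""
    prompt = (
        "Rewrite the following image-editing instruction as a list of "
        "specific edits, one per line:\n" + instruction
    )
    specific = []
    for line in llm(prompt).splitlines():
        cleaned = line.strip("-* \t")  # drop list markers and padding
        if cleaned:
            specific.append(cleaned)
    return specific


def combine_guidance(
    eps_uncond: Sequence[float],
    eps_original: Sequence[float],
    eps_specific: Sequence[Sequence[float]],
    w_original: float = 7.5,
    w_specific: float = 7.5,
) -> Vector:
    """Assumed rule (not taken from the paper): add the guidance direction of
    the original instruction and the mean guidance direction of the specific
    instructions to the unconditional noise estimate."""
    combined = []
    for k, base in enumerate(eps_uncond):
        value = base + w_original * (eps_original[k] - base)
        if eps_specific:
            mean_dir = sum(e[k] - base for e in eps_specific) / len(eps_specific)
            value += w_specific * mean_dir
        combined.append(value)
    return combined


if __name__ == "__main__":
    # Stand-in for a real LLM call, used only to make the sketch runnable.
    fake_llm = lambda prompt: "- make the sky orange\n- add snow on the roofs\n"
    print(decompose_instruction("make it look like winter at sunset", fake_llm))
    print(combine_guidance([0.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [0.0, 3.0]],
                           w_original=2.0, w_specific=1.0))
```

In a real editing model, the noise estimates would be tensors predicted by the denoiser at each step, one per instruction, and the combined estimate would replace the single-instruction estimate in the sampler.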