CAMILA: Context-Aware Masking for Image Editing with Language Alignment
CAMILA addresses the challenge of unreliable image editing by improving robustness to complex instructions, though the contribution is incremental, as it builds on existing text-guided editing frameworks.
Text-guided image editing models tend to follow infeasible or contradictory user instructions naively, which leads to nonsensical outputs. The paper tackles this problem by proposing CAMILA, a context-aware method that validates instruction coherence and applies only relevant edits. It achieves better performance and higher semantic alignment than state-of-the-art models on newly constructed datasets for single- and multi-instruction editing.
Text-guided image editing allows users to transform and synthesize images through natural language instructions, offering considerable flexibility. However, most existing image editing models naively attempt to follow all user instructions, even when those instructions are inherently infeasible or contradictory, which often results in nonsensical output. To address these challenges, we propose a context-aware method for image editing named CAMILA (Context-Aware Masking for Image Editing with Language Alignment). CAMILA is designed to validate the contextual coherence between instructions and the image, ensuring that only relevant edits are applied to the designated regions while non-executable instructions are ignored. For comprehensive evaluation of this new method, we constructed datasets for both single- and multi-instruction image editing that include infeasible requests. Our method achieves better performance and higher semantic alignment than state-of-the-art models, demonstrating its effectiveness in handling complex instruction challenges while preserving image integrity.
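The idea of applying only executable instructions to designated regions can be illustrated with a toy sketch. This is not the CAMILA model: the abstract does not describe its architecture, so the feasibility check here is a deliberately simple stand-in (an instruction counts as executable only if its target object has a region in the image). The names `Instruction`, `split_instructions`, and `plan_edits`, as well as the `target` field and the mask format, are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Instruction:
    text: str    # the natural-language edit request
    target: str  # object the request refers to (hypothetical field)


def split_instructions(instructions, objects_in_image):
    """Partition instructions into executable and ignored ones.

    Stand-in feasibility rule (an assumption, not the paper's method):
    an instruction is executable only if its target appears in the image.
    """
    present = {name.lower() for name in objects_in_image}
    executable, ignored = [], []
    for ins in instructions:
        if ins.target.lower() in present:
            executable.append(ins)
        else:
            ignored.append(ins)
    return executable, ignored


def plan_edits(instructions, region_masks):
    """Pair each executable instruction with the mask of its target region.

    region_masks maps an object name to a mask (any Python object here,
    for example a nested list of 0/1 values).
    """
    masks = {name.lower(): mask for name, mask in region_masks.items()}
    executable, ignored = split_instructions(instructions, masks.keys())
    plan = [(ins.text, masks[ins.target.lower()]) for ins in executable]
    return plan, ignored


if __name__ == "__main__":
    requests = [
        Instruction("make the dog brown", "dog"),
        Instruction("remove the umbrella", "umbrella"),  # no umbrella in image
    ]
    plan, ignored = plan_edits(requests, {"Dog": [[0, 1], [1, 1]]})
    print("apply:", [text for text, _ in plan])
    print("ignore:", [ins.text for ins in ignored])
```

In this sketch the second request is dropped rather than forced onto the image, which mirrors the behavior the abstract describes for non-executable instructions; a real system would need a learned model to judge feasibility and to localize the edit region.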