AbracADDbra: Touch-Guided Object Addition by Decoupling Placement and Editing Subtasks
This work addresses a usability gap in creative tools for users who need intuitive object addition, though its contribution to interactive editing methods is incremental.
The paper tackles object addition that is either ambiguous (text-only prompts) or tedious (mask-based inputs) by introducing AbracADDbra, a framework that uses touch priors for precise placement. It achieves high-fidelity edits, and its placement model significantly outperforms both random placement and general-purpose VLM baselines.
Instruction-based object addition is often hindered by the ambiguity of text-only prompts or the tedious nature of mask-based inputs. To address this usability gap, we introduce AbracADDbra, a user-friendly framework that leverages intuitive touch priors to spatially ground succinct instructions for precise placement. Our efficient, decoupled architecture uses a vision-language transformer for touch-guided placement, followed by a diffusion model that jointly generates the object and an instance mask for high-fidelity blending. To facilitate standardized evaluation, we contribute the Touch2Add benchmark for this interactive task. Extensive evaluations, in which our placement model significantly outperforms both random placement and general-purpose VLM baselines, confirm the framework's ability to produce high-fidelity edits. Furthermore, our analysis reveals a strong correlation between initial placement accuracy and final edit quality, validating our decoupled approach. This work thus paves the way for more accessible and efficient creative tools.
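The decoupled design and the placement-versus-quality analysis described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and not the authors' implementation: `touch_centered_box` is a hypothetical stand-in for the vision-language placement model, the `edit` callable stands in for the diffusion editor, box IoU is assumed as the placement-accuracy measure, and Pearson correlation is assumed as the correlation statistic. The abstract does not specify any of these.

```python
from math import sqrt
from typing import Any, Callable, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def touch_centered_box(touch: Point, size: Point, image_size: Point) -> Box:
    """Hypothetical placement stage: a fixed-size box centered on the touch
    point, shifted if needed so that it stays inside the image.
    Assumes the box is no larger than the image."""
    (cx, cy), (w, h), (img_w, img_h) = touch, size, image_size
    x0 = float(min(max(cx - w / 2, 0), img_w - w))
    y0 = float(min(max(cy - h / 2, 0), img_h - h))
    return (x0, y0, x0 + w, y0 + h)


def add_object(
    image: Any,
    instruction: str,
    touch: Point,
    place: Callable[[Any, str, Point], Box],
    edit: Callable[[Any, str, Box], Any],
) -> Tuple[Any, Box]:
    """Decoupled pipeline: stage 1 predicts where the object goes, stage 2
    edits the image given that box. Each stage can be swapped independently."""
    box = place(image, instruction, touch)
    return edit(image, instruction, box), box


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes (assumed metric)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation, e.g. between per-sample placement IoU and a
    per-sample edit-quality score (assumed statistic)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

Because the two stages communicate only through a box, placement can be scored against a reference box with `box_iou` before any image is generated, which is what makes a placement-versus-quality correlation straightforward to compute.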