CV AI Nov 30, 2024

FreeCond: Free Lunch in the Input Conditions of Text-Guided Inpainting

arXiv:2412.00427v1, 3 citations, h-index: 28
Originality: Incremental advance
AI Analysis

This work addresses a specific deficiency in text-guided inpainting models, one that matters to users who need accurate generation from complex inputs. It is incremental, however, because it builds on existing Stable Diffusion Inpainting (SDI) methods.

The study tackles the failure of Stable Diffusion Inpainting (SDI) to follow both prompt and mask instructions, which it attributes to training bias. The proposed fix improves generation quality, raising the CLIP score by up to 60% for SDI and up to 58% for SDXLI.

In this study, we aim to identify and resolve the deficiency of Stable Diffusion Inpainting (SDI) in following both the prompt and the mask instructions. Because of the training bias introduced by masking, inpainting quality degrades when the prompt instruction and the image condition are unrelated. We therefore conduct a detailed analysis of the internal representations learned by SDI, focusing on how the mask input influences the cross-attention layers. We observe that adapting text key tokens toward the input mask enables the model to paint selectively within the given area. Leveraging these insights, we propose FreeCond, which adjusts only the input mask condition and the image condition. By increasing the latent mask value and modifying the frequency content of the image condition, we align the cross-attention features with the model's training bias and improve generation quality without additional computation, particularly when user inputs are complicated and deviate from the training setup. Extensive experiments demonstrate that FreeCond can enhance any SDI-based model, e.g., yielding up to a 60% and 58% improvement in CLIP score for SDI and SDXLI, respectively.
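As a rough illustration of the two input adjustments the abstract names (raising the latent mask value and changing the frequency content of the image condition), here is a minimal NumPy sketch. It is not the authors' implementation. The function names, the `gain` and `keep_ratio` parameters, and the choice of a low-pass filter as the frequency modification are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def scale_mask(mask, gain=2.0):
    """Raise the value inside the masked region from 1 to `gain`.

    `gain` is a hypothetical parameter; the abstract only says the
    latent mask value is increased.
    """
    return mask * gain


def low_pass(latent, keep_ratio=0.5):
    """Keep only low spatial frequencies of a (C, H, W) array.

    Assumption: a low-pass filter stands in for "modifying the
    frequency of the image condition"; the real filter may differ.
    """
    _, h, w = latent.shape
    # Move the zero-frequency component to the centre of the spectrum.
    spec = np.fft.fftshift(np.fft.fft2(latent, axes=(-2, -1)), axes=(-2, -1))
    yy, xx = np.ogrid[:h, :w]
    cy, cx = h // 2, w // 2
    keep = (np.abs(yy - cy) <= keep_ratio * h / 2) & (
        np.abs(xx - cx) <= keep_ratio * w / 2
    )
    spec = spec * keep  # zero out everything outside the central box
    out = np.fft.ifft2(np.fft.ifftshift(spec, axes=(-2, -1)), axes=(-2, -1))
    return out.real


def adjust_conditions(mask, image_latent, gain=2.0, keep_ratio=0.5):
    """Return the adjusted (mask, image condition) pair; the model is untouched."""
    return scale_mask(mask, gain), low_pass(image_latent, keep_ratio)


# Usage with toy shapes (a 1x8x8 mask and a 4x8x8 image latent).
mask = np.zeros((1, 8, 8))
mask[:, 2:6, 2:6] = 1.0
latent = np.random.default_rng(0).normal(size=(4, 8, 8))
new_mask, new_latent = adjust_conditions(mask, latent)
```

Only the conditioning inputs change in this sketch, which reflects the abstract's claim that the method adds no extra computation to the model itself.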

