Comparison Reveals Commonality: Customized Image Generation through Contrastive Inversion
This work addresses the need for more effective and automated customized image generation, though it appears incremental because it builds on existing inversion and contrastive learning methods.
The paper tackles the problem of extracting a common concept from small image sets for customized generation without relying on manual guidance such as text prompts or masks. It reports balanced, high-level performance in both concept representation and editing, outperforming existing techniques.
The recent demand for customized image generation raises a need for techniques that effectively extract the common concept from small sets of images. Existing methods typically rely on additional guidance, such as text prompts or spatial masks, to capture the common target concept. Unfortunately, relying on manually provided guidance can lead to incomplete separation of auxiliary features, which degrades generation quality. In this paper, we propose Contrastive Inversion, a novel approach that identifies the common concept by comparing the input images without relying on additional information. We train the target token along with the image-wise auxiliary text tokens via contrastive learning, which extracts the well-disentangled true semantics of the target. We then apply disentangled cross-attention fine-tuning to improve concept fidelity without overfitting. Experimental results and analysis demonstrate that our method achieves balanced, high-level performance in both concept representation and editing, outperforming existing techniques.
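The abstract does not state the contrastive objective itself. As a rough illustration of what "contrastive learning" means for embedding vectors in general, the sketch below implements a generic InfoNCE-style loss in plain Python. It is not the paper's loss: the function names `cosine` and `info_nce`, the temperature value, and the roles of anchor, positive, and negatives are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE-style loss (hypothetical stand-in, not the paper's objective).

    The loss is small when `anchor` is similar to `positive` and
    dissimilar to every vector in `negatives`.
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy 2-D embeddings, chosen only to show the direction of the loss.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(aligned < misaligned)
```

In this toy setting the loss is lower when the anchor matches its positive than when it matches a negative, which is the pressure a contrastive objective applies during training. How the target token and the image-wise auxiliary tokens are assigned to these roles is specified in the paper, not here.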