CV | AI | Jun 1, 2023

ViCo: Plug-and-play Visual Condition for Personalized Text-to-image Generation

arXiv:2306.00971v2 | 23 citations | h-index: 50 | Has Code
Originality: Incremental advance
AI Analysis

The paper addresses flexible and scalable deployment of personalized image generation. Its approach avoids the computational burden of fine-tuning the diffusion model, though the advance is incremental: it improves efficiency within existing frameworks.

The paper tackles personalized text-to-image generation by introducing ViCo, a plug-and-play method that integrates visual conditions without fine-tuning the diffusion model. It performs on par with or better than state-of-the-art models while training only about 6% as many parameters as the diffusion U-Net contains.

Personalized text-to-image generation using diffusion models has recently emerged and garnered significant interest. This task learns a novel concept (e.g., a unique toy), illustrated in a handful of images, into a generative model that captures fine visual details and generates photorealistic images based on textual embeddings. In this paper, we present ViCo, a novel lightweight plug-and-play method that seamlessly integrates visual condition into personalized text-to-image generation. ViCo stands out for its unique feature of not requiring any fine-tuning of the original diffusion model parameters, thereby facilitating more flexible and scalable model deployment. This key advantage distinguishes ViCo from most existing models that necessitate partial or full diffusion fine-tuning. ViCo incorporates an image attention module that conditions the diffusion process on patch-wise visual semantics, and an attention-based object mask that comes at no extra cost from the attention module. Despite only requiring light parameter training (~6% compared to the diffusion U-Net), ViCo delivers performance that is on par with, or even surpasses, all state-of-the-art models, both qualitatively and quantitatively. This underscores the efficacy of ViCo, making it a highly promising solution for personalized text-to-image generation without the need for diffusion model fine-tuning. Code: https://github.com/haoosz/ViCo
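The two components named in the abstract, an image attention module over patch-wise features and an object mask read off the attention weights, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). The function names, feature shapes, random projection weights, residual connection, and the above-average thresholding rule are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def image_cross_attention(gen_feats, ref_feats, w_q, w_k, w_v):
    """Cross-attend generated-image patches to reference-image patches.

    gen_feats: (N, d) patch features of the image being denoised (hypothetical).
    ref_feats: (M, d) patch features of the reference image (hypothetical).
    Returns the attended features (N, d) and the attention weights (N, M).
    """
    q = gen_feats @ w_q
    k = ref_feats @ w_k
    v = ref_feats @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return attn @ v, attn

def mask_from_attention(attn):
    """Binary mask over reference patches, reusing the attention weights.

    Assumption: a patch counts as "object" if it receives above-average
    attention. The paper's actual masking rule may differ.
    """
    received = attn.mean(axis=0)  # average attention each reference patch receives
    return received > received.mean()

rng = np.random.default_rng(0)
N, M, d = 16, 16, 8  # toy sizes: a 4x4 patch grid with 8-dim features
gen = rng.normal(size=(N, d))
ref = rng.normal(size=(M, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

out, attn = image_cross_attention(gen, ref, w_q, w_k, w_v)
mask = mask_from_attention(attn)
conditioned = gen + out  # assumed residual connection; base features stay untouched
```

In this sketch only the three projection matrices would be trained while the base features stay fixed, which loosely mirrors the plug-and-play idea, and the mask costs nothing beyond the attention weights already computed.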

Code Implementations: 1 repo
