Bridging CLIP and StyleGAN through Latent Alignment for Image Editing
This work addresses the challenge of efficient and diverse text-driven image editing in computer vision and graphics. It is an incremental improvement over previous methods, which either required test-time optimization or were limited to a single manipulation direction.
The paper tackles text-driven image manipulation by bridging CLIP and StyleGAN through latent alignment, which allows diverse manipulation directions to be mined without any inference-time optimization. The resulting mapping supports GAN inversion, text-to-image generation, and image manipulation, and its effectiveness is demonstrated through qualitative and quantitative comparisons.
Text-driven image manipulation has developed rapidly since the vision-language model CLIP was proposed. Previous work has used CLIP to design text-image consistency objectives for this task. However, these methods require either test-time optimization or cluster analysis of image features, and the latter yields only a single-mode manipulation direction. In this paper, we achieve diverse manipulation direction mining without inference-time optimization by bridging CLIP and StyleGAN through Latent Alignment (CSLA). More specifically, our contributions consist of three parts: 1) a data-free training strategy that trains latent mappers to bridge the latent spaces of CLIP and StyleGAN; 2) temporal relative consistency, proposed to make the mapping more precise by addressing the knowledge distribution bias among different latent spaces; and 3) adaptive style mixing, proposed to refine the mapped latent in S space. With this mapping scheme, we can perform GAN inversion, text-to-image generation, and text-driven image manipulation. Qualitative and quantitative comparisons demonstrate the effectiveness of our method.
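The abstract does not spell out how the data-free training works. One plausible reading, offered here only as an illustrative sketch and not as the authors' implementation, is that training pairs are synthesized by sampling generator latents, rendering them, and embedding the result with CLIP, so that no real image dataset is needed. In the toy below, the "generator" and "CLIP encoder" are random linear maps standing in for the real frozen networks, the mapper is a linear map fitted by least squares, and every name (`synthesize`, `clip_embed`, `train_mapper`, the dimensions) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions, chosen only for illustration.
W_DIM, IMG_DIM, CLIP_DIM = 8, 32, 16

# Hypothetical frozen stand-ins; these are NOT the real StyleGAN or CLIP.
G = rng.normal(size=(W_DIM, IMG_DIM))     # "generator": latent -> image features
E = rng.normal(size=(IMG_DIM, CLIP_DIM))  # "CLIP image encoder": image -> embedding


def synthesize(w):
    """Render toy images from generator latents."""
    return w @ G


def clip_embed(img):
    """Embed toy images into the toy CLIP space."""
    return img @ E


def sample_pairs(n):
    """Data-free pairs: sample latents, render them, embed the renders."""
    w = rng.normal(size=(n, W_DIM))
    return clip_embed(synthesize(w)), w


def train_mapper(n_samples=2000):
    """Fit a linear mapper from CLIP embeddings back to generator latents."""
    c, w = sample_pairs(n_samples)
    # The embeddings span only a W_DIM-dimensional subspace, so an explicit
    # rcond discards the numerically zero singular values.
    mapper, *_ = np.linalg.lstsq(c, w, rcond=1e-8)
    return mapper


M = train_mapper()

# Evaluate on fresh synthetic pairs the mapper has not seen.
c_test, w_test = sample_pairs(100)
err = float(np.mean((c_test @ M - w_test) ** 2))
print(f"held-out latent reconstruction MSE: {err:.3e}")
```

In this toy setting the mapper recovers the latents almost exactly because everything is linear. With real networks the mapping is nonlinear and imperfect, which is presumably where refinements such as the temporal relative consistency and adaptive style mixing named in the abstract come in.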