ITstyler: Image-optimized Text-based Style Transfer
The method enables more efficient text-guided artistic image synthesis for creative applications, though the contribution appears incremental because it builds on the existing VGG and CLIP frameworks.
The paper addresses two limitations of earlier text-based style transfer methods: the need for extra optimization time and the need for text-image paired data. It converts text into the VGG style space using CLIP embeddings, achieving real-time transfer with no optimization at inference.
Text-based style transfer is a newly emerging research topic that uses text information instead of a style image to guide the transfer process, significantly extending the application scenarios of style transfer. However, previous methods require extra time for optimization or text-image paired data, leading to limited effectiveness. In this work, we present a data-efficient text-based style transfer method that does not require optimization at the inference stage. Specifically, we convert text input to the style space of the pre-trained VGG network to realize a more effective style swap. We also leverage CLIP's multi-modal embedding space to learn the text-to-style mapping using only an image dataset. Our method can transfer arbitrary new styles specified by text input in real time and synthesize high-quality artistic images.
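The idea of mapping an embedding into a style space and then swapping styles can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes the style space is represented by channel-wise mean and standard deviation of a feature map (an AdaIN-style choice), uses random arrays as stand-ins for a CLIP embedding and a VGG feature map, and uses a hypothetical linear mapper (`map_embedding_to_style`) with untrained weights. Because CLIP places images and text in one embedding space, such a mapper could in principle be trained on image embeddings only and fed text embeddings at inference.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 512  # assumed size of a CLIP-like embedding
CHANNELS = 64    # assumed number of channels in a VGG-like feature map


def channel_stats(feat, eps=1e-5):
    """Return per-channel mean and std of a (C, H, W) feature map."""
    flat = feat.reshape(feat.shape[0], -1)
    return flat.mean(axis=1), flat.std(axis=1) + eps


def map_embedding_to_style(embedding, weight, bias):
    """Hypothetical linear mapper: embedding -> (style mean, style std)."""
    out = weight @ embedding + bias
    mean, raw_std = out[:CHANNELS], out[CHANNELS:]
    std = np.log1p(np.exp(raw_std))  # softplus keeps the std positive
    return mean, std


def style_swap(content_feat, style_mean, style_std):
    """Replace the content's channel statistics with the target style's."""
    c_mean, c_std = channel_stats(content_feat)
    normalized = (content_feat - c_mean[:, None, None]) / c_std[:, None, None]
    return normalized * style_std[:, None, None] + style_mean[:, None, None]


# Stand-ins: a real pipeline would use a CLIP text encoder and a VGG encoder.
text_embedding = rng.normal(size=EMBED_DIM)
content_feat = rng.normal(size=(CHANNELS, 32, 32))

# Untrained mapper parameters, for illustration only.
weight = rng.normal(scale=0.01, size=(2 * CHANNELS, EMBED_DIM))
bias = np.zeros(2 * CHANNELS)

# A single forward pass, with no per-image optimization loop.
style_mean, style_std = map_embedding_to_style(text_embedding, weight, bias)
stylized = style_swap(content_feat, style_mean, style_std)
```

In a full system the stylized features would be passed to a decoder to produce the output image. The sketch stops at the feature level to show why inference needs only one forward pass.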