Style-Content Disentanglement in Language-Image Pretraining Representations for Zero-Shot Sketch-to-Image Synthesis
This work addresses the challenge of generating images from sketches in an open-domain setting, offering a training-free solution that builds on existing pretrained models.
The authors tackle zero-shot sketch-to-image synthesis by leveraging language-image pretraining representations, achieving results competitive with state-of-the-art models without retraining any parameters and while using only a fraction of the data.
In this work, we propose and validate a framework that leverages language-image pretraining representations for training-free zero-shot sketch-to-image synthesis. We show that disentangled content and style representations can guide image generators, turning them into sketch-to-image generators without (re-)training any parameters. Our disentanglement method is simple: it applies elementary arithmetic to the representations of input sketches, assuming that the information in those representations is compositional. Our results demonstrate that this approach is competitive with state-of-the-art instance-level open-domain sketch-to-image models while depending only on pretrained off-the-shelf models and a fraction of the data.
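To make the compositionality assumption concrete, the following is a minimal toy sketch of what "elementary arithmetic" on representations could look like. It is not the authors' implementation: the abstract does not state the exact operations, so the additive model (sketch embedding = content + sketch style), the function name `restyle`, and the random vectors standing in for real language-image embeddings are all assumptions made for illustration.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def random_unit(dim, rng):
    """Random unit vector; a stand-in for a real embedding (assumption)."""
    return normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])


def restyle(embedding, source_style, target_style):
    """Hypothetical helper: remove one style direction and add another.

    Under the assumed additive model, subtracting the source style
    leaves an estimate of the content, to which the target style is added.
    """
    content_estimate = sub(embedding, source_style)
    return normalize(add(content_estimate, target_style))


rng = random.Random(0)
DIM = 512

# Toy stand-ins for the "what is drawn" and "how it is rendered" components.
content = random_unit(DIM, rng)
sketch_style = random_unit(DIM, rng)
photo_style = random_unit(DIM, rng)

# Assumed compositionality: a sketch embedding is content plus sketch style.
sketch_embedding = add(content, sketch_style)

# Guidance vector that an image generator could be steered towards.
guidance = restyle(sketch_embedding, sketch_style, photo_style)

# Reference: the embedding the same content would have in photo style.
target = normalize(add(content, photo_style))

print("similarity to photo-style target:", cosine(guidance, target))
print("similarity to original sketch:   ", cosine(guidance, sketch_embedding))
```

In this toy setting the arithmetic recovers the photo-style target almost exactly because the embeddings were constructed to be additive. With real pretrained representations, compositionality holds only approximately, so the quality of the resulting guidance is an empirical question.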