DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations
This work addresses style transfer in text-to-image models for users who need precise text control. The contribution is incremental, since it builds on existing diffusion-based approaches.
The paper tackles the loss of text controllability that occurs when text-to-image models transfer a reference style. It introduces DEADiff, which the authors report achieves the best visual stylization results and an optimal balance between text controllability and style similarity, supported by both quantitative and qualitative comparisons.
Diffusion-based text-to-image models harbor immense potential for transferring reference styles. However, current encoder-based approaches significantly impair the text controllability of text-to-image models while transferring styles. In this paper, we introduce DEADiff to address this issue using the following two strategies: 1) A mechanism to decouple the style and semantics of reference images. The decoupled feature representations are first extracted by Q-Formers, which are instructed by different text descriptions. They are then injected into mutually exclusive subsets of cross-attention layers for better disentanglement. 2) A non-reconstructive learning method. The Q-Formers are trained using paired images rather than the identical target: the reference image and the ground-truth image are different images that share the same style or the same semantics. We show that DEADiff attains the best visual stylization results and an optimal balance between the text controllability inherent in the text-to-image model and style similarity to the reference image, as demonstrated both quantitatively and qualitatively. Our project page is https://tianhao-qi.github.io/DEADiff/.
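The two strategies described in the abstract can be illustrated with a small, framework-free Python sketch. It is not the authors' implementation: the names `Image`, `route_layers`, and `non_reconstructive_pairs`, the layer count, the choice of which layer indices receive semantic features, and the toy image labels are all hypothetical and chosen only for illustration. The sketch shows (1) assigning every cross-attention layer to exactly one feature type, so the style and semantic subsets are mutually exclusive, and (2) building training pairs in which the reference and the target are different images that share one attribute.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Image:
    # Hypothetical record; the labels stand in for whatever annotations
    # a real paired dataset would provide.
    name: str
    style: str
    subject: str


def route_layers(num_layers, semantic_layers):
    """Assign each cross-attention layer index to exactly one feature type.

    Layers listed in `semantic_layers` receive semantic features and all
    remaining layers receive style features, so the two subsets are
    disjoint by construction. Which indices to pick is an assumption here.
    """
    semantic = set(semantic_layers)
    if not semantic <= set(range(num_layers)):
        raise ValueError("semantic_layers must be valid layer indices")
    return {
        i: ("semantic" if i in semantic else "style")
        for i in range(num_layers)
    }


def non_reconstructive_pairs(images, shared):
    """Return (reference, target) pairs of distinct images sharing `shared`.

    The reference is never the target itself, so a model trained on these
    pairs cannot succeed by simply reconstructing its input.
    """
    return [
        (ref, tgt)
        for ref, tgt in permutations(images, 2)
        if getattr(ref, shared) == getattr(tgt, shared)
    ]


# Toy data (hypothetical labels).
images = [
    Image("a", "watercolor", "cat"),
    Image("b", "watercolor", "dog"),
    Image("c", "sketch", "cat"),
]

# Six layers with layers 2 and 3 reserved for semantics (assumed split).
routing = route_layers(6, semantic_layers=[2, 3])

# Pairs for the style branch share a style; pairs for the semantic
# branch share a subject.
style_pairs = non_reconstructive_pairs(images, "style")
semantic_pairs = non_reconstructive_pairs(images, "subject")
```

In a real model, the `routing` table would decide which Q-Former output each cross-attention layer attends to, and the two pair lists would feed the corresponding training objectives. The sketch only fixes the bookkeeping, not the network itself.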