CV · Mar 10, 2025

SOYO: A Tuning-Free Approach for Video Style Morphing via Style-Adaptive Interpolation in Diffusion Models

arXiv:2503.06998v1 · 1 citation · h-index: 22
Originality: Incremental advance
AI Analysis

This work addresses the challenge of seamless multi-style transitions in video stylization for applications in creative media and entertainment, representing an incremental improvement over prior methods.

The paper tackles video style morphing, the generation of smooth style transitions in video stylization, by introducing SOYO, a tuning-free diffusion-based framework. SOYO uses attention injection and AdaIN for structural consistency and an adaptive sampling scheduler for balanced style interpolation. According to the authors' experiments, it outperforms existing approaches in preserving structural coherence and achieving stable, smooth style transitions across video frames.

Diffusion models have achieved remarkable progress in image and video stylization. However, most existing methods focus on single-style transfer, while video stylization involving multiple styles necessitates seamless transitions between them. We refer to this smooth style transition between video frames as video style morphing. Current approaches often generate stylized video frames with discontinuous structures and abrupt style changes when handling such transitions. To address these limitations, we introduce SOYO, a novel diffusion-based framework for video style morphing. Our method employs a pre-trained text-to-image diffusion model without fine-tuning, combining attention injection and AdaIN to preserve structural consistency and enable smooth style transitions across video frames. Moreover, we notice that applying linear equidistant interpolation directly induces imbalanced style morphing. To harmonize across video frames, we propose a novel adaptive sampling scheduler operating between two style images. Extensive experiments demonstrate that SOYO outperforms existing methods in open-domain video style morphing, better preserving the structural coherence of video frames while achieving stable and smooth style transitions.
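To make the AdaIN and "linear equidistant interpolation" ideas in the abstract concrete, here is a minimal NumPy sketch. It is not SOYO's implementation: it assumes plain (C, H, W) feature maps rather than diffusion-model latents, leaves out attention injection entirely, and uses equidistant weights from `np.linspace`, which is the baseline the abstract says leads to imbalanced morphing. The function names (`channel_stats`, `adain`, `morph_features`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def channel_stats(feat, eps=1e-5):
    """Per-channel mean and std of a (C, H, W) feature map."""
    mean = feat.mean(axis=(1, 2), keepdims=True)
    std = np.sqrt(feat.var(axis=(1, 2), keepdims=True) + eps)
    return mean, std


def adain(content, style_mean, style_std, eps=1e-5):
    """Standard AdaIN: normalize content per channel, then rescale
    and shift it to the given style statistics."""
    c_mean, c_std = channel_stats(content, eps)
    return style_std * (content - c_mean) / c_std + style_mean


def morph_features(content_frames, style_a, style_b):
    """Apply AdaIN to each frame with style statistics blended
    linearly from style_a (first frame) to style_b (last frame).

    Assumption: equidistant weights. An adaptive scheduler such as
    the one the paper proposes would replace `alphas` with
    non-uniform values; its exact form is not given in the abstract.
    """
    mean_a, std_a = channel_stats(style_a)
    mean_b, std_b = channel_stats(style_b)
    alphas = np.linspace(0.0, 1.0, len(content_frames))
    out = []
    for frame, a in zip(content_frames, alphas):
        mean = (1 - a) * mean_a + a * mean_b
        std = (1 - a) * std_a + a * std_b
        out.append(adain(frame, mean, std))
    return out, alphas


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = [rng.normal(size=(3, 8, 8)) for _ in range(5)]
    style_a = rng.normal(loc=2.0, scale=0.5, size=(3, 8, 8))
    style_b = rng.normal(loc=-1.0, scale=2.0, size=(3, 8, 8))
    stylized, alphas = morph_features(frames, style_a, style_b)
    print(alphas)
```

In this sketch the first frame takes on the channel statistics of the first style and the last frame those of the second, with intermediate frames blended in equal steps. Swapping `alphas` for a non-uniform schedule is the only change needed to experiment with other interpolation schemes.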

