CV · May 17

VISTA: Triplet-Supervised Video Style Transfer with Diffusion Transformers

Yiren Song, Wangzi Yao, Haofan Wang, Mike Zheng Shou

arXiv:2605.17312 · 79.31 citations

Predicted impact: top 21% in CV · last 90 days

Originality: Incremental advance

AI Analysis

For video stylization researchers, this work addresses temporal inconsistency with a principled training paradigm and dataset, though the dataset is synthetic.

Video style transfer suffers from temporal inconsistency due to brittle heuristic propagation. VISTA introduces a synthetic dataset with 1,000 styles and a diffusion-transformer framework with a style adapter, achieving SOTA in style fidelity, temporal consistency, and content preservation.

Video style transfer aims to render videos in a target artistic style while preserving content, structure, and motion. While image stylization has advanced rapidly, video stylization remains challenging due to temporal inconsistency. Most existing methods stylize frames or keyframes and enforce consistency via heuristic temporal propagation, which is brittle under occlusions, disocclusions, and long-term motion, leading to drift and flickering artifacts. We argue that a fundamental bottleneck lies in the lack of large-scale triplet data and a principled training paradigm that jointly models and disentangles style, content, and motion. To address this, we introduce VISTA-1000, a synthetic dataset with 1,000 styles and motion-aligned triplets of style reference, clean video, and stylized video, and propose a diffusion-transformer-based in-context video style transfer framework with a lightweight style adapter for robust style extraction. Extensive experiments demonstrate SOTA performance in style fidelity, temporal consistency, and content preservation.
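To make the triplet structure in the abstract concrete, here is a minimal sketch of how one sample (style reference, clean video, stylized video) might be represented. The class name, field names, and tensor layouts are illustrative assumptions, not the dataset's actual schema. In particular, the sketch assumes the style reference is a single image and that "motion-aligned" implies at least frame-for-frame correspondence between the two videos.

```python
from dataclasses import dataclass
from typing import Tuple

Shape = Tuple[int, ...]


@dataclass(frozen=True)
class StyleTriplet:
    """Hypothetical container for one triplet; all names are illustrative."""

    style_id: int                # assumed index of the style, e.g. 0..999
    style_ref_shape: Shape       # assumed (H, W, C) layout for the style reference
    clean_video_shape: Shape     # assumed (T, H, W, C) layout for the clean video
    stylized_video_shape: Shape  # assumed (T, H, W, C) layout for the stylized video

    def is_motion_aligned(self) -> bool:
        # Assumption: aligned videos share frame count and resolution,
        # so frame t of the clean video corresponds to frame t of the stylized one.
        return (
            len(self.clean_video_shape) == 4
            and self.clean_video_shape == self.stylized_video_shape
        )


aligned = StyleTriplet(0, (256, 256, 3), (16, 256, 256, 3), (16, 256, 256, 3))
misaligned = StyleTriplet(1, (256, 256, 3), (16, 256, 256, 3), (12, 256, 256, 3))
```

A shape check like this only captures the weakest reading of alignment. It says nothing about whether the motion content itself matches between the two videos.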
