CV · Feb 19, 2024

Zero-Shot Video Translation via Token Warping

Haiming Zhu, Yangyang Xu, Jun Yu, Shengfeng He

arXiv:2402.12099v3 · 3.71 citations · h-index: 11 · IEEE Trans Vis Comput Graph

Originality: Incremental advance

AI Analysis

This paper addresses limited user control and visual quality in video generation. It is an incremental improvement over existing diffusion-based approaches.

The paper tackles the problem of temporally coherent video translation by introducing TokenWarping, a framework that uses optical flows to warp query, key, and value patches in diffusion models, resulting in improved visual quality and temporal consistency over state-of-the-art methods.

With the revolution of generative AI, video-related tasks have been widely studied. However, current state-of-the-art video models still lag behind image models in visual quality and user control over generated content. In this paper, we introduce TokenWarping, a novel framework for temporally coherent video translation. Existing diffusion-based video editing approaches rely solely on key and value patches in self-attention to ensure temporal consistency, often sacrificing the preservation of local and structural regions. Critically, these methods overlook the significance of the query patches in achieving accurate feature aggregation and temporal coherence. In contrast, TokenWarping leverages complementary token priors by constructing temporal correlations across different frames. Our method begins by extracting optical flows from source videos. During the denoising process of the diffusion model, these optical flows are used to warp the previous frame's query, key, and value patches, aligning them with the current frame's patches. By directly warping the query patches, we enhance feature aggregation in self-attention, while warping the key and value patches ensures temporal consistency across frames. This token warping imposes explicit constraints on the self-attention layer outputs, effectively ensuring temporally coherent translation. Our framework does not require any additional training or fine-tuning and can be seamlessly integrated with existing text-to-image editing methods. We conduct extensive experiments on various video translation tasks, demonstrating that TokenWarping surpasses state-of-the-art methods both qualitatively and quantitatively. Video demonstrations are available in supplementary materials.
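The warping step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not say how the warped tokens are combined with the current frame's tokens, so the sketch makes several assumptions: the flow is a backward flow given in token-grid units, sampling is nearest-neighbour, the warped query is blended with the current query using a hypothetical weight `alpha`, and the warped keys and values are concatenated with the current ones. The names `warp_tokens` and `warped_self_attention` are illustrative only.

```python
import numpy as np


def warp_tokens(tokens, flow):
    """Backward-warp a token grid with a flow field (nearest-neighbour sampling).

    tokens: (H, W, C) tokens taken from the previous frame.
    flow:   (H, W, 2) displacements (dx, dy) in token units. ASSUMPTION: for
            each position in the current frame, the flow points at the
            matching position in the previous frame.
    """
    h, w, _ = tokens.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Round to the nearest source token and clamp to the grid borders.
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return tokens[src_y, src_x]


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def warped_self_attention(q_cur, k_cur, v_cur, q_prev, k_prev, v_prev,
                          flow, alpha=0.5):
    """Single-head self-attention for the current frame using warped tokens.

    All token inputs have shape (H, W, C). The combination rules below
    (query blending, key/value concatenation) are ASSUMPTIONS made for
    illustration; they are not taken from the paper.
    """
    h, w, c = q_cur.shape
    # Align the previous frame's tokens with the current frame.
    q_warp = warp_tokens(q_prev, flow)
    k_warp = warp_tokens(k_prev, flow)
    v_warp = warp_tokens(v_prev, flow)

    # ASSUMPTION: blend current and warped queries with weight alpha.
    q = ((1.0 - alpha) * q_cur + alpha * q_warp).reshape(h * w, c)
    # ASSUMPTION: attend over both current and warped keys/values.
    k = np.concatenate([k_cur.reshape(-1, c), k_warp.reshape(-1, c)], axis=0)
    v = np.concatenate([v_cur.reshape(-1, c), v_warp.reshape(-1, c)], axis=0)

    attn = softmax(q @ k.T / np.sqrt(c), axis=-1)
    return (attn @ v).reshape(h, w, c)
```

In this sketch, a zero flow applied to two identical frames gives the same result as ordinary self-attention on one frame, so the warped tokens only change the output where the frames or the flow differ.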
