CVMar 24

ViBe: Ultra-High-Resolution Video Synthesis Born from Pure Images

Yunfeng Wu, Hongying Cheng, Zihao He, Songhua Liu

arXiv:2603.2332626.9Has Code

Predicted impact top 35% in CV · last 90 daysOriginality Incremental advance

AI Analysis

This addresses the problem of expensive video synthesis for researchers and practitioners by offering an incremental efficiency gain.

The paper tackles the high computational cost of training transformer-based video diffusion models for ultra-high-resolution videos by proposing a pure image adaptation framework that upgrades pre-trained models without video data, achieving a 0.8 improvement on the VBench benchmark.

Transformer-based video diffusion models rely on 3D attention over spatial and temporal tokens, which incurs quadratic time and memory complexity and makes end-to-end training for ultra-high-resolution videos prohibitively expensive. To overcome this bottleneck, we propose a pure image adaptation framework that upgrades a video Diffusion Transformer pre-trained at its native scale to synthesize higher-resolution videos. Unfortunately, naively fine-tuning with high-resolution images alone often introduces noticeable noise due to the image-video modality gap. To address this, we decouple the learning objective to separately handle modality alignment and spatial extrapolation. At the core of our approach is Relay LoRA, a two-stage adaptation strategy. In the first stage, the video diffusion model is adapted to the image domain using low-resolution images to bridge the modality gap. In the second stage, the model is further adapted with high-resolution images to acquire spatial extrapolation capability. During inference, only the high-resolution adaptation is retained to preserve the video generation modality while enabling high-resolution video synthesis. To enhance fine-grained detail synthesis, we further propose a High-Frequency-Awareness-Training-Objective, which explicitly encourages the model to recover high-frequency components from degraded latent representations via a dedicated reconstruction loss. Extensive experiments demonstrate that our method produces ultra-high-resolution videos with rich visual details without requiring any video training data, even outperforming previous state-of-the-art models trained on high-resolution videos by 0.8 on the VBench benchmark. Code will be available at https://github.com/WillWu111/ViBe.

View on arXiv PDF Code

Similar