CV · Feb 2

FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space

arXiv:2602.02092v1 · 3 citations · h-index: 14 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the computational bottleneck in video generation, a concern for AI researchers and practitioners alike. It appears incremental, however, because it builds on existing diffusion transformer architectures.

The authors tackled the problem of slow video generation by developing FSVideo, a transformer-based image-to-video diffusion model that operates in a highly-compressed latent space, achieving competitive performance while being an order of magnitude faster than other open-source models.

We introduce FSVideo, a fast speed transformer-based image-to-video (I2V) diffusion framework. We build our framework on the following key components: 1) a new video autoencoder with highly-compressed latent space ($64\times64\times4$ spatial-temporal downsampling ratio), achieving competitive reconstruction quality; 2) a diffusion transformer (DIT) architecture with a new layer memory design to enhance inter-layer information flow and context reuse within DIT; and 3) a multi-resolution generation strategy via a few-step DIT upsampler to increase video fidelity. Our final model, which contains a 14B DIT base model and a 14B DIT upsampler, achieves competitive performance against other popular open-source models, while being an order of magnitude faster. We discuss our model design as well as training strategies in this report.
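
To give a feel for what a $64\times64\times4$ downsampling ratio means in practice, the sketch below computes the latent grid for a clip. It reads the ratio as 64× along each spatial axis and 4× along time, and it uses plain ceiling division; the autoencoder's actual padding or causal-frame handling is not stated in the abstract, so that is an assumption. The function names, the example clip size, and the 8×8×4 comparison ratio are all hypothetical and chosen only for illustration.

```python
import math

def latent_shape(frames, height, width, ratio=(4, 64, 64)):
    """Return the (T, H, W) latent grid for a video.

    `ratio` is (temporal, spatial_h, spatial_w). Ceiling division is an
    assumption; the real autoencoder may pad or treat the first frame differently.
    """
    rt, rh, rw = ratio
    return (math.ceil(frames / rt), math.ceil(height / rh), math.ceil(width / rw))

def latent_positions(shape):
    """Number of spatio-temporal positions in a latent grid."""
    t, h, w = shape
    return t * h * w

# Hypothetical clip: 128 frames at 768x1280 pixels.
compressed = latent_shape(128, 768, 1280)                  # 64x64x4 ratio from the abstract
baseline = latent_shape(128, 768, 1280, ratio=(4, 8, 8))   # hypothetical 8x8x4 comparison

print(compressed, latent_positions(compressed))
print(baseline, latent_positions(baseline))
print(latent_positions(baseline) // latent_positions(compressed))
```

Under these assumptions the compressed grid has 64 times fewer positions than the hypothetical 8×8×4 grid for the same clip. Because attention cost in a transformer grows faster than linearly with sequence length, a smaller grid of this kind is consistent with the speedup the abstract reports, although the abstract does not break the speedup down by component.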

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
