CV, AI, LG | Mar 17, 2025

Training Video Foundation Models with NVIDIA NeMo

Zeeshan Patel, Ethan He, Parth Mannan, Xiaowei Ren, Ryan Wolf, Niket Agarwal, Jacob Huffman, Zhuoyao Wang, Carl Wang, Jack Chang, Yan Bai, Tommy Huang

arXiv:2503.12964v1 | h-index: 25 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses efficient training of Video Foundation Models (VFMs) for AI developers and researchers. It appears incremental: it builds on existing methods, and its contribution lies in scalability and best practices.

The paper tackles the challenge of training large-scale, high-quality Video Foundation Models (VFMs) by presenting a scalable, open-source training pipeline using NVIDIA NeMo, which includes accelerated dataset curation, multimodal data loading, and parallelized training and inference for video diffusion models.

Video Foundation Models (VFMs) have recently been used to simulate the real world to train physical AI systems and develop creative visual experiences. However, training large-scale VFMs that can generate high-quality videos poses significant challenges. We present a scalable, open-source VFM training pipeline with NVIDIA NeMo, providing accelerated video dataset curation, multimodal data loading, and parallelized video diffusion model training and inference. We also provide a comprehensive performance analysis highlighting best practices for efficient VFM training and inference.
