CV | Jun 5, 2025

ContentV: Efficient Training of Video Generation Models with Limited Compute

arXiv:2506.05343v2 | 5 citations | h-index: 4
Originality: Highly original
AI Analysis

This work addresses efficient training of video generation models, a problem that matters to researchers and practitioners with limited compute. It represents a strong, specific gain rather than a foundational advance.

The paper tackles the high computational cost of video generation by introducing ContentV, an 8B-parameter text-to-video model that reports state-of-the-art performance (85.14 on VBench) after only four weeks of training on 256 NPUs with 64 GB of memory each.
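To put the stated training budget in more familiar units, the following is a rough back-of-the-envelope sketch. It assumes the 256 devices ran continuously for exactly four 7-day weeks, which the source does not confirm; the function name `training_budget` and its fields are hypothetical and chosen for illustration only.

```python
def training_budget(num_devices, memory_gb_per_device, weeks):
    """Convert a cluster size and duration into aggregate figures.

    Assumption: every device is busy 24 hours a day for the whole
    period, so the result is an upper bound on device-hours.
    """
    wall_clock_hours = weeks * 7 * 24
    return {
        "wall_clock_hours": wall_clock_hours,
        "device_hours": num_devices * wall_clock_hours,
        "total_memory_gb": num_devices * memory_gb_per_device,
    }


if __name__ == "__main__":
    # Figures taken from the summary: 256 NPUs, 64 GB each, four weeks.
    budget = training_budget(num_devices=256, memory_gb_per_device=64, weeks=4)
    for name, value in budget.items():
        print(f"{name}: {value:,}")
```

Under these assumptions the run comes to at most 172,032 NPU-hours with 16,384 GB of aggregate device memory.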

Recent advances in video generation demand increasingly efficient training recipes to mitigate escalating computational costs. In this report, we present ContentV, an 8B-parameter text-to-video model that achieves state-of-the-art performance (85.14 on VBench) after training on 256 x 64GB Neural Processing Units (NPUs) for merely four weeks. ContentV generates diverse, high-quality videos across multiple resolutions and durations from text prompts, enabled by three key innovations: (1) A minimalist architecture that maximizes reuse of pre-trained image generation models for video generation; (2) A systematic multi-stage training strategy leveraging flow matching for enhanced efficiency; and (3) A cost-effective reinforcement learning with human feedback framework that improves generation quality without requiring additional human annotations. All the code and models are available at: https://contentv.github.io.
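The abstract names flow matching as the training objective but does not spell it out. As a minimal sketch of the generic technique, and not of ContentV's actual implementation, the code below builds one training pair using a straight-line path between a noise sample and a data sample, with the velocity `data - noise` as the regression target. The direction convention (t = 0 is noise, t = 1 is data), the flat list standing in for a video latent, and the names `make_flow_matching_pair` and `flow_matching_loss` are all assumptions made for illustration.

```python
import random


def make_flow_matching_pair(data, noise, t):
    """Return (x_t, target_velocity) for a linear noise-to-data path.

    x_t interpolates between noise (t = 0) and data (t = 1); the
    target is the constant velocity along that straight line.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if len(data) != len(noise):
        raise ValueError("data and noise must have the same length")
    x_t = [(1.0 - t) * n + t * d for d, n in zip(data, noise)]
    target_velocity = [d - n for d, n in zip(data, noise)]
    return x_t, target_velocity


def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    squared = [(p - q) ** 2 for p, q in zip(predicted_velocity, target_velocity)]
    return sum(squared) / len(squared)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [0.5, -1.0, 2.0]  # stand-in for a flattened video latent
    noise = [rng.gauss(0.0, 1.0) for _ in data]
    t = rng.random()  # time sampled uniformly from [0, 1)
    x_t, target = make_flow_matching_pair(data, noise, t)
    # A real model would predict the velocity from (x_t, t, text prompt);
    # here a zero prediction stands in for an untrained network.
    print(flow_matching_loss([0.0] * len(target), target))
```

In a real training loop the zero prediction would be replaced by the network's output, and the loss would be averaged over a batch of latents and sampled times.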

