CV | Aug 5, 2024

VidGen-1M: A Large-Scale Dataset for Text-to-video Generation

arXiv:2408.02629v1 | 59 citations | h-index: 14
Originality: Synthesis-oriented
AI Analysis

This paper addresses the lack of appropriate training datasets for text-to-video generation, a bottleneck for researchers and developers working on AI video synthesis. The contribution is incremental: it focuses on dataset curation rather than a new model.

The authors tackle the problem of inadequate training datasets for text-to-video models by introducing VidGen-1M, a large-scale dataset of high-quality videos with detailed captions. They report that a video generation model trained on it produces experimental results surpassing those of other models.

The quality of video-text pairs fundamentally determines the upper bound of text-to-video models. Currently, the datasets used for training these models suffer from significant shortcomings, including low temporal consistency, poor-quality captions, substandard video quality, and imbalanced data distribution. The prevailing video curation process, which depends on image models for tagging and manual rule-based curation, leads to a high computational load and leaves behind unclean data. As a result, there is a lack of appropriate training datasets for text-to-video models. To address this problem, we present VidGen-1M, a superior training dataset for text-to-video models. Produced through a coarse-to-fine curation strategy, this dataset guarantees high-quality videos and detailed captions with excellent temporal consistency. When used to train the video generation model, this dataset has led to experimental results that surpass those obtained with other models.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
