CV · Oct 31, 2024

Learning Video Representations without Natural Videos

arXiv:2410.24213v2 · 2 citations · h-index: 16
Originality: Highly original
AI Analysis

The approach offers a more controllable and transparent alternative to video data curation for pre-training, which benefits computer vision researchers and practitioners.

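The paper tackles the problem of learning video representations without natural videos by pre-training on synthetic videos and natural images. A VideoMAE model pre-trained on the synthetic videos closes 97.2% of the UCF101 action-classification performance gap between training from scratch and self-supervised pre-training on natural videos, and it outperforms the natural-video pre-trained model on HMDB51. Adding crops of static images yields performance similar to UCF101 pre-training and outperforms the UCF101 pre-trained model on 11 of the 14 out-of-distribution datasets of UCF101-P.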

We show that useful video representations can be learned from synthetic videos and natural images, without incorporating natural videos in the training. We propose a progression of video datasets synthesized by simple generative processes that model a growing set of natural video properties (e.g., motion, acceleration, and shape transformations). The downstream performance of video models pre-trained on these generated datasets gradually increases with the dataset progression. A VideoMAE model pre-trained on our synthetic videos closes 97.2% of the performance gap on UCF101 action classification between training from scratch and self-supervised pre-training from natural videos, and outperforms the pre-trained model on HMDB51. Introducing crops of static images to the pre-training stage results in similar performance to UCF101 pre-training and outperforms the UCF101 pre-trained model on 11 out of 14 out-of-distribution datasets of UCF101-P. Analyzing the low-level properties of the datasets, we identify correlations between frame diversity, frame similarity to natural data, and downstream performance. Our approach provides a more controllable and transparent alternative to video data curation processes for pre-training.
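The "closes 97.2% of the performance gap" figure is a relative measure, and a minimal sketch of how such a number is usually computed may help. The function name `gap_closed` and the accuracies below are hypothetical placeholders, not values or code from the paper; the sketch assumes the gap is measured between a from-scratch baseline and a natural-video pre-trained reference, as the abstract describes.

```python
def gap_closed(scratch_acc, reference_acc, candidate_acc):
    """Fraction of the scratch-to-reference accuracy gap recovered by the candidate.

    scratch_acc:   accuracy when training from scratch (lower baseline)
    reference_acc: accuracy after pre-training on natural videos (upper reference)
    candidate_acc: accuracy after pre-training on the alternative data
    """
    gap = reference_acc - scratch_acc
    if gap == 0:
        raise ValueError("scratch and reference accuracies are equal; the gap is undefined")
    return (candidate_acc - scratch_acc) / gap


# Placeholder accuracies for illustration only, NOT results from the paper.
print(f"{gap_closed(50.0, 90.0, 80.0):.1%}")  # prints 75.0%
```

Under this definition, a value above 100% means the candidate outperforms the reference model, which is how a result such as the HMDB51 one would show up.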
