CV · AI · LG · Dec 19, 2024

Scaling 4D Representations

arXiv:2412.15212v2 · 25 citations · h-index: 43 · Has Code
AI Analysis

This work addresses the challenge of scaling self-supervised learning for video-based spatial and temporal tasks. The contribution is incremental: it builds on existing methods such as MAE and applies them to new task domains.

The paper tackles the problem of scaling self-supervised learning for video by focusing on non-semantic 4D tasks such as camera pose estimation and depth estimation. It shows that masked auto-encoding with transformer models scales effectively, with performance improving as model size increases up to 22B parameters.

Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused evaluations on semantic-related tasks – action classification, ImageNet classification, etc. In this paper we focus on evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation. We show that by learning from very large video datasets, masked auto-encoding (MAE) with transformer video models actually scales, consistently improving performance on these 4D tasks, as model size increases from 20M all the way to the largest by far reported self-supervised video model – 22B parameters. Rigorous apples-to-apples comparison with many recent image and video models demonstrates the benefits of scaling 4D representations. Pretrained models are available at https://github.com/google-deepmind/representations4d .
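
For readers unfamiliar with masked auto-encoding on video, the following is a minimal NumPy sketch of the general idea: split a clip into space-time patches, hide most of them, and score a reconstruction only on the hidden patches. It is not the paper's implementation. The clip size, patch size, mask ratio, and the function names `patchify`, `random_mask`, and `masked_mse` are all assumptions chosen for illustration, and the all-zeros "prediction" stands in for a real decoder.

```python
import numpy as np

def patchify(video, pt, ph, pw):
    """Split a (T, H, W, C) clip into flattened space-time patches."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # patch-grid axes first, then within-patch axes
    return x.reshape(-1, pt * ph * pw * C)

def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean array; True marks a patch hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

def masked_mse(pred, target, mask):
    """Mean squared error computed over the masked patches only."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

rng = np.random.default_rng(0)
video = rng.random((16, 32, 32, 3))           # hypothetical clip: 16 frames of 32x32 RGB
patches = patchify(video, pt=2, ph=8, pw=8)   # hypothetical patch size
mask = random_mask(len(patches), mask_ratio=0.9, rng=rng)  # illustrative ratio
visible = patches[~mask]                      # an encoder would see only these patches
pred = np.zeros_like(patches)                 # stand-in for a decoder's reconstruction
loss = masked_mse(pred, patches, mask)
```

In a real system the encoder and decoder would be transformers and the loss would be minimized by gradient descent; the sketch only shows the data flow that makes the objective self-supervised.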
