CV · AI · Dec 23, 2025

How Much 3D Do Video Foundation Models Encode?

arXiv:2512.19949v1 · 5 citations · h-index: 19
Originality: Synthesis-oriented
AI Analysis

The paper offers insights for building scalable 3D models by evaluating existing video models. Its contribution is incremental, since it benchmarks existing models rather than proposing new methods.

The study quantifies the 3D understanding of Video Foundation Models (VidFMs) pretrained on large-scale video data, showing that state-of-the-art video generation models exhibit strong 3D awareness that can even surpass that of expert models trained for 3D tasks.

Videos are continuous 2D projections of 3D worlds. After training on large video data, will global 3D understanding naturally emerge? We study this by quantifying the 3D understanding of existing Video Foundation Models (VidFMs) pretrained on vast video data. We propose the first model-agnostic framework that measures the 3D awareness of various VidFMs by estimating multiple 3D properties from their features via shallow read-outs. Our study presents meaningful findings regarding the 3D awareness of VidFMs on multiple axes. In particular, we show that state-of-the-art video generation models exhibit a strong understanding of 3D objects and scenes, despite not being trained on any 3D data. Such understanding can even surpass that of large expert models specifically trained for 3D tasks. Our findings, together with the 3D benchmarking of major VidFMs, provide valuable observations for building scalable 3D models.
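To make the idea of a "shallow read-out" concrete, here is a minimal sketch of a linear probe fitted on frozen features. It is not the paper's implementation: the ridge-regression probe, the function names `fit_linear_readout` and `apply_readout`, and the synthetic features and depth targets are all assumptions chosen for illustration. In practice the features would come from a frozen VidFM and the targets from annotated 3D properties.

```python
import numpy as np


def fit_linear_readout(features, targets, l2=1e-3):
    """Fit a ridge-regression probe (hypothetical helper) on frozen features."""
    # Append a constant column so the probe can learn a bias term.
    x = np.hstack([features, np.ones((features.shape[0], 1))])
    reg = l2 * np.eye(x.shape[1])
    reg[-1, -1] = 0.0  # leave the bias unregularized
    # Closed-form solution of (X^T X + reg) W = X^T Y.
    return np.linalg.solve(x.T @ x + reg, x.T @ targets)


def apply_readout(weights, features):
    """Predict the probed property from features with the fitted weights."""
    x = np.hstack([features, np.ones((features.shape[0], 1))])
    return x @ weights


rng = np.random.default_rng(0)
n_train, n_test, dim = 512, 128, 32

# Assumption: synthetic stand-ins for frozen per-patch features and depth.
feats = rng.normal(size=(n_train + n_test, dim))
true_w = rng.normal(size=(dim, 1))
depth = feats @ true_w + 0.05 * rng.normal(size=(n_train + n_test, 1))

# The backbone is never updated; only the probe weights are fitted.
w = fit_linear_readout(feats[:n_train], depth[:n_train])
pred = apply_readout(w, feats[n_train:])
rmse = float(np.sqrt(np.mean((pred - depth[n_train:]) ** 2)))
print(f"held-out RMSE: {rmse:.3f}")
```

Because the probe has so little capacity, a low held-out error suggests the property is already encoded in the features rather than learned by the read-out itself.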

