CV · Apr 24

SS3D: End2End Self-Supervised 3D from Web Videos

arXiv:2604.22686
AI Analysis

This work addresses the challenge of scaling self-supervised 3D learning to unconstrained web video, enabling robust depth, ego-motion, and intrinsics estimation without labeled data.

SS3D introduces a self-supervised pretraining pipeline for monocular 3D estimation from web-scale video. Pretrained on YouTube-8M (~100M frames after filtering), it achieves strong cross-domain zero-shot transfer and better fine-tuning performance than prior self-supervised baselines.

We present SS3D, a web-scale SfM-based self-supervision pretraining pipeline for feed-forward 3D estimation from monocular video. Our model jointly predicts depth, ego-motion, and intrinsics in a single forward pass and is trained/evaluated as a coherent end-to-end 3D estimator. To stabilize joint learning, we use an intrinsics-first two-stage schedule and a unified single-checkpoint evaluation protocol. Scaling SfM self-supervision to unconstrained web video is challenging due to weak multi-view observability and strong corpus heterogeneity; we address these with a multi-view signal proxy (MVS) used for filtering and curriculum sampling, and with expert training distilled into a single student. Pretraining on YouTube-8M (~100M frames after filtering) yields strong cross-domain zero-shot transfer and improved fine-tuning performance over prior self-supervised baselines. We release the pretrained checkpoint and code.
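The abstract does not spell out the training objective. As a hedged illustration of how predicted depth, ego-motion, and intrinsics typically interact in SfM-style self-supervision, the sketch below reprojects one pixel from a target frame into a neighbouring source frame under a pinhole camera model. The function name `reproject_pixel`, its argument layout, and the convention that the pose maps target-camera coordinates to source-camera coordinates are assumptions made for this example and are not taken from the SS3D code.

```python
def reproject_pixel(u, v, depth, fx, fy, cx, cy, R, t):
    """Map pixel (u, v) of the target frame into a source frame.

    Hypothetical helper for illustration only (not the SS3D API).
    depth: predicted depth at (u, v) in the target frame.
    fx, fy, cx, cy: predicted pinhole intrinsics.
    R, t: assumed relative pose (3x3 nested list, length-3 list) that
          takes target-camera coordinates to source-camera coordinates.
    Returns (u', v') in the source frame, or None if the point falls
    behind the source camera.
    """
    # Back-project the pixel to a 3D point in the target camera frame.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    z = depth

    # Apply the relative camera motion.
    xs = R[0][0] * x + R[0][1] * y + R[0][2] * z + t[0]
    ys = R[1][0] * x + R[1][1] * y + R[1][2] * z + t[1]
    zs = R[2][0] * x + R[2][1] * y + R[2][2] * z + t[2]

    if zs <= 0:
        return None  # not visible from the source view

    # Project back to pixel coordinates with the same intrinsics.
    return (fx * xs / zs + cx, fy * ys / zs + cy)


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# With no camera motion the pixel maps (up to rounding) onto itself.
same = reproject_pixel(70.0, 30.0, 3.0, 100.0, 100.0, 50.0, 40.0,
                       IDENTITY, [0.0, 0.0, 0.0])

# A sideways translation shifts the pixel by fx * t_x / depth.
shifted = reproject_pixel(50.0, 50.0, 2.0, 100.0, 100.0, 50.0, 50.0,
                          IDENTITY, [1.0, 0.0, 0.0])
```

In pipelines of this kind, the source image is usually sampled at the reprojected location and compared with the target pixel, so an error in any of the three predictions shows up as a photometric mismatch. Whether SS3D uses exactly this formulation is not stated in the abstract.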

