CV · GR · RO · IV · Aug 12, 2025

ViPE: Video Pose Engine for 3D Geometric Perception

NVIDIA, U of Toronto
arXiv:2508.10934v1 · 95 citations · h-index: 44 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses 3D geometric perception for spatial AI systems by providing a versatile tool and a large annotated dataset, though it appears incremental because it builds on existing pose estimation methods.

The paper tackles the challenge of acquiring consistent 3D annotations from in-the-wild videos by introducing ViPE, a video processing engine that estimates camera intrinsics, camera motion, and dense depth maps from unconstrained videos. It outperforms existing uncalibrated pose estimation baselines by 18%/50% on TUM/KITTI sequences and runs at 3-5 FPS on a single GPU.

Accurate 3D geometric perception is an important prerequisite for a wide range of spatial AI systems. While state-of-the-art methods depend on large-scale training data, acquiring consistent and precise 3D annotations from in-the-wild videos remains a key challenge. In this work, we introduce ViPE, a handy and versatile video processing engine designed to bridge this gap. ViPE efficiently estimates camera intrinsics, camera motion, and dense, near-metric depth maps from unconstrained raw videos. It is robust to diverse scenarios, including dynamic selfie videos, cinematic shots, and dashcam footage, and supports various camera models such as pinhole, wide-angle, and 360° panoramas. We evaluate ViPE on multiple benchmarks. Notably, it outperforms existing uncalibrated pose estimation baselines by 18%/50% on TUM/KITTI sequences, and runs at 3-5 FPS on a single GPU for standard input resolutions. We use ViPE to annotate a large-scale collection of videos. This collection includes around 100K real-world internet videos, 1M high-quality AI-generated videos, and 2K panoramic videos, totaling approximately 96M frames, all annotated with accurate camera poses and dense depth maps. We open-source ViPE and the annotated dataset with the hope of accelerating the development of spatial AI systems.
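
To make the three per-frame outputs named in the abstract concrete (camera intrinsics, camera pose, dense depth), the sketch below shows how they are typically combined to lift a depth map into world-space 3D points. It is an illustration under stated assumptions, not ViPE's API: the function name `unproject_depth`, the z-depth interpretation, and the 4x4 camera-to-world pose convention are all hypothetical, and only the pinhole model is covered (wide-angle and 360° cameras need a different projection).

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def unproject_depth(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    cam_to_world: Sequence[Sequence[float]],
) -> List[Point]:
    """Lift a dense depth map to world-space points with a pinhole model.

    Assumptions (not taken from the paper): depth[v][u] is z-depth along the
    optical axis at pixel column u, row v; cam_to_world is a 4x4 row-major
    rigid transform; non-positive depth marks an invalid pixel.
    """
    T = cam_to_world
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # skip invalid depth
            # Back-project pixel (u, v) into the camera frame.
            x = (u - cx) / fx * z
            y = (v - cy) / fy * z
            # Rotate and translate into the world frame.
            wx = T[0][0] * x + T[0][1] * y + T[0][2] * z + T[0][3]
            wy = T[1][0] * x + T[1][1] * y + T[1][2] * z + T[1][3]
            wz = T[2][0] * x + T[2][1] * y + T[2][2] * z + T[2][3]
            points.append((wx, wy, wz))
    return points


if __name__ == "__main__":
    # Toy 2x2 depth map and a pose that shifts the camera 1 unit along x.
    pose = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    for p in unproject_depth([[2.0, 2.0], [2.0, 0.0]], 1.0, 1.0, 0.0, 0.0, pose):
        print(p)
```

Because the abstract describes the depth as near-metric, points produced this way should be read as approximately, not exactly, in metric units.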
