CV RO · Nov 14, 2024

Self-Supervised Monocular 4D Scene Reconstruction for Egocentric Videos

Chengbo Yuan, Geng Chen, Li Yi, Yang Gao

arXiv:2411.09145v46.53 citations · h-index: 3

Originality: Incremental advance

AI Analysis

This work addresses the challenge of understanding geometry and dynamics in egocentric videos for computer vision and robotics, where labeled data is scarce. The contribution is incremental, since it extends existing methods to a new domain.

The paper tackles dense scene reconstruction from egocentric videos by proposing EgoMono4D, a self-supervised model that unifies the estimation of camera intrinsics, camera poses, and depth. It reports better point cloud sequence reconstruction than the baselines it compares against.

Egocentric videos provide valuable insights into human interactions with the physical world, which has sparked growing interest in the computer vision and robotics communities. A critical challenge in fully understanding the geometry and dynamics of egocentric videos is dense scene reconstruction. However, the lack of high-quality labeled datasets in this field has hindered the effectiveness of current supervised learning methods. In this work, we aim to address this issue by exploring a self-supervised dynamic scene reconstruction approach. We introduce EgoMono4D, a novel model that unifies the estimation of multiple variables necessary for Egocentric Monocular 4D reconstruction, including camera intrinsics, camera poses, and video depth, all within a fast feed-forward framework. Starting from a pretrained single-frame depth and intrinsics estimation model, we extend it with camera pose estimation and align multi-frame results on large-scale unlabeled egocentric videos. We evaluate EgoMono4D in both in-domain and zero-shot generalization settings, achieving superior performance in dense point cloud sequence reconstruction compared to all baselines. EgoMono4D represents the first attempt to apply self-supervised learning for point cloud sequence reconstruction to the label-scarce egocentric field, enabling fast, dense, and generalizable reconstruction. The interactive visualization, code, and trained models are released at https://egomono4d.github.io/
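
To make concrete how per-frame depth, camera intrinsics, and camera poses combine into a point cloud sequence, here is a minimal sketch of standard pinhole back-projection. It is not taken from the EgoMono4D code release: the function names, the (fx, fy, cx, cy) intrinsics layout, and the camera-to-world pose convention (R, t) are all assumptions made for illustration.

```python
# Illustrative sketch only: standard pinhole back-projection, not EgoMono4D's code.
# All names (unproject_frame, reconstruct_sequence, fx, fy, cx, cy) are hypothetical.

def unproject_frame(depth, fx, fy, cx, cy, cam_to_world):
    """Lift a dense depth map to world-frame 3D points.

    depth: list of rows; depth[v][u] is the depth along the camera z-axis.
    cam_to_world: (R, t), with R a 3x3 rotation (list of rows) and t a 3-vector.
    """
    R, t = cam_to_world
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            # Pixel -> camera coordinates via the inverse pinhole model.
            x = (u - cx) / fx * z
            y = (v - cy) / fy * z
            cam = (x, y, z)
            # Camera -> world coordinates: p_world = R @ p_cam + t.
            world = tuple(
                sum(R[i][j] * cam[j] for j in range(3)) + t[i] for i in range(3)
            )
            points.append(world)
    return points


def reconstruct_sequence(depths, intrinsics, poses):
    """Return one point cloud per frame, all expressed in a shared world frame."""
    fx, fy, cx, cy = intrinsics
    return [unproject_frame(d, fx, fy, cx, cy, p) for d, p in zip(depths, poses)]
```

Because every frame is mapped into the same world frame, stacking the per-frame outputs over time gives the "4D" point cloud sequence the abstract refers to. A practical implementation would vectorize this over whole images rather than loop per pixel.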
