CV · May 7, 2020

Self-Supervised Human Depth Estimation from Monocular Videos

arXiv:2005.03358v1 · 33 citations
Originality: Incremental advance
AI Analysis

This work removes the need to collect ground-truth depth data for human depth estimation, which simplifies training and helps the learned network generalize across computer vision applications.

The paper tackles the problem of estimating detailed human depth from monocular videos without requiring ground-truth depth data. Its self-supervised method minimizes a photo-consistency loss computed with the estimated non-rigid body motion, and it achieves better generalization and performance on in-the-wild data.

Previous methods for estimating detailed human depth often require supervised training with 'ground truth' depth data. This paper presents a self-supervised method that can be trained on YouTube videos without known depth, which makes training data collection simple and improves the generalization of the learned network. The self-supervised learning is achieved by minimizing a photo-consistency loss, which is evaluated between a video frame and its neighboring frames warped according to the estimated depth and the 3D non-rigid motion of the human body. To solve for this non-rigid motion, we first estimate a rough SMPL model at each video frame and compute the non-rigid body motion accordingly, which enables self-supervised learning on estimating the shape details. Experiments demonstrate that our method enjoys better generalization and performs much better on data in the wild.
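
The photo-consistency idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a pinhole camera with known intrinsics `K`, a per-pixel 3D displacement field `motion` standing in for the SMPL-derived non-rigid body motion, an L1 color penalty, and a foreground `mask`. All function and parameter names here are hypothetical, and the paper's actual loss may differ in its penalty, sampling, and occlusion handling.

```python
import numpy as np

def backproject(depth, K):
    """Lift each pixel (u, v) with depth d to the 3D point d * K^-1 [u, v, 1]^T."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # (h, w, 3)
    rays = pix @ np.linalg.inv(K).T                    # (h, w, 3)
    return rays * depth[..., None]

def project(points, K):
    """Project 3D camera-space points to pixel coordinates (u, v)."""
    proj = points @ K.T
    return proj[..., :2] / proj[..., 2:3]

def bilinear_sample(image, uv):
    """Sample an (h, w, c) image at float pixel coordinates, clamped to the border."""
    h, w = image.shape[:2]
    u = np.clip(uv[..., 0], 0, w - 1)
    v = np.clip(uv[..., 1], 0, h - 1)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, w - 1)
    v1 = np.minimum(v0 + 1, h - 1)
    au = (u - u0)[..., None]
    av = (v - v0)[..., None]
    top = (1 - au) * image[v0, u0] + au * image[v0, u1]
    bottom = (1 - au) * image[v1, u0] + au * image[v1, u1]
    return (1 - av) * top + av * bottom

def photo_consistency_loss(frame, neighbor, depth, motion, K, mask):
    """Mean L1 color difference between `frame` and `neighbor` warped into it.

    depth:  (h, w) estimated depth for `frame`
    motion: (h, w, 3) assumed 3D displacement of each surface point from
            `frame` to `neighbor` (a stand-in for SMPL-derived motion)
    mask:   (h, w) boolean foreground (human) mask
    """
    h, w = depth.shape
    points = backproject(depth, K) + motion   # surface points at the neighbor's time
    uv = project(points, K)                   # where they land in the neighbor image
    warped = bilinear_sample(neighbor, uv)
    inside = ((uv[..., 0] >= 0) & (uv[..., 0] <= w - 1) &
              (uv[..., 1] >= 0) & (uv[..., 1] <= h - 1))
    valid = mask & inside & (points[..., 2] > 0)
    if not valid.any():
        return 0.0
    return float(np.abs(warped - frame)[valid].mean())
```

In a training setup, `depth` would come from the network being trained and the loss would be written in a differentiable framework so that gradients flow back through the warp. The NumPy version only shows the geometry: a correct depth and motion make the warped neighbor match the frame, so the loss is small.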

Code Implementations: 1 repo
