CV · Jul 22, 2021

DOVE: Learning Deformable 3D Objects by Watching Videos

arXiv:2107.10844v2 · 77 citations
AI Analysis

This work addresses 3D reconstruction of deformable objects in the wild, enabling computer vision and graphics applications without costly annotations.

The paper tackles learning deformable 3D objects from 2D images without explicit supervision. Its method, DOVE, learns textured 3D models from monocular videos and produces temporally consistent models that can be animated and rendered from arbitrary viewpoints.

Learning deformable 3D objects from 2D images is often an ill-posed problem. Existing methods rely on explicit supervision to establish multi-view correspondences, such as template shape models and keypoint annotations, which restricts their applicability on objects "in the wild". A more natural way of establishing correspondences is by watching videos of objects moving around. In this paper, we present DOVE, a method that learns textured 3D models of deformable object categories from monocular videos available online, without keypoint, viewpoint or template shape supervision. By resolving symmetry-induced pose ambiguities and leveraging temporal correspondences in videos, the model automatically learns to factor out 3D shape, articulated pose and texture from each individual RGB frame, and is ready for single-image inference at test time. In the experiments, we show that existing methods fail to learn sensible 3D shapes without additional keypoint or template supervision, whereas our method produces temporally consistent 3D models, which can be animated and rendered from arbitrary viewpoints.

