CV | Apr 10, 2022

Deep Non-rigid Structure-from-Motion: A Sequence-to-Sequence Translation Perspective

Hui Deng, Tong Zhang, Yuchao Dai, Jiawei Shi, Yiran Zhong, Hongdong Li

arXiv:2204.04730v28.815 citations | h-index: 48

Originality: Incremental advance

AI Analysis

This paper addresses 3D reconstruction of deforming objects from video, a core computer vision problem. It is an incremental advance whose contribution is a new deep learning formulation of that problem.

The paper casts Non-Rigid Structure-from-Motion (NRSfM) as a sequence-to-sequence translation problem: it reconstructs the whole 3D shape sequence from the 2D input sequence rather than frame by frame, and reports superior results on Human3.6M, CMU Mocap, and InterHand.

Directly regressing the non-rigid shape and camera pose from the individual 2D frame is ill-suited to the Non-Rigid Structure-from-Motion (NRSfM) problem. This frame-by-frame 3D reconstruction pipeline overlooks the inherent spatial-temporal nature of NRSfM, i.e., reconstructing the whole 3D sequence from the input 2D sequence. In this paper, we propose to model deep NRSfM from a sequence-to-sequence translation perspective, where the input 2D frame sequence is taken as a whole to reconstruct the deforming 3D non-rigid shape sequence. First, we apply a shape-motion predictor to estimate the initial non-rigid shape and camera motion from a single frame. Then we propose a context modeling module to model camera motions and complex non-rigid shapes. To tackle the difficulty in enforcing the global structure constraint within the deep framework, we propose to impose the union-of-subspace structure by replacing the self-expressiveness layer with multi-head attention and delayed regularizers, which enables end-to-end batch-wise training. Experimental results across different datasets such as Human3.6M, CMU Mocap and InterHand prove the superiority of our framework.
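The idea of letting attention stand in for a self-expressiveness layer can be illustrated with a toy sketch. Self-expressiveness means each frame's shape is written as a weighted combination of the other frames in the sequence; attention weights play a similar role because each output frame is a convex combination of all input frames. The code below is only an illustration under strong assumptions, not the authors' implementation: the function names `attention_head` and `multi_head_attention` are hypothetical, the learned query/key/value projections of a real multi-head attention layer are omitted (each head simply takes a slice of the feature vector), and the delayed regularizers mentioned in the abstract are not modeled.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(frames):
    """Scaled dot-product self-attention over a sequence of frames.

    frames: list of T feature vectors (lists of floats of equal length).
    Returns (weights, mixed): a T x T row-stochastic matrix and the
    re-expressed frames, where mixed[i] = sum_j weights[i][j] * frames[j].
    No learned projections are used (a simplification for this sketch).
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    weights = [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in frames])
        for q in frames
    ]
    mixed = [
        [sum(w * f[j] for w, f in zip(row, frames)) for j in range(dim)]
        for row in weights
    ]
    return weights, mixed


def multi_head_attention(frames, num_heads):
    """Split each frame's features into num_heads slices, attend per slice,
    and concatenate the per-head outputs back into full-size frames."""
    dim = len(frames[0])
    if dim % num_heads != 0:
        raise ValueError("feature size must be divisible by num_heads")
    step = dim // num_heads
    all_weights = []
    outputs = [[] for _ in frames]
    for h in range(num_heads):
        chunk = [f[h * step:(h + 1) * step] for f in frames]
        weights, mixed = attention_head(chunk)
        all_weights.append(weights)
        for out, part in zip(outputs, mixed):
            out.extend(part)
    return all_weights, outputs


# Toy sequence: 3 frames, each a flattened shape of 2 points x 3 coordinates.
toy_frames = [
    [0.0, 0.0, 1.0, 1.0, 0.0, 1.0],
    [0.1, 0.0, 1.0, 1.1, 0.0, 1.0],
    [0.0, 0.5, 0.9, 1.0, 0.5, 0.9],
]
toy_weights, toy_refined = multi_head_attention(toy_frames, num_heads=2)
```

Each head produces its own frame-to-frame affinity matrix, so different heads can group frames differently. This is one intuition for why multiple heads suit a union-of-subspaces assumption, though the paper's actual architecture and training losses are not reproduced here.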
