CV | Apr 10, 2022

Deep Non-rigid Structure-from-Motion: A Sequence-to-Sequence Translation Perspective

arXiv:2204.04730v2 | 15 citations | h-index: 48
AI Analysis

This paper addresses 3D reconstruction of deforming objects from video, a computer vision task. It represents an incremental improvement, achieved with a novel deep learning approach.

The paper tackles the Non-Rigid Structure-from-Motion (NRSfM) problem by modeling it as sequence-to-sequence translation: it reconstructs the whole 3D sequence from the 2D input sequence, and reports superior results on datasets such as Human3.6M, CMU Mocap, and InterHand.

Directly regressing the non-rigid shape and camera pose from the individual 2D frame is ill-suited to the Non-Rigid Structure-from-Motion (NRSfM) problem. This frame-by-frame 3D reconstruction pipeline overlooks the inherent spatial-temporal nature of NRSfM, i.e., reconstructing the whole 3D sequence from the input 2D sequence. In this paper, we propose to model deep NRSfM from a sequence-to-sequence translation perspective, where the input 2D frame sequence is taken as a whole to reconstruct the deforming 3D non-rigid shape sequence. First, we apply a shape-motion predictor to estimate the initial non-rigid shape and camera motion from a single frame. Then we propose a context modeling module to model camera motions and complex non-rigid shapes. To tackle the difficulty in enforcing the global structure constraint within the deep framework, we propose to impose the union-of-subspace structure by replacing the self-expressiveness layer with multi-head attention and delayed regularizers, which enables end-to-end batch-wise training. Experimental results across different datasets such as Human3.6M, CMU Mocap and InterHand prove the superiority of our framework.
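The abstract's idea of replacing a self-expressiveness layer with attention can be illustrated with a toy example. The sketch below is not the authors' implementation: it uses a single attention head with no learned projections, treats each frame as a short hand-made feature vector, and omits the delayed regularizers entirely. The function names `attention_coefficients` and `re_express` are hypothetical. It only shows how scaled dot-product attention yields a frame-by-frame coefficient matrix, so that each frame is re-expressed as a weighted combination of the frames in the sequence, with similar frames (those plausibly lying in the same subspace) receiving larger weights.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_coefficients(frames):
    """Return an F x F row-stochastic matrix of scaled dot-product weights.

    Assumption: queries and keys are the raw frame vectors (no learned
    projections), which is a simplification for illustration only.
    """
    scale = math.sqrt(len(frames[0]))
    coeffs = []
    for q in frames:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in frames]
        coeffs.append(softmax(scores))
    return coeffs


def re_express(frames, coeffs):
    """Rewrite each frame as a weighted combination of all frames."""
    dim = len(frames[0])
    return [
        [sum(w * f[j] for w, f in zip(row, frames)) for j in range(dim)]
        for row in coeffs
    ]


# Toy sequence: frames 0-1 are similar to each other, as are frames 2-3.
frames = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.1, 0.9],
]
coeffs = attention_coefficients(frames)
refined = re_express(frames, coeffs)
```

Because every row of `coeffs` is produced by a softmax, the weights are non-negative and sum to one, and the whole computation is differentiable and works on any batch of frames. That is one plausible reading of why an attention-style layer is easier to train end to end than a coefficient matrix tied to a fixed set of frames.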
