CV · AI · Oct 22, 2021

Deep Two-Stream Video Inference for Human Body Pose and Shape Estimation

arXiv:2110.11680v1 · 16 citations
Originality: Incremental advance
AI Analysis

This work addresses temporal inconsistency in video-based 3D human reconstruction, which is important for applications like animation and motion analysis, but it appears incremental as it builds on existing multi-modality approaches.

The paper tackles stable and accurate 3D human pose and shape estimation from RGB videos. It proposes DTS-VIBE, a framework that fuses RGB and optical flow in a two-stream transformer network, and the authors report that it outperforms state-of-the-art methods on the Human3.6M and 3DPW datasets.

Several video-based 3D pose and shape estimation algorithms have been proposed to resolve the temporal inconsistency of single-image-based methods. However, stable and accurate reconstruction remains challenging. In this paper, we propose a new framework, Deep Two-Stream Video Inference for Human Body Pose and Shape Estimation (DTS-VIBE), to generate 3D human pose and mesh from RGB videos. We reformulate the task as a multi-modality problem that fuses RGB and optical flow for more reliable estimation. To fully utilize both sensory modalities (RGB and optical flow), we train a two-stream temporal network based on a transformer to predict SMPL parameters. The supplementary modality, optical flow, helps to maintain temporal consistency by leveraging motion knowledge between two consecutive frames. The proposed algorithm is extensively evaluated on the Human3.6M and 3DPW datasets. The experimental results show that it outperforms other state-of-the-art methods by a significant margin.
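
To make the two-stream idea concrete, here is a minimal sketch of how per-frame RGB features and optical-flow features might be aligned in time and fused before regressing SMPL parameters. It is not the paper's implementation: the transformer is replaced by plain concatenation followed by a single linear layer, the feature sizes are tiny placeholders, and every function name (`align_flow_to_frames`, `fuse_streams`, `linear_head`, `split_smpl`) is hypothetical. The only fixed numbers are the standard SMPL sizes of 72 pose values and 10 shape coefficients.

```python
import random

SMPL_POSE_DIM = 72   # 24 joints x 3 axis-angle values
SMPL_SHAPE_DIM = 10  # shape coefficients (betas)
SMPL_DIM = SMPL_POSE_DIM + SMPL_SHAPE_DIM


def align_flow_to_frames(flow_feats):
    """Optical flow is computed between consecutive frames, so T frames
    yield T-1 flow features. Repeat the first flow feature (an assumption
    made here for simplicity) so that every frame has a motion feature."""
    if not flow_feats:
        raise ValueError("need at least one flow feature (two frames)")
    return [flow_feats[0]] + list(flow_feats)


def fuse_streams(rgb_feats, flow_feats):
    """Fuse the two streams by concatenating their per-frame features."""
    aligned = align_flow_to_frames(flow_feats)
    if len(aligned) != len(rgb_feats):
        raise ValueError("expected exactly one fewer flow feature than frames")
    return [list(r) + list(f) for r, f in zip(rgb_feats, aligned)]


def linear_head(fused, weights, bias):
    """Apply one linear layer per frame: out = W @ x + b."""
    return [
        [sum(w * x for w, x in zip(row, feat)) + b
         for row, b in zip(weights, bias)]
        for feat in fused
    ]


def split_smpl(params):
    """Split a per-frame parameter vector into (pose, shape)."""
    return params[:SMPL_POSE_DIM], params[SMPL_POSE_DIM:SMPL_DIM]


# Toy data: 4 frames, 8-dim RGB features, 4-dim flow features (placeholders).
rng = random.Random(0)
num_frames, rgb_dim, flow_dim = 4, 8, 4
rgb_feats = [[rng.random() for _ in range(rgb_dim)] for _ in range(num_frames)]
flow_feats = [[rng.random() for _ in range(flow_dim)]
              for _ in range(num_frames - 1)]

fused = fuse_streams(rgb_feats, flow_feats)

# Randomly initialised (untrained) regression weights, for shape checking only.
weights = [[rng.uniform(-0.1, 0.1) for _ in range(rgb_dim + flow_dim)]
           for _ in range(SMPL_DIM)]
bias = [0.0] * SMPL_DIM

smpl_params = linear_head(fused, weights, bias)
pose, shape = split_smpl(smpl_params[0])
```

In a real system the linear head would be replaced by a trained temporal model that looks across frames, which is where the flow stream's motion information between consecutive frames would contribute to temporal consistency.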
