CV · Sep 15, 2021

Learning Dynamical Human-Joint Affinity for 3D Pose Estimation in Videos

arXiv:2109.07353v1 · 36 citations
Originality: Incremental advance
AI Analysis

This work addresses depth ambiguity and motion uncertainty in 3D pose estimation for video analysis, representing an incremental improvement over existing graph-based methods.

The paper tackles 3D human pose estimation in videos by proposing a Dynamical Graph Network (DG-Net) that dynamically identifies human-joint affinity to adapt to complex spatio-temporal variations. It outperforms recent state-of-the-art approaches on benchmarks such as Human3.6M while using fewer input frames and a smaller model.

Graph Convolution Networks (GCNs) have been successfully used for 3D human pose estimation in videos. However, they are often built on a fixed human-joint affinity derived from the human skeleton. This may reduce the capacity of a GCN to adapt to complex spatio-temporal pose variations in videos. To alleviate this problem, we propose a novel Dynamical Graph Network (DG-Net), which dynamically identifies human-joint affinity and estimates 3D pose by adaptively learning spatial/temporal joint relations from videos. Unlike traditional graph convolution, we introduce Dynamical Spatial/Temporal Graph convolution (DSG/DTG) to discover spatial/temporal human-joint affinity for each video exemplar, depending on the spatial distance/temporal movement similarity between human joints in that video. Hence, DSG/DTG can effectively identify which joints are spatially closer and/or have consistent motion, reducing depth ambiguity and/or motion uncertainty when lifting 2D pose to 3D pose. We conduct extensive experiments on three popular benchmarks, i.e., Human3.6M, HumanEva-I, and MPI-INF-3DHP, where DG-Net outperforms a number of recent SOTA approaches with fewer input frames and a smaller model size.
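The idea of a per-video, data-dependent joint affinity can be illustrated with a small sketch. This is not the paper's DSG/DTG formulation, which the abstract does not spell out. It assumes, purely for illustration, that spatial affinity is a softmax over negative pairwise 2D distances and that temporal affinity is a softmax over the cosine similarity of frame-to-frame joint displacements. The function names (`spatial_affinity`, `temporal_affinity`, `aggregate`) and the three-joint toy data are hypothetical, and the learnable weights of a real graph convolution are omitted.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def spatial_affinity(joints):
    """Assumed form: joints that are closer in 2D receive larger weights."""
    return [softmax([-math.dist(p, q) for q in joints]) for p in joints]

def temporal_affinity(prev_joints, next_joints):
    """Assumed form: joints with similar frame-to-frame motion receive larger weights."""
    motion = [(b[0] - a[0], b[1] - a[1]) for a, b in zip(prev_joints, next_joints)]

    def cosine(u, v):
        nu, nv = math.hypot(*u), math.hypot(*v)
        if nu == 0.0 or nv == 0.0:
            return 0.0  # a static joint is treated as uncorrelated with any motion
        return (u[0] * v[0] + u[1] * v[1]) / (nu * nv)

    return [softmax([cosine(u, v) for v in motion]) for u in motion]

def aggregate(affinity, features):
    """One aggregation step: each joint becomes an affinity-weighted mean of all joints."""
    dim = len(features[0])
    return [[sum(w * f[d] for w, f in zip(row, features)) for d in range(dim)]
            for row in affinity]

# Toy 2D poses with three joints in two consecutive frames (made-up values).
frame_t = [(0.0, 0.0), (0.1, 0.0), (2.0, 2.0)]
frame_t1 = [(0.1, 0.0), (0.2, 0.0), (2.0, 2.1)]

A_s = spatial_affinity(frame_t)             # recomputed for every input, not fixed by the skeleton
A_t = temporal_affinity(frame_t, frame_t1)  # joints 0 and 1 move together, joint 2 does not
mixed = aggregate(A_s, frame_t)
```

In this toy example the affinity matrices are recomputed from the input itself, which is the contrast with a fixed skeleton adjacency that the abstract draws. Joints 0 and 1 are both close together and moving in the same direction, so each weights the other more heavily than joint 2 under either affinity.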

