CV · Sep 24, 2019

Multi-Person 3D Human Pose Estimation from Monocular Images

arXiv:1909.10854v1 · 58 citations
Originality: Incremental advance
AI Analysis

This paper addresses 3D pose estimation in real-world settings for computer vision applications, but the advance is incremental: it builds on existing architectures such as Mask-RCNN and Hourglass.

The paper tackles multi-person 3D human pose estimation from monocular images by proposing HG-RCNN, a two-stage network that estimates 2D keypoints and lifts them to 3D without requiring multi-person 3D datasets, achieving state-of-the-art results on MuPoTS-3D.

Multi-person 3D human pose estimation from a single image is a challenging problem, especially in in-the-wild settings, due to the lack of 3D annotated data. We propose HG-RCNN, a Mask-RCNN based network that also leverages the benefits of the Hourglass architecture for multi-person 3D human pose estimation. We present a two-stage approach that first estimates the 2D keypoints in every Region of Interest (RoI) and then lifts the estimated keypoints to 3D. Finally, the estimated 3D poses are placed in camera coordinates using a weak-perspective projection assumption and joint optimization of focal length and root translations. The result is a simple and modular network for multi-person 3D human pose estimation that does not require any multi-person 3D pose dataset. Despite its simple formulation, HG-RCNN achieves state-of-the-art results on MuPoTS-3D while also approximating the 3D pose in the camera-coordinate system.
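To make the camera-placement step concrete, here is a minimal sketch of how a root-relative 3D pose can be positioned in camera coordinates under a weak-perspective assumption. It is not the paper's method: the abstract describes a joint optimization of focal length and root translations, whereas this sketch assumes the focal length is already known and solves a closed-form least-squares fit for a single person. The function names `project_weak` and `place_in_camera`, the units, and the convention that 2D keypoints are measured relative to the principal point are all assumptions made for illustration.

```python
def project_weak(pose3d, t, focal):
    """Weak-perspective projection: every joint shares the root depth tz.

    pose3d: list of (x, y, z) root-relative joints.
    t:      hypothetical root translation (tx, ty, tz) in camera coordinates.
    focal:  focal length in pixels.
    Returns (u, v) pixel coordinates relative to the principal point.
    """
    tx, ty, tz = t
    return [(focal * (x + tx) / tz, focal * (y + ty) / tz) for x, y, _ in pose3d]


def place_in_camera(pose3d, kps2d, focal):
    """Least-squares root translation under weak perspective.

    Model: u = s * (x + tx), v = s * (y + ty), with s = focal / tz.
    Assumes focal is known (the paper instead optimizes it jointly).
    """
    n = len(pose3d)
    if n == 0 or n != len(kps2d):
        raise ValueError("pose3d and kps2d must be non-empty and equally long")

    # Centroids of the 3D x/y coordinates and of the 2D keypoints.
    mx = sum(p[0] for p in pose3d) / n
    my = sum(p[1] for p in pose3d) / n
    mu = sum(k[0] for k in kps2d) / n
    mv = sum(k[1] for k in kps2d) / n

    # Scale s from a 1-D least-squares fit on centred coordinates.
    num = 0.0
    den = 0.0
    for (x, y, _), (u, v) in zip(pose3d, kps2d):
        num += (x - mx) * (u - mu) + (y - my) * (v - mv)
        den += (x - mx) ** 2 + (y - my) ** 2
    if den == 0.0 or num <= 0.0:
        raise ValueError("degenerate pose: cannot estimate scale")

    s = num / den
    tz = focal / s        # depth of the root joint
    tx = mu / s - mx      # lateral offsets follow from the centroids
    ty = mv / s - my
    return tx, ty, tz
```

With a known focal length, the scale `s` fixes the root depth directly. With an unknown focal length, only the ratio `focal / tz` is identifiable for a single person under this model, so some further constraint is needed to separate the two.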

