CV | Mar 29, 2022

On Triangulation as a Form of Self-Supervision for 3D Human Pose Estimation

arXiv:2203.15865v3 | 13 citations
Originality: Incremental advance
AI Analysis

This paper addresses the cost of annotating 3D poses in crowded scenes with a semi-supervised method. The advance is incremental but practical for real-world applications.

The paper tackles the problem of 3D human pose estimation from single images without ground-truth labels by using multi-view geometrical constraints through weighted differentiable triangulation as self-supervision, achieving effective results on datasets like Human3.6M and MPI-INF-3DHP, including in occluded scenes.

Supervised approaches to 3D pose estimation from single images are remarkably effective when labeled data is abundant. However, as the acquisition of ground-truth 3D labels is labor intensive and time consuming, recent attention has shifted towards semi- and weakly-supervised learning. Generating an effective form of supervision with few annotations still poses a major challenge in crowded scenes. In this paper we propose to impose multi-view geometrical constraints by means of a weighted differentiable triangulation and use it as a form of self-supervision when no labels are available. We therefore train a 2D pose estimator in such a way that its predictions correspond to the re-projection of the triangulated 3D pose, and train an auxiliary network on them to produce the final 3D poses. We complement the triangulation with a weighting mechanism that alleviates the impact of noisy predictions caused by self-occlusion or occlusion from other subjects. We demonstrate the effectiveness of our semi-supervised approach on the Human3.6M and MPI-INF-3DHP datasets, as well as on a new multi-view multi-person dataset that features occlusion.
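The weighted triangulation described in the abstract can be illustrated with a minimal sketch. It assumes a weighted direct linear transform (DLT) formulation, which is a common way to triangulate a joint from several calibrated views; the abstract does not specify the paper's exact formulation or how the weights are predicted. The function names `weighted_triangulate` and `project` are hypothetical, and NumPy is used only for illustration: training through this step would require an autodiff framework.

```python
import numpy as np


def weighted_triangulate(projections, points_2d, weights):
    """Triangulate one 3D joint from C views with a weighted DLT (assumed formulation).

    projections: (C, 3, 4) camera projection matrices
    points_2d:   (C, 2) detected 2D joint location in each view
    weights:     (C,) per-view confidence; 0 removes a view from the solve
    """
    P = np.asarray(projections, dtype=float)
    x = np.asarray(points_2d, dtype=float)
    w = np.asarray(weights, dtype=float)

    # Each view gives two linear constraints on the homogeneous 3D point X:
    #   (u * P[2] - P[0]) @ X = 0   and   (v * P[2] - P[1]) @ X = 0
    rows_u = x[:, 0:1] * P[:, 2, :] - P[:, 0, :]
    rows_v = x[:, 1:2] * P[:, 2, :] - P[:, 1, :]

    # Scaling a view's rows by its weight controls its influence on the solution.
    A = np.concatenate([w[:, None] * rows_u, w[:, None] * rows_v], axis=0)

    # The least-squares solution is the right singular vector that belongs
    # to the smallest singular value of A.
    _, _, vt = np.linalg.svd(A)
    X_h = vt[-1]
    return X_h[:3] / X_h[3]


def project(P, X):
    """Re-project a 3D point X with projection matrix P to pixel coordinates."""
    x_h = P @ np.append(X, 1.0)
    return x_h[:2] / x_h[2]


# Three toy cameras (identity intrinsics, translated along x and y).
cams = np.stack([
    np.hstack([np.eye(3), np.array([[0.0], [0.0], [0.0]])]),
    np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])]),
    np.hstack([np.eye(3), np.array([[0.0], [-1.0], [0.0]])]),
])
X_true = np.array([0.2, -0.1, 4.0])
obs = np.stack([project(P, X_true) for P in cams])

# Simulate an occluded view with a badly wrong 2D detection.
obs_noisy = obs.copy()
obs_noisy[2] += np.array([0.5, -0.5])

X_uniform = weighted_triangulate(cams, obs_noisy, np.ones(3))
X_weighted = weighted_triangulate(cams, obs_noisy, np.array([1.0, 1.0, 0.0]))
```

In this toy setup, giving the corrupted view zero weight recovers the true point from the two clean views, while uniform weights let the bad detection pull the estimate away. The self-supervision signal described in the abstract would then compare each view's 2D prediction against `project(P, X)` of the triangulated point.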
