CV | Oct 18, 2021

TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation

arXiv:2110.09554v3 | 79 citations
AI Analysis

This work targets accuracy losses in multi-view 3D human pose estimation, which matters for applications like motion capture. The contribution is incremental: it builds on existing transformer and multi-view fusion methods.

The paper tackles occlusions and oblique viewing angles in multi-view 3D human pose estimation. It introduces TransFusion, a transformer framework that improves the individual 2D pose predictors directly by fusing information across views, and reports 25.8 mm MPJPE on Human 3.6M with 5M parameters.
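For readers unfamiliar with the metric, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth 3D joint positions, usually reported in millimetres. The following is a minimal sketch of that standard definition, not the paper's evaluation code; the function name `mpjpe` and the toy joint coordinates are illustrative assumptions.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: equal-length sequences of (x, y, z) joint positions,
    in the same units (millimetres here).
    """
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and the same length")
    # Average the Euclidean distance between matching joints.
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


# Toy data (assumed values, three joints only): errors of 5 mm, 0 mm, 12 mm.
gt = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (0.0, 200.0, 0.0)]
pred = [(3.0, 4.0, 0.0), (100.0, 0.0, 0.0), (0.0, 200.0, 12.0)]
error_mm = mpjpe(pred, gt)  # (5 + 0 + 12) / 3
```

Benchmark protocols typically average this quantity over all frames and joints of the test set, sometimes after aligning the root joint.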

Estimating the 2D human poses in each view is typically the first step in calibrated multi-view 3D pose estimation. But the performance of 2D pose detectors suffers from challenging situations such as occlusions and oblique viewing angles. To address these challenges, previous works derive point-to-point correspondences between different views from epipolar geometry and utilize the correspondences to merge prediction heatmaps or feature representations. Instead of post-prediction merge/calibration, here we introduce a transformer framework for multi-view 3D pose estimation, aiming at directly improving individual 2D predictors by integrating information from different views. Inspired by previous multi-modal transformers, we design a unified transformer architecture, named TransFusion, to fuse cues from both current views and neighboring views. Moreover, we propose the concept of epipolar field to encode 3D positional information into the transformer model. The 3D position encoding guided by the epipolar field provides an efficient way of encoding correspondences between pixels of different views. Experiments on Human 3.6M and Ski-Pose show that our method is more efficient and has consistent improvements compared to other fusion methods. Specifically, we achieve 25.8 mm MPJPE on Human 3.6M with only 5M parameters on 256 x 256 resolution.
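The point-to-point correspondences that the abstract attributes to previous works rest on the classical epipolar constraint: a pixel in one calibrated view maps to a line in a neighboring view, and the matching pixel must lie on that line. The sketch below illustrates only that textbook constraint, not the paper's epipolar field or its position encoding. The helper names and the fundamental matrix for an idealized rectified camera pair are assumptions made for illustration.

```python
import math


def epipolar_line(F, point):
    """Return line coefficients (a, b, c) in the second view for a pixel
    (u, v) in the first view, computed as l' = F @ [u, v, 1]."""
    u, v = point
    x = (u, v, 1.0)
    return tuple(sum(F[r][c] * x[c] for c in range(3)) for r in range(3))


def distance_to_line(line, point):
    """Perpendicular pixel distance from (u, v) to the line a*u + b*v + c = 0."""
    a, b, c = line
    u, v = point
    norm = math.hypot(a, b)
    if norm == 0.0:
        raise ValueError("degenerate epipolar line")
    return abs(a * u + b * v + c) / norm


# Assumed fundamental matrix for an idealized rectified stereo pair
# (pure horizontal translation), where epipolar lines are image rows.
F_RECTIFIED = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, -1.0],
    [0.0, 1.0, 0.0],
]

line = epipolar_line(F_RECTIFIED, (120.0, 80.0))  # row v = 80 in view 2
d_on = distance_to_line(line, (95.0, 80.0))       # candidate on the line
d_off = distance_to_line(line, (95.0, 86.0))      # candidate 6 px away
```

A small distance marks a pixel in the neighboring view as geometrically consistent with the query pixel. Fusion methods that rely on epipolar geometry use this kind of consistency to decide which heatmap or feature locations to merge.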

Code Implementations: 1 repo
