CV, AI, LG | Jun 9, 2022

Building Spatio-temporal Transformers for Egocentric 3D Pose Estimation

arXiv:2206.04785v1 | 4 citations | h-index: 42
Originality: Incremental advance
AI Analysis

This work targets accurate 3D pose estimation for applications such as virtual reality and robotics. It is an incremental advance because it builds on existing Transformer and feature-map methods.

The paper tackles egocentric 3D human pose estimation from images, which suffers from self-occlusion and fish-eye distortion. It proposes Ego-STAN, a spatio-temporal Transformer model that leverages past frames and feature map tokens. Compared to the state of the art, Ego-STAN achieves a 30.6% improvement in mean per-joint position error with 22% fewer parameters.
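For readers unfamiliar with the metric, mean per-joint position error (MPJPE) is conventionally the average Euclidean distance between predicted and ground-truth 3D joints. The sketch below computes it in plain Python and shows how a percentage improvement could be derived, assuming the reported figure is a relative reduction in error. The function names and the toy coordinates are illustrative assumptions, not values or code from the paper.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error: the average Euclidean distance
    between predicted and ground-truth 3D joints (same units as inputs)."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


def relative_improvement(baseline_error, new_error):
    """Percentage reduction in error relative to a baseline (assumed
    interpretation of an 'X% improvement' in MPJPE)."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Toy two-joint pose with made-up coordinates (not data from the paper).
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 3.0, 4.0), (1.0, 0.0, 0.0)]
error = mpjpe(pred, target)  # joint distances are 5.0 and 0.0
```

Because MPJPE is an error, a lower value is better, so an improvement corresponds to a positive value of `relative_improvement`.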

Egocentric 3D human pose estimation (HPE) from images is challenging due to severe self-occlusions and strong distortion introduced by the fish-eye view from the head-mounted camera. Although existing works use intermediate heatmap-based representations to counter distortion with some success, addressing self-occlusion remains an open problem. In this work, we leverage information from past frames to guide our self-attention-based 3D HPE procedure -- Ego-STAN. Specifically, we build a spatio-temporal Transformer model that attends to semantically rich convolutional neural network-based feature maps. We also propose feature map tokens: a new set of learnable parameters to attend to these feature maps. Finally, we demonstrate Ego-STAN's superior performance on the xR-EgoPose dataset where it achieves a 30.6% improvement on the overall mean per-joint position error, while leading to a 22% drop in parameters compared to the state-of-the-art.
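The idea of a learnable token attending to feature maps from several frames can be illustrated with a minimal sketch. It is not the Ego-STAN architecture: it assumes a single token, single-head scaled dot-product attention, no learned query/key/value projections, and no positional encodings, and it uses random numbers in place of CNN features and trained parameters. All names (`token_attend`, `flatten_frames`) and shapes are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def flatten_frames(frames):
    """Turn T feature maps of shape H x W x D into one sequence of
    T*H*W feature vectors, so attention spans both space and time."""
    return [vec for frame in frames for row in frame for vec in row]


def token_attend(token, features):
    """Scaled dot-product attention with the token as the only query.
    Returns the attention-weighted feature summary and the weights."""
    d = len(token)
    scores = [sum(q * k for q, k in zip(token, f)) / math.sqrt(d)
              for f in features]
    weights = softmax(scores)
    pooled = [sum(w * f[i] for w, f in zip(weights, features))
              for i in range(d)]
    return pooled, weights


random.seed(0)
T, H, W, D = 3, 2, 2, 4  # assumed toy sizes: frames, height, width, channels
# Random stand-ins for CNN feature maps of the current and past frames.
frames = [[[[random.gauss(0, 1) for _ in range(D)] for _ in range(W)]
           for _ in range(H)] for _ in range(T)]
# Random stand-in for a learnable token (it would be trained in practice).
token = [random.gauss(0, 1) for _ in range(D)]

pooled, weights = token_attend(token, flatten_frames(frames))
```

Here `weights` holds one attention weight per spatial location per frame, and `pooled` is the resulting D-dimensional summary that a downstream pose head could consume in a fuller model.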

