CV · Jul 20, 2022

OTPose: Occlusion-Aware Transformer for Pose Estimation in Sparsely-Labeled Videos

arXiv:2207.09725v2 · 16 citations · h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses pose estimation in videos for computer vision applications, offering a solution that reduces annotation labor and handles common challenges like occlusion, though it is incremental in its approach.

The paper tackles multi-human pose estimation in videos with sparse annotations, occlusion, and motion blur by proposing an occlusion-aware transformer method, and it reports state-of-the-art results on the PoseTrack2017 and PoseTrack2018 datasets.

Although many approaches for multi-human pose estimation in videos have shown profound results, they require densely annotated data, which entails excessive manual labor. Furthermore, occlusion and motion blur inevitably lead to poor estimation performance. To address these problems, we propose a method that leverages an attention mask for occluded joints and encodes temporal dependency between frames using transformers. First, our framework composes different combinations of sparsely annotated frames that denote the track of the overall joint movement. We propose an occlusion attention mask from these combinations that enables encoding occlusion-aware heatmaps as a semi-supervised task. Second, the proposed temporal encoder employs a transformer architecture to effectively aggregate the temporal relationship and keypoint-wise attention from each time step and accurately refine the target frame's final pose estimation. We achieve state-of-the-art pose estimation results on the PoseTrack2017 and PoseTrack2018 datasets and demonstrate the robustness of our approach to occlusion and motion blur in sparsely annotated video data.
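To make the idea of an attention mask for occluded joints more concrete, here is a minimal NumPy sketch of masked attention across frames. It is not the authors' implementation: the function name `masked_temporal_attention`, the tensor shapes, the choice of the middle frame as the target, and the boolean `visible` mask are all assumptions made for illustration. The abstract does not specify how the mask is built or applied.

```python
import numpy as np


def masked_temporal_attention(features, visible):
    """Refine target-frame keypoint features by attending across frames.

    Hypothetical illustration only, not the OTPose architecture.

    features: (T, K, D) array, one D-dim feature per frame and keypoint.
    visible:  (T, K) boolean array, False where a joint is assumed occluded.
    Returns (refined, weights) with shapes (K, D) and (K, T).
    """
    T, K, D = features.shape
    target = T // 2  # assumption: the middle frame is the one being refined

    # Query: each keypoint's feature in the target frame.
    q = features[target]  # (K, D)

    # Scaled dot-product scores between the target keypoint and the same
    # keypoint in every frame: scores[k, t].
    scores = np.einsum("kd,tkd->kt", q, features) / np.sqrt(D)

    # Occlusion mask: occluded (frame, joint) entries get -inf so that
    # they receive zero weight after the softmax.
    mask = visible.T  # (K, T)
    scores = np.where(mask, scores, -np.inf)

    # If a joint is occluded in every frame, fall back to the target frame
    # so the softmax stays well defined.
    all_hidden = ~mask.any(axis=1)
    scores[all_hidden, target] = 0.0

    # Numerically stable softmax over the time axis.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Weighted sum of per-frame features for each keypoint.
    refined = np.einsum("kt,tkd->kd", weights, features)
    return refined, weights


rng = np.random.default_rng(0)
features = rng.normal(size=(3, 2, 4))   # 3 frames, 2 keypoints, 4-dim features
visible = np.array([[True, False],      # keypoint 1 occluded in frame 0
                    [True, True],
                    [True, True]])
refined, weights = masked_temporal_attention(features, visible)
```

In this sketch the occluded entry contributes nothing to the refined feature, while visible frames are weighted by similarity to the target frame. A full model would learn query, key, and value projections and would operate on heatmaps rather than raw feature vectors.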

Code Implementations: 1 repo