CV · Nov 16, 2023

Improved TokenPose with Sparsity

arXiv:2311.09653v1 · h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses the problem of scaling transformer-based human pose estimation to high-resolution features. The contribution is incremental: it builds on the existing TokenPose approach.

The paper tackles the computational challenge of global attention in transformer-based human pose estimation by introducing sparsity in keypoint and visual token attention, achieving new state-of-the-art results on the MPII dataset with improved accuracy.

Over the past few years, the vision transformer and its variants have gained significance in human pose estimation. By treating image patches as tokens, transformers can capture global relationships, estimate the keypoint tokens by leveraging the visual tokens, and recognize the posture of the human body. Nevertheless, global attention is computationally demanding, which poses a challenge for scaling up transformer-based methods to high-resolution features. In this paper, we introduce sparsity in both keypoint token attention and visual token attention to improve human pose estimation. Experimental results on the MPII dataset demonstrate that our model achieves higher accuracy and confirm the feasibility of the method, achieving new state-of-the-art results. The idea can also serve as a reference for other transformer-based models.
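To make the cost argument in the abstract concrete, here is a minimal sketch of why restricting attention to a subset of visual tokens helps. It is not the paper's method: the abstract does not say how sparsity is chosen, so the pruning rule in `keep_top_visual_tokens` (rank each visual token by its largest dot product with any keypoint token) is a hypothetical stand-in, and all function names are illustrative. The token counts in the usage note are also assumed, apart from the 16 keypoints that MPII annotates.

```python
import math


def dot(a, b):
    """Dot product of two equal-length float vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention(queries, keys, values):
    """Plain scaled dot-product attention over lists of float vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [dot(q, k) / scale for k in keys]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))
        ])
    return outputs


def keep_top_visual_tokens(keypoint_tokens, visual_tokens, keep):
    """Hypothetical pruning rule (an assumption, not taken from the paper):
    keep the `keep` visual tokens with the largest dot product against any
    keypoint token, preserving their original order."""
    relevance = [
        max(dot(kp, vis) for kp in keypoint_tokens) for vis in visual_tokens
    ]
    ranked = sorted(
        range(len(visual_tokens)), key=lambda i: relevance[i], reverse=True
    )
    return [visual_tokens[i] for i in sorted(ranked[:keep])]


def pairwise_scores(num_tokens):
    """Number of attention scores in one global self-attention layer."""
    return num_tokens * num_tokens
```

Under these assumptions, 256 visual tokens plus 16 keypoint tokens give `pairwise_scores(272)`, that is 73,984 scores per layer, while keeping only 64 visual tokens gives `pairwise_scores(80)`, that is 6,400. The quadratic growth is what makes global attention costly at high resolution, whichever selection rule is actually used.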

