CV · Apr 12, 2023

Distilling Token-Pruned Pose Transformer for 2D Human Pose Estimation

arXiv:2304.05548v1 · 3 citations · h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses efficiency and accuracy trade-offs in human pose estimation for computer vision applications, representing an incremental improvement over prior token-pruning methods.

The paper tackles the performance degradation of token-pruned pose transformers in 2D human pose estimation by introducing a distillation method that uses a pre-trained TokenPose model to supervise the pruned model, resulting in improved PCK scores on the MPII dataset while maintaining reduced computational complexity.
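Since the summary reports results as PCK (Percentage of Correct Keypoints), a minimal sketch of that metric may help. It counts a predicted joint as correct when it lies within a fraction of a per-sample reference length from the ground truth; on MPII the reference is conventionally the head segment size with a fraction of 0.5 (PCKh@0.5). The function name `pck`, its argument layout, and the omission of joint-visibility masks are assumptions made for illustration, not the paper's evaluation code.

```python
import math


def pck(pred, gt, ref_lengths, threshold=0.5):
    """Fraction of joints predicted within threshold * ref_length of ground truth.

    pred, gt:     list of samples, each a list of (x, y) joint coordinates.
    ref_lengths:  one normalization length per sample (assumed here to be the
                  head segment size, as in PCKh).
    threshold:    fraction of the reference length that counts as a hit.
    """
    correct = 0
    total = 0
    for sample_pred, sample_gt, ref in zip(pred, gt, ref_lengths):
        for (px, py), (gx, gy) in zip(sample_pred, sample_gt):
            dist = math.hypot(px - gx, py - gy)  # Euclidean pixel distance
            if dist <= threshold * ref:
                correct += 1
            total += 1
    return correct / total if total else 0.0


# One sample, two joints, reference length 10 -> tolerance of 5 pixels.
pred = [[(0.0, 0.0), (10.0, 10.0)]]
gt = [[(0.0, 3.0), (10.0, 20.0)]]
score = pck(pred, gt, ref_lengths=[10.0])  # first joint hits, second misses
```

A real evaluation would also skip joints that are not annotated or not visible, which this sketch leaves out.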

Human pose estimation has seen widespread use of transformer models in recent years. Pose transformers benefit from the self-attention map, which captures the correlation between human joint tokens and the image. However, training such models is computationally expensive. The recent token-Pruned Pose Transformer (PPT) addresses this problem by pruning the background tokens of the image, which are usually less informative. Although this improves efficiency, PPT inevitably performs worse than TokenPose because tokens are discarded. To overcome this problem, we present a novel method called Distilling Pruned-Token Transformer for human pose estimation (DPPT). Our method leverages the output of a pre-trained TokenPose to supervise the learning process of PPT. We also establish connections between the internal structure of pose transformers and PPT, such as attention maps and joint features. Our experimental results on the MPII dataset show that DPPT can significantly improve PCK compared to previous PPT models while still reducing computational complexity.
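The abstract describes supervising the pruned student (PPT) with the output of a pre-trained teacher (TokenPose). One common way to express that kind of supervision is a weighted sum of a task loss against the ground truth and a distillation loss against the teacher output. The sketch below assumes mean squared error for both terms and a single mixing weight `alpha`; the abstract does not state the actual loss form, the weighting, or how the attention-map and joint-feature terms are combined, so every name and choice here is hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length flat sequences of floats."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student_out, teacher_out, target, alpha=0.5):
    """Blend ground-truth supervision with teacher supervision.

    student_out: flattened output of the pruned model (hypothetical layout).
    teacher_out: flattened output of the frozen, pre-trained teacher.
    target:      flattened ground-truth heatmap values.
    alpha:       assumed mixing weight; 0 ignores the teacher entirely.
    """
    task_term = mse(student_out, target)            # match the labels
    distill_term = mse(student_out, teacher_out)    # match the teacher
    return (1.0 - alpha) * task_term + alpha * distill_term


# Toy values: the student is 2 away from the target and 1 away from the teacher.
loss = distillation_loss([0.0, 0.0], [1.0, 1.0], [2.0, 2.0], alpha=0.5)
```

Under this assumed form, extra terms for intermediate signals such as attention maps or joint features would be added as further weighted `mse` terms over those tensors.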

