CV · Mar 7, 2025

Multi-Grained Feature Pruning for Video-Based Human Pose Estimation

Zhigang Wang, Shaojing Fan, Zhenguang Liu, Zheqi Wu, Sifan Wu, Yingying Jiao

arXiv:2503.05365v1 · h-index: 8 · ICASSP

Originality: Highly original

AI Analysis

The paper addresses efficiency and accuracy bottlenecks in video-based human pose estimation, which underpins applications such as action recognition and motion capture.

It tackles two limitations of Transformer-based video pose estimation, redundant temporal information and weak fine-grained perception, by proposing a multi-scale resolution framework with dynamic token pruning. The authors report a 93.8% improvement in inference speed over the baseline and 87.4 mAP on PoseTrack2017.

Human pose estimation, with its broad applications in action recognition and motion capture, has experienced significant advancements. However, current Transformer-based methods for video pose estimation often face challenges in managing redundant temporal information and achieving fine-grained perception because they only focus on processing low-resolution features. To address these challenges, we propose a novel multi-scale resolution framework that encodes spatio-temporal representations at varying granularities and executes fine-grained perception compensation. Furthermore, we employ a density peaks clustering method to dynamically identify and prioritize tokens that offer important semantic information. This strategy effectively prunes redundant feature tokens, especially those arising from multi-frame features, thereby optimizing computational efficiency without sacrificing semantic richness. Empirically, it sets new benchmarks for both performance and efficiency on three large-scale datasets. Our method achieves a 93.8% improvement in inference speed compared to the baseline, while also enhancing pose estimation accuracy, reaching 87.4 mAP on the PoseTrack2017 dataset.
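
The token-pruning idea in the abstract can be illustrated with a minimal sketch of density peaks clustering applied to feature tokens. This is not the authors' implementation: the function names `density_peak_scores` and `prune_tokens`, the k-nearest-neighbour density estimate, and the choice to keep the top-scoring tokens are all assumptions made for illustration. Each token gets a local density `rho` (high when its neighbours are close) and a separation `delta` (distance to the nearest denser token); tokens with a high `rho * delta` act as cluster centres and are kept, while the rest are treated as redundant.

```python
import numpy as np

def density_peak_scores(tokens, k=4):
    """Return a density-peak score (rho * delta) for each token of shape (N, D).

    Assumes N > k. The kNN-based density is one common variant, chosen here
    for illustration; the paper may define density differently.
    """
    # Pairwise Euclidean distances between tokens, shape (N, N).
    diff = tokens[:, None, :] - tokens[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Local density rho: large when the k nearest neighbours are close.
    # Column 0 of the sorted distances is the token itself, so skip it.
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    rho = np.exp(-(knn ** 2).mean(axis=1))

    # delta: distance to the nearest token with strictly higher density.
    higher = rho[None, :] > rho[:, None]  # higher[i, j] is True if rho_j > rho_i
    delta = np.where(higher, dist, np.inf).min(axis=1)
    # The densest token(s) have no denser neighbour; use the largest distance.
    delta[np.isinf(delta)] = dist.max()

    return rho * delta

def prune_tokens(tokens, keep, k=4):
    """Keep the `keep` tokens with the highest density-peak scores."""
    scores = density_peak_scores(tokens, k)
    kept = np.sort(np.argsort(-scores)[:keep])  # preserve original token order
    return tokens[kept], kept
```

Calling `prune_tokens(tokens, keep=32)` on an `(N, D)` array of multi-frame tokens would return the 32 retained tokens and their indices. In a real model the discarded tokens might instead be merged into their nearest kept centre rather than dropped; that design choice is not specified in the abstract.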
