CV · Mar 7, 2025

Multi-Grained Feature Pruning for Video-Based Human Pose Estimation

arXiv:2503.05365v1 · h-index: 8 · ICASSP
Originality: Highly original
AI Analysis

This paper addresses efficiency and accuracy bottlenecks in video-based human pose estimation, which underpins applications such as action recognition and motion capture.

It targets two weaknesses of Transformer-based video pose estimators: redundant temporal information and limited fine-grained perception. The proposed multi-scale resolution framework with dynamic token pruning achieves a 93.8% improvement in inference speed over the baseline and reaches 87.4 mAP on PoseTrack2017.

Human pose estimation, with its broad applications in action recognition and motion capture, has experienced significant advancements. However, current Transformer-based methods for video pose estimation often struggle to manage redundant temporal information and to achieve fine-grained perception, because they process only low-resolution features. To address these challenges, we propose a novel multi-scale resolution framework that encodes spatio-temporal representations at varying granularities and executes fine-grained perception compensation. Furthermore, we employ a density peaks clustering method to dynamically identify and prioritize tokens that offer important semantic information. This strategy effectively prunes redundant feature tokens, especially those arising from multi-frame features, thereby optimizing computational efficiency without sacrificing semantic richness. Empirically, our method sets new benchmarks for both performance and efficiency on three large-scale datasets. It achieves a 93.8% improvement in inference speed compared to the baseline, while also enhancing pose estimation accuracy, reaching 87.4 mAP on the PoseTrack2017 dataset.
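The token-pruning idea in the abstract can be illustrated with a minimal sketch of generic density peaks clustering applied to token selection. This is not the paper's implementation: the function names `dpc_scores` and `prune_tokens`, the k-nearest-neighbour density estimate, the `k_neighbors` parameter, and the fixed `keep` budget are all assumptions made for illustration. Each token gets a local density `rho` and a distance `delta` to the nearest denser token; tokens with a high `rho * delta` product are dense yet far from other dense tokens, so they are kept as representatives and the rest are pruned.

```python
import numpy as np

def dpc_scores(tokens, k_neighbors=5):
    """Density-peaks score (rho * delta) per token.

    tokens: array of shape (n, d) with n >= 2 (assumed).
    k_neighbors: hypothetical parameter controlling the density estimate.
    """
    n = tokens.shape[0]
    # Pairwise Euclidean distances between all tokens, shape (n, n).
    diff = tokens[:, None, :] - tokens[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Local density: high when the k nearest neighbours are close.
    k = min(k_neighbors, n - 1)
    nearest = np.sort(dist, axis=1)[:, 1:k + 1]  # column 0 is the self-distance
    rho = np.exp(-(nearest ** 2).mean(axis=1))

    # delta: distance to the nearest token with strictly higher density.
    higher = rho[None, :] > rho[:, None]  # higher[i, j] is True when rho[j] > rho[i]
    delta = np.where(higher, dist, np.inf).min(axis=1)
    # The densest token has no denser neighbour; use its largest distance instead.
    delta = np.where(np.isinf(delta), dist.max(axis=1), delta)

    return rho * delta

def prune_tokens(tokens, keep, k_neighbors=5):
    """Keep the `keep` highest-scoring tokens, preserving their original order."""
    scores = dpc_scores(tokens, k_neighbors)
    idx = np.sort(np.argsort(-scores)[:keep])
    return tokens[idx], idx

# Usage: prune 16 tokens of dimension 8 down to 4.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
kept, kept_idx = prune_tokens(tokens, keep=4)
```

A fixed `keep` budget is the simplest choice for a sketch; the abstract describes the selection as dynamic, so the actual method presumably adapts how many tokens survive.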
