CVSep 1, 2025

An End-to-End Framework for Video Multi-Person Pose Estimation

arXiv:2509.01095v1
Originality Incremental advance
AI Analysis

This work addresses inefficiencies in video pose estimation for applications like surveillance or sports analysis, though it is incremental as it builds on existing Transformer-based methods.

The paper tackles the problem of video-based multi-person pose estimation by proposing an end-to-end framework that integrates spatial and temporal dimensions, outperforming two-stage models and improving inference efficiency by 300% on the Posetrack dataset.

Video-based human pose estimation models aim to address scenarios that cannot be effectively solved by static image models such as motion blur, out-of-focus and occlusion. Most existing approaches consist of two stages: detecting human instances in each image frame and then using a temporal model for single-person pose estimation. This approach separates the spatial and temporal dimensions and cannot capture the global spatio-temporal context between spatial instances for end-to-end optimization. In addition, it relies on separate detectors and complex post-processing such as RoI cropping and NMS, which reduces the inference efficiency of the video scene. To address the above problems, we propose VEPE (Video End-to-End Pose Estimation), a simple and flexible framework for end-to-end pose estimation in video. The framework utilizes three crucial spatio-temporal Transformer components: the Spatio-Temporal Pose Encoder (STPE), the Spatio-Temporal Deformable Memory Encoder (STDME), and the Spatio-Temporal Pose Decoder (STPD). These components are designed to effectively utilize temporal context for optimizing human body pose estimation. Furthermore, to reduce the mismatch problem during the cross-frame pose query matching process, we propose an instance consistency mechanism, which aims to enhance the consistency and discrepancy of the cross-frame instance query and realize the instance tracking function, which in turn accurately guides the pose query to perform cross-frame matching. Extensive experiments on the Posetrack dataset show that our approach outperforms most two-stage models and improves inference efficiency by 300%.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes