CV · Jun 16, 2025

DETRPose: Real-time end-to-end transformer model for multi-person pose estimation

arXiv:2506.13027v1 · 2 citations · h-index: 1 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient pose estimation in computer vision and virtual reality applications, though it appears incremental because it builds on existing transformer architectures.

The paper tackles the lack of real-time transformer-based models for multi-person pose estimation by introducing a family of models that train in 5 to 10 times fewer epochs than state-of-the-art alternatives, reach competitive inference times, and often use significantly fewer parameters.

Multi-person pose estimation (MPPE) estimates keypoints for all individuals present in an image. MPPE is a fundamental task for several applications in computer vision and virtual reality. Unfortunately, there are currently no transformer-based models that can perform MPPE in real time. The paper presents a family of transformer-based models capable of performing multi-person 2D pose estimation in real time. Our approach utilizes a modified decoder architecture and keypoint similarity metrics to generate both positive and negative queries, thereby enhancing the quality of the selected queries within the architecture. Compared to state-of-the-art models, our proposed models train much faster, using 5 to 10 times fewer epochs, and achieve competitive inference times without requiring quantization libraries to speed up the model. Furthermore, our proposed models match or outperform alternative models, often using significantly fewer parameters.
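The abstract does not say which keypoint similarity metric is used. As an illustration only, the sketch below computes the COCO-style Object Keypoint Similarity (OKS), a common choice for scoring a predicted pose against a ground-truth pose. The function name `keypoint_similarity`, its arguments, and the example values are hypothetical and are not taken from the DETRPose code.

```python
import math

def keypoint_similarity(pred, gt, visible, area, sigmas):
    """OKS-style similarity between one predicted and one ground-truth pose.

    pred, gt : lists of (x, y) keypoint coordinates in pixels
    visible  : per-keypoint visibility flags (> 0 means labelled)
    area     : ground-truth object area in pixels^2 (scale normaliser)
    sigmas   : per-keypoint falloff constants (assumed values here)
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visible, sigmas):
        if v <= 0:
            continue  # unlabelled keypoints do not contribute
        d2 = (px - gx) ** 2 + (py - gy) ** 2   # squared pixel distance
        k2 = (2.0 * sigma) ** 2                # per-keypoint tolerance
        total += math.exp(-d2 / (2.0 * area * k2))
        count += 1
    return total / count if count else 0.0

# Illustrative values only: three keypoints on an object of 100x100 pixels.
gt_pose = [(10.0, 10.0), (50.0, 20.0), (30.0, 80.0)]
near_pose = [(11.0, 10.0), (50.0, 22.0), (29.0, 80.0)]
far_pose = [(40.0, 40.0), (10.0, 70.0), (90.0, 10.0)]
flags = [2, 2, 1]
example_sigmas = [0.05, 0.05, 0.05]

sim_exact = keypoint_similarity(gt_pose, gt_pose, flags, 10000.0, example_sigmas)
sim_near = keypoint_similarity(near_pose, gt_pose, flags, 10000.0, example_sigmas)
sim_far = keypoint_similarity(far_pose, gt_pose, flags, 10000.0, example_sigmas)
```

A score near 1 indicates a close match and a score near 0 a poor one, so such a metric can in principle separate good pose candidates from bad ones. How DETRPose turns the score into positive and negative queries is not described in this summary.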

Code Implementations: 1 repo
