CV · Apr 11, 2025

EMO-X: Efficient Multi-Person Pose and Shape Estimation in One-Stage

Haohang Jian, Jinlu Zhang, Junyi Wu, Zhigang Tu

arXiv:2504.08718v1 · 6.21 citations · h-index: 5

Originality: Incremental advance

AI Analysis

This work targets the high computational cost of multi-person human pose and shape estimation, which matters for applications such as animation and robotics. It is an incremental advance that combines existing techniques to gain efficiency.

The paper tackles the computational inefficiency of Transformer-based methods for multi-person expressive human pose and shape estimation. It proposes EMO-X, a one-stage model that integrates global context with skeleton-aware local features, and reports 69.8% less inference time than state-of-the-art methods while outperforming most of them in accuracy.

Expressive Human Pose and Shape Estimation (EHPS) aims to jointly estimate human pose, hand gesture, and facial expression from monocular images. Existing methods predominantly rely on Transformer-based architectures, which suffer from quadratic complexity in self-attention, leading to substantial computational overhead, especially in multi-person scenarios. Recently, Mamba has emerged as a promising alternative to Transformers due to its efficient global modeling capability. However, it remains limited in capturing fine-grained local dependencies, which are essential for precise EHPS. To address these issues, we propose EMO-X, the Efficient Multi-person One-stage model for multi-person EHPS. Specifically, we explore a Scan-based Global-Local Decoder (SGLD) that integrates global context with skeleton-aware local features to iteratively enhance human tokens. Our EMO-X leverages the superior global modeling capability of Mamba and designs a local bidirectional scan mechanism for skeleton-aware local refinement. Comprehensive experiments demonstrate that EMO-X strikes an excellent balance between efficiency and accuracy. Notably, it achieves a significant reduction in computational complexity, requiring 69.8% less inference time compared to state-of-the-art (SOTA) methods, while outperforming most of them in accuracy.
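The efficiency argument in the abstract (quadratic self-attention versus a linear-time scan, applied in both directions) can be illustrated with a toy sketch. This is not the authors' SGLD and not Mamba itself: it uses a fixed scalar recurrence with no input-dependent selection, and every name here (`linear_scan`, `bidirectional_scan`, the decay `a`, the input gain `b`) is a hypothetical chosen for illustration only.

```python
from typing import List


def linear_scan(xs: List[float], a: float = 0.5, b: float = 1.0) -> List[float]:
    """Toy recurrence h_t = a * h_{t-1} + b * x_t: one update per token."""
    h = 0.0
    out = []
    for x in xs:
        h = a * h + b * x
        out.append(h)
    return out


def bidirectional_scan(xs: List[float], a: float = 0.5, b: float = 1.0) -> List[float]:
    """Combine a forward and a backward pass so each token sees both sides."""
    fwd = linear_scan(xs, a, b)
    bwd = linear_scan(xs[::-1], a, b)[::-1]
    # Both passes include the token's own contribution b * x, so subtract it once.
    return [f + r - b * x for f, r, x in zip(fwd, bwd, xs)]


def pairwise_interactions(num_tokens: int) -> int:
    """Token pairs scored by full self-attention: grows as L^2."""
    return num_tokens * num_tokens


def scan_steps(num_tokens: int) -> int:
    """Recurrence updates for a forward plus a backward scan: grows as 2L."""
    return 2 * num_tokens


if __name__ == "__main__":
    # An impulse at the middle token spreads to both neighbours.
    print(bidirectional_scan([0.0, 1.0, 0.0]))
    for n in (10, 100, 1000):
        print(n, pairwise_interactions(n), scan_steps(n))
```

Under these assumptions, the scan's step count grows linearly with the number of tokens while the attention pair count grows quadratically. That is the scaling gap the abstract points to in multi-person scenes, although the reported 69.8% figure comes from the authors' experiments, not from this sketch.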
