SSD-Poser: Avatar Pose Estimation with State Space Duality from Sparse Observations
This addresses the problem of unconstrained lower-body reconstruction for AR/VR applications, offering a lightweight and efficient solution with incremental improvements in speed and smoothness.
The paper tackles real-time full-body pose estimation from sparse head and hand signals in AR/VR, proposing SSD-Poser, which achieves exceptional accuracy and computational efficiency with outstanding inference speed compared to state-of-the-art methods.
The growing applications of AR/VR increase the demand for real-time full-body pose estimation from Head-Mounted Displays (HMDs). Although HMDs provide joint signals from the head and hands, reconstructing a full-body pose remains challenging due to the unconstrained lower body. Recent advancements often rely on conventional neural networks and generative models to improve performance in this task, such as Transformers and diffusion models. However, these approaches struggle to strike a balance between achieving precise pose reconstruction and maintaining fast inference speed. To overcome these challenges, a lightweight and efficient model, SSD-Poser, is designed for robust full-body motion estimation from sparse observations. SSD-Poser incorporates a well-designed hybrid encoder, State Space Attention Encoders, to adapt the state space duality to complex motion poses and enable real-time realistic pose reconstruction. Moreover, a Frequency-Aware Decoder is introduced to mitigate jitter caused by variable-frequency motion signals, remarkably enhancing the motion smoothness. Comprehensive experiments on the AMASS dataset demonstrate that SSD-Poser achieves exceptional accuracy and computational efficiency, showing outstanding inference efficiency compared to state-of-the-art methods.