SkelSplat: Robust Multi-view 3D Human Pose Estimation with Differentiable Gaussian Rendering
This work addresses the poor generalization of learning-based multi-view 3D human pose estimation, a problem that matters for applications such as augmented reality and human-robot interaction. Its approach improves generalization and robustness to occlusions, although the contribution is incremental in that it adapts Gaussian Splatting to a new domain.
The paper tackles poor generalization in multi-view 3D human pose estimation by proposing SkelSplat, a framework that uses differentiable Gaussian rendering to model human pose as a skeleton of 3D Gaussians. SkelSplat reduces cross-dataset error by up to 47.8% compared to learning-based methods and remains robust to occlusions without fine-tuning.
Accurate 3D human pose estimation is fundamental for applications such as augmented reality and human-robot interaction. State-of-the-art multi-view methods learn to fuse predictions across views by training on large annotated datasets, leading to poor generalization when the test scenario differs. To overcome these limitations, we propose SkelSplat, a novel framework for multi-view 3D human pose estimation based on differentiable Gaussian rendering. Human pose is modeled as a skeleton of 3D Gaussians, one per joint, optimized via differentiable rendering to enable seamless fusion of arbitrary camera views without 3D ground-truth supervision. Since Gaussian Splatting was originally designed for dense scene reconstruction, we propose a novel one-hot encoding scheme that enables independent optimization of human joints. SkelSplat outperforms approaches that do not rely on 3D ground truth on Human3.6M and CMU, while reducing the cross-dataset error by up to 47.8% compared to learning-based methods. Experiments on Human3.6M-Occ and Occlusion-Person demonstrate robustness to occlusions, without scenario-specific fine-tuning. Our project page is available here: https://skelsplat.github.io.
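To make the idea of one Gaussian per joint with one-hot encoding more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a pinhole camera model, isotropic Gaussians with a fixed image-space width, and a mean-squared-error loss. The function names (project, render_one_hot, multiview_loss), the camera parameters, and the target images (synthesized here from a known pose purely for illustration) are all hypothetical. The optimization itself is omitted: in practice the loss would be minimized over the 3D joint positions with an automatic-differentiation framework.

```python
import numpy as np

def project(points_3d, K, R, t):
    """Pinhole projection of (J, 3) world points to (J, 2) pixel coordinates."""
    cam = points_3d @ R.T + t          # world frame -> camera frame
    uv = cam @ K.T                     # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]      # perspective divide

def render_one_hot(points_3d, K, R, t, height, width, sigma=2.0):
    """Splat each joint as an isotropic 2D Gaussian into its own channel.

    Returns an array of shape (J, height, width). Channel j contains only
    joint j, so joints never blend into each other. This is a simplified
    stand-in for the one-hot encoding idea, with an assumed fixed sigma.
    """
    uv = project(points_3d, K, R, t)
    ys, xs = np.mgrid[0:height, 0:width]
    dx = xs[None] - uv[:, 0, None, None]
    dy = ys[None] - uv[:, 1, None, None]
    return np.exp(-(dx ** 2 + dy ** 2) / (2.0 * sigma ** 2))

def multiview_loss(points_3d, cameras, targets, height, width):
    """Sum over views of the mean squared error between rendered and target channels."""
    return sum(
        np.mean((render_one_hot(points_3d, K, R, t, height, width) - target) ** 2)
        for (K, R, t), target in zip(cameras, targets)
    )

# Hypothetical setup: two 32x32 cameras looking at a three-joint "skeleton".
H = W = 32
K = np.array([[50.0, 0.0, 16.0], [0.0, 50.0, 16.0], [0.0, 0.0, 1.0]])
R_front = np.eye(3)
R_side = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]])  # 90 degrees about y
t = np.array([0.0, 0.0, 5.0])
cameras = [(K, R_front, t), (K, R_side, t)]

true_joints = np.array([[0.0, 0.0, 0.0], [0.5, 0.2, 0.1], [-0.4, -0.3, 0.2]])
targets = [render_one_hot(true_joints, Kc, Rc, tc, H, W) for Kc, Rc, tc in cameras]

# The loss vanishes at the true pose and grows when the joints are displaced.
loss_true = multiview_loss(true_joints, cameras, targets, H, W)
loss_perturbed = multiview_loss(true_joints + 0.2, cameras, targets, H, W)
print(loss_true, loss_perturbed)
```

Because each joint is rendered into a separate channel, the loss term of one joint does not depend on the position of any other joint, which is what allows the joints to be optimized independently across an arbitrary number of views.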