RAPTR: Radar-based 3D Pose Estimation using Transformer
RAPTR addresses the high cost of fine-grained 3D labeling for indoor radar pose estimation, offering a more scalable approach for applications such as surveillance and human-computer interaction.
The paper tackles radar-based indoor 3D human pose estimation by proposing RAPTR, a weakly supervised method that uses only 3D bounding box and 2D keypoint labels. Compared with existing methods, it reduces joint position error by 34.3% on the HIBER dataset and 76.9% on the MMVR dataset.
Radar-based indoor 3D human pose estimation has typically relied on fine-grained 3D keypoint labels, which are costly to obtain, especially in complex indoor settings involving clutter, occlusions, or multiple people. In this paper, we propose \textbf{RAPTR} (RAdar Pose esTimation using tRansformer), which is trained under weak supervision using only 3D BBox and 2D keypoint labels that are considerably easier and more scalable to collect. RAPTR is characterized by a two-stage pose decoder architecture with pseudo-3D deformable attention that enhances pose and joint queries with multi-view radar features: a pose decoder estimates initial 3D poses with a 3D template loss designed to utilize the 3D BBox labels and mitigate depth ambiguities, and a joint decoder refines the initial poses with 2D keypoint labels and a 3D gravity loss. Evaluated on two indoor radar datasets, RAPTR outperforms existing methods, reducing joint position error by $34.3\%$ on HIBER and $76.9\%$ on MMVR. Our implementation is available at https://github.com/merlresearch/radar-pose-transformer.
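To make the reported metric concrete, the following is a minimal sketch of how a joint position error and its relative reduction could be computed. It assumes the error is the mean Euclidean distance between predicted and ground-truth 3D joints, which is a common convention but is not spelled out in this abstract. The function names `mean_joint_position_error` and `relative_reduction` and the toy coordinates are hypothetical and are not taken from the RAPTR implementation or from either dataset.

```python
import math


def mean_joint_position_error(pred_joints, gt_joints):
    """Mean Euclidean distance between matching 3D joints.

    Assumption: both inputs are equal-length sequences of (x, y, z) tuples
    expressed in the same coordinate frame and units.
    """
    if len(pred_joints) != len(gt_joints) or not pred_joints:
        raise ValueError("joint lists must be non-empty and of equal length")
    total = sum(math.dist(p, g) for p, g in zip(pred_joints, gt_joints))
    return total / len(pred_joints)


def relative_reduction(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Toy values for illustration only (not results from the paper).
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
error = mean_joint_position_error(pred, gt)  # per-joint distances are 3 and 4
reduction = relative_reduction(10.0, 5.0)    # halving the error
```

Under this reading, a reduction of $34.3\%$ means the new error is $65.7\%$ of the baseline error, not that the error dropped by 34.3 absolute units.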