CV AI | Feb 3

JRDB-Pose3D: A Multi-person 3D Human Pose and Shape Estimation Dataset for Robotics

Sandika Biswas, Kian Izadpanah, Hamid Rezatofighi

arXiv:2602.03064v1 | 1.5 | h-index: 5

Originality: Synthesis-oriented

AI Analysis

This dataset addresses the limited real-world applicability of existing 3D human pose datasets to robotics applications such as autonomous driving and human-robot interaction, though the contribution is incremental because it builds on the existing JRDB dataset.

The paper tackles the lack of multi-person 3D human pose datasets for real-world robotics by introducing JRDB-Pose3D, which provides SMPL-based pose annotations for 5-10 people per frame on average, with up to 35 individuals in complex indoor and outdoor scenes.

Real-world scenes are inherently crowded. Hence, estimating 3D poses of all nearby humans, tracking their movements over time, and understanding their activities within social and environmental contexts are essential for many applications, such as autonomous driving, robot perception, robot navigation, and human-robot interaction. However, most existing 3D human pose estimation datasets primarily focus on single-person scenes or are collected in controlled laboratory environments, which restricts their relevance to real-world applications. To bridge this gap, we introduce JRDB-Pose3D, which captures multi-human indoor and outdoor environments from a mobile robotic platform. JRDB-Pose3D provides rich 3D human pose annotations for such complex and dynamic scenes, including SMPL-based pose annotations with consistent body-shape parameters and track IDs for each individual over time. JRDB-Pose3D contains, on average, 5-10 human poses per frame, with some scenes featuring up to 35 individuals simultaneously. The proposed dataset presents unique challenges, including frequent occlusions, truncated bodies, and out-of-frame body parts, which closely reflect real-world environments. Moreover, JRDB-Pose3D inherits all available annotations from the JRDB dataset, such as 2D pose, information about social grouping, activities, and interactions, full-scene semantic masks with consistent human- and object-level tracking, and detailed annotations for each individual, such as age, gender, and race, making it a holistic dataset for a wide range of downstream perception and human-centric understanding tasks.
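
To make the annotation description concrete, here is a minimal sketch of how per-person records with SMPL parameters and track IDs might be represented and sanity-checked. It is not the dataset's actual file format or API: the `PersonAnnotation` class, its field names, and both helper functions are hypothetical, and the 72-value pose and 10-value shape sizes are those of the standard SMPL model, assumed here rather than taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class PersonAnnotation:
    """Hypothetical record for one person in one frame (not the official schema)."""
    frame: int
    track_id: int
    pose: Tuple[float, ...]   # standard SMPL: 24 joints x 3 axis-angle values = 72
    betas: Tuple[float, ...]  # standard SMPL: 10 body-shape coefficients


def people_per_frame(annos: Iterable[PersonAnnotation]) -> Dict[int, int]:
    """Count annotated people in each frame."""
    counts: Dict[int, int] = defaultdict(int)
    for a in annos:
        counts[a.frame] += 1
    return dict(counts)


def inconsistent_tracks(annos: Iterable[PersonAnnotation], tol: float = 1e-6) -> List[int]:
    """Return track IDs whose shape parameters differ from their first occurrence."""
    first_betas: Dict[int, Tuple[float, ...]] = {}
    bad = set()
    for a in annos:
        ref = first_betas.setdefault(a.track_id, a.betas)
        if len(ref) != len(a.betas) or any(abs(x - y) > tol for x, y in zip(ref, a.betas)):
            bad.add(a.track_id)
    return sorted(bad)


# Toy data: two people in frame 0, one of them again in frame 1.
zero_pose = (0.0,) * 72
annos = [
    PersonAnnotation(frame=0, track_id=1, pose=zero_pose, betas=(0.1,) * 10),
    PersonAnnotation(frame=0, track_id=2, pose=zero_pose, betas=(0.3,) * 10),
    PersonAnnotation(frame=1, track_id=1, pose=zero_pose, betas=(0.1,) * 10),
]
print(people_per_frame(annos))
print(inconsistent_tracks(annos))
```

A check like `inconsistent_tracks` reflects the abstract's statement that body-shape parameters stay consistent for each tracked individual, while `people_per_frame` corresponds to the reported crowd density per frame.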
