CVOct 11, 2022Code
ACRNet: Attention Cube Regression Network for Multi-view Real-time 3D Human Pose Estimation in TelemedicineBoce Hu, Chenfei Zhu, Xupeng Ai et al.
Human pose estimation (HPE) for 3D skeleton reconstruction in telemedicine has long received attention. Although the development of deep learning has made HPE methods in telemedicine simpler and easier to use, addressing low accuracy and high latency remains a big challenge. In this paper, we propose a novel multi-view Attention Cube Regression Network (ACRNet), which regresses the 3D position of joints in real time by aggregating informative attention points on each cube surface. More specially, a cube whose each surface contains uniformly distributed attention points with specific coordinate values is first created to wrap the target from the main view. Then, our network regresses the 3D position of each joint by summing and averaging the coordinates of attention points on each surface after being weighted. To verify our method, we first tested ACRNet on the open-source ITOP dataset; meanwhile, we collected a new multi-view upper body movement dataset (UBM) on the trunk support trainer (TruST) to validate the capability of our model in real rehabilitation scenarios. Experimental results demonstrate the superiority of ACRNet compared with other state-of-the-art methods. We also validate the efficacy of each module in ACRNet. Furthermore, Our work analyzes the performance of ACRNet under the medical monitoring indicator. Because of the high accuracy and running speed, our model is suitable for real-time telemedicine settings. The source code is available at https://github.com/BoceHu/ACRNet
ROFeb 3, 2025
Coarse-to-Fine 3D Keyframe TransporterXupeng Zhu, David Klee, Dian Wang et al.
Recent advances in Keyframe Imitation Learning (IL) have enabled learning-based agents to solve a diverse range of manipulation tasks. However, most approaches ignore the rich symmetries in the problem setting and, as a consequence, are sample-inefficient. This work identifies and utilizes the bi-equivariant symmetry within Keyframe IL to design a policy that generalizes to transformations of both the workspace and the objects grasped by the gripper. We make two main contributions: First, we analyze the bi-equivariance properties of the keyframe action scheme and propose a Keyframe Transporter derived from the Transporter Networks, which evaluates actions using cross-correlation between the features of the grasped object and the features of the scene. Second, we propose a computationally efficient coarse-to-fine SE(3) action evaluation scheme for reasoning the intertwined translation and rotation action. The resulting method outperforms strong Keyframe IL baselines by an average of >10% on a wide range of simulation tasks, and by an average of 55% in 4 physical experiments.
ROOct 24, 2025
Generalizable Hierarchical Skill Learning via Object-Centric RepresentationHaibo Zhao, Yu Qi, Boce Hu et al.
We present Generalizable Hierarchical Skill Learning (GSL), a novel framework for hierarchical policy learning that significantly improves policy generalization and sample efficiency in robot manipulation. One core idea of GSL is to use object-centric skills as an interface that bridges the high-level vision-language model and the low-level visual-motor policy. Specifically, GSL decomposes demonstrations into transferable and object-canonicalized skill primitives using foundation models, ensuring efficient low-level skill learning in the object frame. At test time, the skill-object pairs predicted by the high-level agent are fed to the low-level module, where the inferred canonical actions are mapped back to the world frame for execution. This structured yet flexible design leads to substantial improvements in sample efficiency and generalization of our method across unseen spatial arrangements, object appearances, and task compositions. In simulation, GSL trained with only 3 demonstrations per task outperforms baselines trained with 30 times more data by 15.5 percent on unseen tasks. In real-world experiments, GSL also surpasses the baseline trained with 10 times more data.