CV · Sep 6, 2024
Dense Hand-Object(HO) GraspNet with Full Grasping Taxonomy and Dynamics
Woojin Cho, Jihyun Lee, Minjae Yi et al.
Existing datasets for 3D hand-object interaction are limited in data cardinality, in the variation of interaction scenarios, or in annotation quality. In this work, we present a comprehensive new training dataset for hand-object interaction called HOGraspNet. It is the only real dataset that captures full grasp taxonomies, providing grasp annotation and wide intraclass variations. Using grasp taxonomies as atomic actions, their combinations in space and time can represent complex hand activities around objects. We select 22 rigid objects from the YCB dataset and 8 other compound objects using shape and size taxonomies, ensuring coverage of all hand grasp configurations. The dataset includes diverse hand shapes from 99 participants aged 10 to 74, continuous video frames, and 1.5M sparse RGB-Depth frames with annotations. It offers labels for 3D hand and object meshes, 3D keypoints, contact maps, and grasp labels. Accurate hand and object 3D meshes are obtained by fitting the hand parametric model (MANO) and the hand implicit function (HALO) to multi-view RGBD frames, with the MoCap system used only for objects. Note that HALO fitting does not require any parameter tuning, enabling scalability to the dataset's size with accuracy comparable to MANO. We evaluate HOGraspNet on two relevant tasks: grasp classification and 3D hand pose estimation. The results show performance variations based on grasp type and object class, indicating the potential importance of the interaction space captured by our dataset. The provided data aims at learning universal shape priors or foundation models for 3D hand-object interaction. Our dataset and code are available at https://hograspnet2024.github.io/.
CV · Mar 9
Int3DNet: Scene-Motion Cross Attention Network for 3D Intention Prediction in Mixed Reality
Taewook Ha, Woojin Cho, Dooyoung Kim et al.
We propose Int3DNet, a scene-aware network that predicts 3D intention areas directly from scene geometry and head-hand motion cues, enabling robust human intention prediction without explicit object-level perception. In Mixed Reality (MR), intention prediction is critical because it lets the system anticipate user actions and respond proactively, reducing interaction delays and ensuring seamless user experiences. Our method employs a cross-attention fusion of sparse motion cues and scene point clouds, offering a novel approach that directly interprets the user's spatial intention within the scene. We evaluated Int3DNet on the MoGaze and CIRCLE datasets, which are public datasets for full-body human-scene interactions, showing consistent performance across time horizons of up to 1500 ms and outperforming the baselines, even in diverse and unseen scenes. Moreover, we demonstrate the usability of the proposed method through efficient visual question answering (VQA) based on intention areas. Int3DNet provides reliable 3D intention areas derived from head-hand motion and scene geometry, thus enabling seamless interaction between humans and MR systems through proactive processing of intention areas.
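The cross-attention fusion mentioned in the abstract can be illustrated with a minimal sketch of generic scaled dot-product cross-attention. This is not the Int3DNet architecture: the choice of motion tokens as queries and scene-point features as keys and values, the function names, and the toy feature sizes are all assumptions made for illustration, and learned projections are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (illustrative only).

    queries: motion-cue feature vectors (assumed role), each of length d
    keys:    scene-point feature vectors (assumed role), each of length d
    values:  scene-point value vectors, one per key
    Returns (fused, weights): one fused vector and one weight row per query.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    fused, weights = [], []
    for q in queries:
        # Similarity of this motion token to every scene point.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        # Weighted sum of scene-point values.
        out = [sum(wj * v[c] for wj, v in zip(w, values))
               for c in range(len(values[0]))]
        fused.append(out)
        weights.append(w)
    return fused, weights


# Toy example: one motion token attending over two scene points.
motion = [[1.0, 0.0]]
scene_keys = [[1.0, 0.0], [0.0, 1.0]]
scene_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
fused, weights = cross_attention(motion, scene_keys, scene_values)
```

In this toy case the motion token aligns with the first scene point, so that point receives the larger attention weight. A real model would add learned query, key, and value projections and multiple heads.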
HC · Mar 8
Task Breakpoint Generation using Origin-Centric Graph in Virtual Reality Recordings for Adaptive Playback
Selin Choi, Dooyoung Kim, Taewook Ha et al.
We propose a method for generating task breakpoints based on an Origin-Centric Graph (OCG) to segment goal-oriented activity recordings into task units for adaptive playback in Virtual Reality (VR) environments. With the development of Augmented Reality (AR)/VR head-mounted displays (HMDs), research on adaptive tutorials and authoring tools has become active, but existing task segmentation methods mainly rely on manual annotation or are restricted to 2D video, which limits their applicability to 3D VR contexts. In our approach, assembly scenarios with clearly defined task boundaries are recorded using a structured spatio-temporal scene graph (STSG), and the OCG is employed to track changes in the central object and the formation of new groups, thereby generating task breakpoints automatically. A user study collected user-perceived task breakpoints to establish ground truth (GT); comparison with the algorithm-detected breakpoints demonstrated high agreement and confirmed that the method is accurate enough to support adaptive playback. The proposed task segmentation method provides a foundation for dynamically adjusting VR playback according to user proficiency and progress, with potential for extension into automatic timeline segmentation systems for diverse VR recordings.
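The breakpoint rule described in the abstract (a change in the central object, or the formation of a new group) can be sketched as follows. This is a simplified illustration, not the authors' OCG algorithm: the per-frame edge lists, the use of highest degree to pick the central object, and the definition of a new group as a connected component sharing no object with any group in the previous frame are all assumptions.

```python
from collections import defaultdict


def central_object(edges):
    """Assumed heuristic: the object with the most connections (ties by name)."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    if not degree:
        return None
    return min(degree, key=lambda name: (-degree[name], name))


def groups(edges):
    """Connected components of the attachment graph, as frozensets."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, found = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        found.add(frozenset(component))
    return found


def task_breakpoints(frames):
    """Return frame indices where the central object changes or a new group forms.

    frames: list of edge lists, one per recorded frame (hypothetical format).
    Frame 0 is treated as the baseline and never yields a breakpoint.
    """
    breakpoints = []
    prev_center, prev_groups = None, set()
    for t, edges in enumerate(frames):
        center, current = central_object(edges), groups(edges)
        # A group is "new" if it shares no object with any previous group.
        formed = any(all(g.isdisjoint(p) for p in prev_groups) for g in current)
        if t > 0 and (center != prev_center or formed):
            breakpoints.append(t)
        prev_center, prev_groups = center, current
    return breakpoints


# Toy assembly recording (object names are made up for illustration).
recording = [
    [],
    [("base", "leg1")],
    [("base", "leg1"), ("base", "leg2")],
    [("base", "leg1"), ("base", "leg2"), ("top", "knob")],
    [("base", "leg1"), ("base", "leg2"), ("top", "knob"),
     ("top", "base"), ("top", "lid"), ("top", "handle")],
]
detected = task_breakpoints(recording)
```

In the toy recording, breakpoints are detected when the first group forms, when a second independent group appears, and when the most-connected object shifts from the base to the top. Adding a second leg to the existing group does not trigger a breakpoint.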