A Unified Transformer-Based Framework with Pretraining For Whole Body Grasping Motion Generation
This work addresses the challenge of generating stable and realistic human-object interactions for robotics and animation, though the contribution is incremental, since it builds on existing transformer and pretraining methods.
The paper tackles the problem of generating realistic whole-body grasping motions by introducing a transformer-based framework that addresses pose generation and motion infilling, achieving improved coherence, stability, and visual realism on the GRAB dataset compared to state-of-the-art baselines.
Accepted at ICIP 2025.
We present a novel transformer-based framework for whole-body grasping that addresses both pose generation and motion infilling, enabling realistic and stable object interactions. Our pipeline comprises three stages: Grasp Pose Generation for full-body grasp generation, Temporal Infilling for smooth motion continuity, and a LiftUp Transformer that refines downsampled joints back to high-resolution markers. To overcome the scarcity of hand-object interaction data, we introduce a data-efficient Generalized Pretraining stage on large, diverse motion datasets, yielding robust spatio-temporal representations transferable to grasping tasks. Experiments on the GRAB dataset show that our method outperforms state-of-the-art baselines in terms of coherence, stability, and visual realism. The modular design also supports easy adaptation to other human-motion applications.
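The data flow through the three stages can be illustrated with a minimal sketch. It is not the authors' implementation: every function name, the flattened (x, y, z) pose layout, and the joint and marker counts are hypothetical, and each learned transformer is replaced by a trivial stand-in (copying the object position, linear interpolation, and marker repetition) so that only the staging and the shapes are shown.

```python
from typing import List

Pose = List[float]    # flattened (x, y, z) per joint; hypothetical layout
Motion = List[Pose]   # one pose per frame


def generate_grasp_pose(object_position: Pose, num_joints: int) -> Pose:
    """Stage 1 stand-in: put every joint at the object position.

    The paper uses a learned Grasp Pose Generation model here.
    """
    return [c for _ in range(num_joints) for c in object_position]


def infill_motion(start: Pose, end: Pose, num_frames: int) -> Motion:
    """Stage 2 stand-in: linear interpolation from start pose to grasp pose.

    The paper uses learned Temporal Infilling; interpolation is only a placeholder.
    """
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    frames = []
    for t in range(num_frames):
        alpha = t / (num_frames - 1)
        frames.append([(1 - alpha) * s + alpha * e for s, e in zip(start, end)])
    return frames


def lift_up(motion: Motion, markers_per_joint: int = 3) -> Motion:
    """Stage 3 stand-in: copy each joint to several markers.

    The paper uses a LiftUp Transformer to recover high-resolution markers.
    """
    lifted = []
    for pose in motion:
        joints = [pose[i:i + 3] for i in range(0, len(pose), 3)]
        lifted.append(
            [c for joint in joints for _ in range(markers_per_joint) for c in joint]
        )
    return lifted


def run_pipeline(start_pose: Pose, object_position: Pose, num_frames: int = 5) -> Motion:
    """Chain the three stages: grasp pose, temporal infilling, lift-up."""
    grasp = generate_grasp_pose(object_position, num_joints=len(start_pose) // 3)
    coarse = infill_motion(start_pose, grasp, num_frames)
    return lift_up(coarse)


# Usage: 4 joints at the origin reaching for an object at (1, 2, 3).
motion = run_pipeline([0.0] * 12, [1.0, 2.0, 3.0], num_frames=5)
```

Because each stage only consumes the previous stage's output, any stand-in could be swapped for a learned model without touching the others, which is the kind of modularity the abstract describes.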