RO CV

May 17, 2025

GLOVER++: Unleashing the Potential of Affordance Learning from Human Behaviors for Robotic Manipulation

Teli Ma, Jia Zheng, Zifan Wang, Ziyao Gao, Jiaming Zhou, Junwei Liang

arXiv:2505.11865v118.716 citations

h-index: 9

Originality: Incremental advance

AI Analysis

For researchers in robotics and AI, this work addresses the problem of enabling robots to learn manipulation skills from human videos. Its contribution is incremental: a new dataset and a new framework.

The paper tackles the challenge of transferring manipulation skills from human demonstrations to robots. It targets two gaps: the lack of large-scale affordance-annotated datasets and the insufficient exploration of affordances in diverse contexts. To close them, it introduces HOVA-500K (500,000 images across 1,726 object categories and 675 actions) and the GLOVER++ framework. GLOVER++ achieves state-of-the-art results on the HOVA-500K benchmark and demonstrates strong generalization in robotic manipulation tasks.

Learning manipulation skills from human demonstration videos offers a promising path toward generalizable and interpretable robotic intelligence, particularly through the lens of actionable affordances. However, transferring such knowledge remains challenging due to: 1) a lack of large-scale datasets with precise affordance annotations, and 2) insufficient exploration of affordances in diverse manipulation contexts. To address these gaps, we introduce HOVA-500K, a large-scale, affordance-annotated dataset comprising 500,000 images across 1,726 object categories and 675 actions. We also release a standardized benchmarking suite for multi-modal affordance reasoning. Built upon HOVA-500K, we present GLOVER++, a global-to-local affordance training framework that effectively transfers actionable affordance knowledge from human demonstrations to downstream open-vocabulary reasoning tasks. GLOVER++ achieves state-of-the-art results on the HOVA-500K benchmark and demonstrates strong generalization across diverse downstream robotic manipulation tasks. By explicitly modeling actionable affordances, GLOVER++ facilitates robust transfer across scenes, modalities, and tasks. We hope that HOVA-500K and the GLOVER++ framework will serve as valuable resources for bridging the gap between human demonstrations and robotic manipulation capabilities.
