OVGrasp: Open-Vocabulary Grasping Assistance via Multimodal Intent Detection
This work addresses the challenge of restoring autonomy for people with motor impairments by enabling robust, open-vocabulary grasp assistance in unpredictable settings. Its contribution is specific to the domain of wearable assistive robotics.
The paper tackles the problem of providing grasping assistance for individuals with motor impairments in unstructured environments. It develops OVGrasp, a hierarchical control framework that integrates multimodal inputs such as RGB-D vision and voice commands, and reports a grasping ability score (GAS) of 87.00% in evaluations with ten participants.
Grasping assistance is essential for restoring autonomy in individuals with motor impairments, particularly in unstructured environments where object categories and user intentions are diverse and unpredictable. We present OVGrasp, a hierarchical control framework for soft exoskeleton-based grasp assistance that integrates RGB-D vision, open-vocabulary prompts, and voice commands to enable robust multimodal interaction. To enhance generalization in open environments, OVGrasp incorporates a vision-language foundation model with an open-vocabulary mechanism, allowing zero-shot detection of previously unseen objects without retraining. A multimodal decision-maker further fuses spatial and linguistic cues to infer user intent, such as grasp or release, in multi-object scenarios. We deploy the complete framework on a custom egocentric-view wearable exoskeleton and conduct systematic evaluations on 15 objects across three grasp types. Experimental results with ten participants demonstrate that OVGrasp achieves a grasping ability score (GAS) of 87.00%, outperforming state-of-the-art baselines and achieving improved kinematic alignment with natural hand motion.
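The multimodal decision step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `Detection` fields, the keyword lists, the reach threshold, and the image-center heuristic are all assumptions chosen for illustration. It only shows one plausible way that spatial cues (object position and depth from an RGB-D detector) and linguistic cues (a transcribed voice command) might be combined to choose between grasp and release when several objects are in view.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Optional, Tuple


@dataclass
class Detection:
    """One object reported by a (hypothetical) open-vocabulary detector."""
    label: str                    # text label matched by the detector
    center: Tuple[float, float]   # bounding-box center in pixels
    depth_m: float                # depth at the box center, in meters
    score: float                  # detector confidence in [0, 1]


# Assumed keyword lists; a real system would likely use a richer language model.
INTENT_KEYWORDS = {
    "grasp": ("grasp", "grab", "pick"),
    "release": ("release", "drop", "let go"),
}


def parse_intent(command: str) -> Optional[str]:
    """Map a transcribed voice command to 'grasp', 'release', or None."""
    text = command.lower()
    for intent, words in INTENT_KEYWORDS.items():
        if any(word in text for word in words):
            return intent
    return None


def select_target(
    detections: List[Detection],
    command: str,
    image_center: Tuple[float, float] = (320.0, 240.0),  # assumed 640x480 frame
    max_reach_m: float = 0.6,                            # assumed reach limit
) -> Optional[Detection]:
    """Pick one reachable object: prefer a named label, then the most central."""
    text = command.lower()
    reachable = [d for d in detections if d.depth_m <= max_reach_m]
    if not reachable:
        return None

    def cost(d: Detection) -> Tuple[bool, float]:
        named = d.label.lower() in text
        offset = hypot(d.center[0] - image_center[0], d.center[1] - image_center[1])
        # False sorts before True, so objects named in the command come first.
        return (not named, offset)

    return min(reachable, key=cost)


def decide(detections: List[Detection], command: str) -> Optional[Tuple[str, Optional[str]]]:
    """Fuse linguistic and spatial cues into an (intent, target label) pair."""
    intent = parse_intent(command)
    if intent is None:
        return None
    if intent == "release":
        return ("release", None)  # releasing needs no target object
    target = select_target(detections, command)
    return None if target is None else ("grasp", target.label)


scene = [
    Detection("mug", (300.0, 250.0), 0.40, 0.90),
    Detection("apple", (330.0, 240.0), 0.35, 0.80),
    Detection("bottle", (320.0, 240.0), 1.50, 0.90),  # centered but out of reach
]
print(decide(scene, "grab the mug"))
print(decide(scene, "grab it"))
print(decide(scene, "let go"))
```

In this sketch, a command that names an object overrides the spatial preference, while an unspecific command such as "grab it" falls back to the reachable object closest to the center of the egocentric view. Either rule could be swapped for a learned scoring function without changing the overall structure.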