Weakly-Supervised Affordance Grounding Guided by Part-Level Semantic Priors
This paper addresses the problem of locating actions and functions on objects, a task relevant to robotics and human-computer interaction, and advances it by enhancing a baseline model with part-level semantic priors.
The paper tackles weakly supervised affordance grounding: a model is trained to identify affordance regions on objects from human-object interaction images and egocentric object images, without dense labels. The authors report a breakthrough improvement over existing methods.
In this work, we focus on the task of weakly supervised affordance grounding, where a model is trained to identify affordance regions on objects using human-object interaction images and egocentric object images without dense labels. Previous works are mostly built upon class activation maps, which are effective for semantic segmentation but may not be suitable for locating actions and functions. Leveraging recent foundation models, we develop a supervised training pipeline based on pseudo labels. The pseudo labels are generated by an off-the-shelf part segmentation model, guided by a mapping from affordances to part names. Furthermore, we introduce three key enhancements to the baseline model: a label refining stage, a fine-grained feature alignment process, and a lightweight reasoning module. These techniques harness the semantic knowledge of static objects embedded in off-the-shelf foundation models to improve affordance learning, effectively bridging the gap between objects and actions. Extensive experiments demonstrate that the proposed model achieves a breakthrough improvement over existing methods. Our code is available at https://github.com/woyut/WSAG-PLSP .
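To make the pseudo-label idea concrete, the following is a minimal sketch of how an affordance-to-part-name mapping could select part masks and merge them into a binary pseudo label. It is not the released implementation: the mapping entries, the function name `build_pseudo_label`, and the nested-list mask format are all assumptions made for illustration, and the part masks stand in for the output of an off-the-shelf part segmentation model.

```python
from typing import Dict, List

# Binary mask as nested lists; 1 marks a pixel that belongs to the part.
Mask = List[List[int]]

# Hypothetical affordance-to-part mapping; the entries are illustrative only.
AFFORDANCE_TO_PARTS: Dict[str, List[str]] = {
    "hold": ["handle"],
    "cut_with": ["blade"],
}


def build_pseudo_label(affordance: str, part_masks: Dict[str, Mask]) -> Mask:
    """Union the masks of all parts mapped to the given affordance."""
    if not part_masks:
        raise ValueError("part_masks must not be empty")
    first = next(iter(part_masks.values()))
    height, width = len(first), len(first[0])
    label = [[0] * width for _ in range(height)]
    for part in AFFORDANCE_TO_PARTS.get(affordance, []):
        mask = part_masks.get(part)
        if mask is None:
            continue  # no mask was provided for this part in this image
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    label[y][x] = 1
    return label


# Toy 2x4 "knife" image: handle on the left, blade on the right.
knife_parts = {
    "handle": [[1, 1, 0, 0], [1, 1, 0, 0]],
    "blade": [[0, 0, 1, 1], [0, 0, 1, 1]],
}
hold_label = build_pseudo_label("hold", knife_parts)
```

In this sketch an affordance with no mapped part yields an all-zero label, which a training pipeline would presumably skip or treat as missing supervision.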