AffordMatcher: Affordance Learning in 3D Scenes from Visual Signifiers
This work addresses scene-level affordance learning for applications such as robotics or AR/VR. The contribution appears incremental: it builds on existing geometric and visual approaches, adding a new dataset and a matching method.
The authors tackle affordance learning in 3D scenes with two contributions: AffordBridge, a large-scale dataset with 291,637 annotations across 685 indoor scenes, and AffordMatcher, a method that matches image and point cloud instances to identify affordance regions more precisely. Experiments on the dataset indicate that the method is effective compared to other methods.
Affordance learning is a complex challenge in many applications. Existing approaches primarily focus on the geometric structures, visual knowledge, and affordance labels of objects to determine interactable regions. However, extending this learning capability to a scene is significantly more complicated, as incorporating object- and scene-level semantics is not straightforward. In this work, we introduce AffordBridge, a large-scale dataset with 291,637 functional interaction annotations across 685 high-resolution indoor scenes in the form of point clouds. Our affordance annotations are complemented by RGB images that are linked to the same instances within the scenes. Building upon our dataset, we propose AffordMatcher, an affordance learning method that establishes coherent semantic correspondences between image-based and point cloud-based instances for keypoint matching, enabling a more precise identification of affordance regions based on cues, the so-called visual signifiers. Experimental results on our dataset demonstrate the effectiveness of our approach compared to other methods.
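To make the idea of matching image-based and point cloud-based instances more concrete, the following is a minimal sketch, not the AffordMatcher method itself. It assumes each instance has already been encoded as a fixed-length feature vector, and it substitutes a generic technique (cosine similarity with mutual nearest-neighbor filtering) for the learned correspondence described above. The function names, the `min_similarity` threshold, and the toy feature values are all hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mutual_nearest_matches(
    image_feats: List[Vector],
    point_feats: List[Vector],
    min_similarity: float = 0.5,  # hypothetical threshold, not from the paper
) -> List[Tuple[int, int, float]]:
    """Return (image_idx, point_idx, similarity) for mutual nearest neighbors."""
    if not image_feats or not point_feats:
        return []

    # Pairwise similarity: rows are image instances, columns are point cloud instances.
    sim = [[cosine_similarity(f, g) for g in point_feats] for f in image_feats]

    # Best point cloud instance for each image instance, and vice versa.
    best_point_for_image = [
        max(range(len(point_feats)), key=lambda j: row[j]) for row in sim
    ]
    best_image_for_point = [
        max(range(len(image_feats)), key=lambda i: sim[i][j])
        for j in range(len(point_feats))
    ]

    # Keep a pair only if each side picks the other and the score is high enough.
    matches = []
    for i, j in enumerate(best_point_for_image):
        if best_image_for_point[j] == i and sim[i][j] >= min_similarity:
            matches.append((i, j, sim[i][j]))
    return matches


# Toy features (made up for illustration): two image instances, three 3D instances.
image_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
point_feats = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]

matches = mutual_nearest_matches(image_feats, point_feats)
for image_idx, point_idx, score in matches:
    print(f"image instance {image_idx} <-> point cloud instance {point_idx} ({score:.3f})")
```

In this toy setup the third point cloud instance stays unmatched because no image instance selects it. A learned method would replace both the hand-made features and the fixed threshold.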