CV · May 12

Grounding by Remembering: Cross-Scene and In-Scene Memory for 3D Functional Affordances

Qirui Wang, Jingyi He, Yining Pan, Xulei Yang, Shijie Li

arXiv:2605.11616 · 15.6

Predicted impact: top 54% in CV · last 90 days
Originality: Incremental advance

AI Analysis

For embodied agents that need to localize small, ambiguous actionable regions on objects, this work provides a training-free, memory-based approach that outperforms prior training-free pipelines.

AFFORDMEM grounds 3D functional affordances by using cross-scene memory (category-level RGB images with affordance overlays) and in-scene spatial memory (a scene graph of candidate instances) to guide a frozen VLM. On SceneFun3D, it improves AP50 over the prior training-free state of the art by 3.23 on Split 0 and 3.7 on Split 1.
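The cross-scene recall step can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the `MemoryEntry` fields, the `recall` helper, the toy embeddings, and in particular the use of cosine similarity as the selection rule. The summary above only says that the most informative examples are recalled and does not state how they are chosen.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryEntry:
    category: str           # object category, e.g. "cabinet"
    overlay_id: str         # hypothetical ID of an RGB image with an affordance overlay
    embedding: List[float]  # hypothetical feature vector describing that image


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall(bank: List[MemoryEntry], category: str,
           query_embedding: List[float], k: int = 2) -> List[MemoryEntry]:
    """Return the k same-category entries most similar to the query (assumed rule)."""
    candidates = [e for e in bank if e.category == category]
    candidates.sort(key=lambda e: cosine(e.embedding, query_embedding), reverse=True)
    return candidates[:k]


# Toy memory bank built from "source scenes" (made-up values).
bank = [
    MemoryEntry("cabinet", "cab_001", [1.0, 0.0]),
    MemoryEntry("cabinet", "cab_002", [0.0, 1.0]),
    MemoryEntry("cabinet", "cab_003", [0.7, 0.7]),
    MemoryEntry("door", "door_001", [1.0, 0.0]),
]

top = recall(bank, "cabinet", [1.0, 0.1], k=2)
print([e.overlay_id for e in top])
```

In a pipeline of this kind, the recalled overlay images would be passed to the frozen VLM as visual examples alongside the query view; that hand-off is not modeled here.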

Functional affordance grounding requires more than recognizing an object: an agent must localize the specific region that supports an interaction, such as the handle to pull or the button to press. This is difficult for training-free vision-language pipelines because actionable regions are often small, visually ambiguous, and repeated across multiple same-category instances in a scene. We propose AFFORDMEM, a framework that grounds 3D functional affordances by remembering geometry at two levels. The first is cross-scene affordance memory: the agent maintains a category-level memory bank of RGB images with affordance regions rendered as overlays, and recalls the most informative examples at query time to guide a frozen VLM toward small operable subregions that text-only prompting consistently misses. The second is in-scene spatial memory: as the agent processes the scene, it organizes candidate instances and their 3D spatial relations into a structured scene graph, enabling the language model to resolve references over distant or currently unobserved candidates such as "the second handle from the top." AFFORDMEM requires no model fine-tuning and no target-scene annotation, using a reusable memory bank built from source scenes. On SceneFun3D, our method improves AP50 over the prior training-free state of the art by 3.23 on Split 0 and 3.7 on Split 1. Ablation studies support complementary benefits: cross-scene affordance memory improves fine-grained localization, while in-scene spatial memory provides the larger gain on spatially qualified queries. A project homepage is available.
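To make the in-scene spatial memory idea concrete, here is a minimal sketch of a scene graph over candidate instances and of how a reference like "the second handle from the top" could be resolved against it. The `Instance` type, the `build_relations` and `nth_from_top` helpers, the z-up coordinate convention, and all coordinates are hypothetical. In the abstract the language model does the resolving over the graph; the deterministic lookup below only illustrates why storing every candidate's 3D position makes such a query answerable even when a candidate is not currently in view.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Instance:
    inst_id: str
    category: str
    centroid: Tuple[float, float, float]  # (x, y, z) in metres, z up (assumed)


def build_relations(instances: List[Instance]) -> List[Tuple[str, str, str]]:
    """Directed 'above' edges between same-category instances, based on centroid height."""
    edges = []
    for a in instances:
        for b in instances:
            if a is not b and a.category == b.category and a.centroid[2] > b.centroid[2]:
                edges.append((a.inst_id, "above", b.inst_id))
    return edges


def nth_from_top(instances: List[Instance], category: str, n: int) -> Optional[Instance]:
    """Return the n-th highest instance of a category (1-based), or None if out of range."""
    ranked = sorted(
        (i for i in instances if i.category == category),
        key=lambda i: i.centroid[2],
        reverse=True,
    )
    return ranked[n - 1] if 1 <= n <= len(ranked) else None


# Toy scene: three drawer handles stacked vertically, plus an unrelated knob.
scene = [
    Instance("handle_a", "handle", (0.4, 1.2, 0.30)),
    Instance("handle_b", "handle", (0.4, 1.2, 0.90)),
    Instance("handle_c", "handle", (0.4, 1.2, 0.60)),
    Instance("knob_a", "knob", (1.5, 0.2, 1.10)),
]

relations = build_relations(scene)
target = nth_from_top(scene, "handle", 2)
print(target.inst_id if target else None)
```

A serialized form of `relations` is the sort of structured context that could be handed to a language model in place of the hard-coded ranking used here.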

