CV · May 12

Grounding by Remembering: Cross-Scene and In-Scene Memory for 3D Functional Affordances

arXiv:2605.11616
Predicted impact: top 54% in CV (last 90 days) · Originality: Incremental advance
AI Analysis

For embodied agents that must localize small, visually ambiguous actionable regions on objects, this paper offers a training-free, memory-based approach that outperforms prior training-free pipelines on SceneFun3D.

AFFORDMEM grounds 3D functional affordances by guiding a frozen VLM with two kinds of memory: cross-scene memory (category-level RGB images with affordance overlays) and in-scene spatial memory (a scene graph of candidate instances and their 3D relations). On SceneFun3D it improves AP50 over the prior training-free state of the art by 3.23 on Split 0 and 3.7 on Split 1.

Functional affordance grounding requires more than recognizing an object: an agent must localize the specific region that supports an interaction, such as the handle to pull or the button to press. This is difficult for training-free vision-language pipelines because actionable regions are often small, visually ambiguous, and repeated across multiple same-category instances in a scene. We propose AFFORDMEM, a framework that grounds 3D functional affordances by remembering geometry at two levels. The first is cross-scene affordance memory: the agent maintains a category-level memory bank of RGB images with affordance regions rendered as overlays, and recalls the most informative examples at query time to guide a frozen VLM toward small operable subregions that text-only prompting consistently misses. The second is in-scene spatial memory: as the agent processes the scene, it organizes candidate instances and their 3D spatial relations into a structured scene graph, enabling the language model to resolve references over distant or currently unobserved candidates such as "the second handle from the top." AFFORDMEM requires no model fine-tuning and no target-scene annotation, using a reusable memory bank built from source scenes. On SceneFun3D, our method improves AP50 over the prior training-free state of the art by 3.23 on Split 0 and 3.7 on Split 1. Ablation studies support complementary benefits: cross-scene affordance memory improves fine-grained localization, while in-scene spatial memory provides the larger gain on spatially qualified queries. A project homepage is available.
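
To make the two memory levels concrete, here is a minimal Python sketch. It is not the authors' implementation. Every name in it (`MemoryEntry`, `Instance`, `recall_examples`, `nth_from_top`) is hypothetical. The abstract does not say how "most informative" examples are chosen, so cosine similarity over a toy embedding is assumed here. The final resolution step, which the paper leaves to the language model, is replaced by a simple rule over "above" edges so the example can run on its own. The z axis is assumed to point up.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    category: str     # object category, e.g. "cabinet"
    image_id: str     # hypothetical id of a stored RGB image with an affordance overlay
    embedding: tuple  # hypothetical visual feature for the stored example


@dataclass
class Instance:
    instance_id: int
    label: str        # e.g. "handle"
    center: tuple     # (x, y, z) in scene coordinates; z is assumed to point up


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def recall_examples(bank, category, query_embedding, k=2):
    """Cross-scene memory (assumed criterion): top-k same-category entries by similarity."""
    candidates = [e for e in bank if e.category == category]
    candidates.sort(key=lambda e: cosine(e.embedding, query_embedding), reverse=True)
    return candidates[:k]


def build_above_edges(instances):
    """In-scene memory: directed edges (a, b) meaning instance a is above same-label instance b."""
    return [
        (a.instance_id, b.instance_id)
        for a in instances
        for b in instances
        if a.label == b.label and a.center[2] > b.center[2]
    ]


def nth_from_top(instances, label, n):
    """Rule-based stand-in for the language model: the n-th instance of `label` from the top
    is the one with exactly n - 1 same-label instances above it. Returns None if no instance
    matches (for example, when heights tie)."""
    edges = build_above_edges(instances)
    for inst in instances:
        if inst.label != label:
            continue
        above_count = sum(1 for _, below in edges if below == inst.instance_id)
        if above_count == n - 1:
            return inst
    return None


# Toy memory bank built from source scenes (all values are made up for illustration).
bank = [
    MemoryEntry("cabinet", "src_scene_a/img_03", (1.0, 0.0)),
    MemoryEntry("cabinet", "src_scene_b/img_11", (0.6, 0.8)),
    MemoryEntry("door", "src_scene_c/img_07", (1.0, 0.0)),
]
recalled = recall_examples(bank, "cabinet", (0.9, 0.1), k=1)

# Toy target scene: three drawer handles at different heights plus an unrelated knob.
scene = [
    Instance(0, "handle", (0.0, 0.0, 0.3)),
    Instance(1, "handle", (0.0, 0.0, 0.9)),
    Instance(2, "handle", (0.0, 0.0, 0.6)),
    Instance(3, "knob", (0.5, 0.0, 1.5)),
]
target = nth_from_top(scene, "handle", 2)
print(recalled[0].image_id, target.instance_id)
```

In this sketch the recalled overlay images would be attached to the VLM prompt as visual examples, and the relation edges would be serialized into text for the language model. Both steps are left out because the abstract does not specify their format.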
