CV · Feb 24

VAGNet: Grounding 3D Affordance from Human-Object Interactions in Videos

arXiv:2602.20608v1 · 1 citation · h-index: 4
Originality: Highly original
AI Analysis

This work addresses the challenge of accurately localizing contact regions for embodied visual reasoning. It is incremental in that it builds on existing affordance grounding, though it does so with a novel video-based approach.

The paper tackles 3D object affordance grounding by leveraging dynamic human-object interaction videos instead of static cues. Its proposed framework, VAGNet, achieves state-of-the-art performance on the new PVAD dataset.

3D object affordance grounding aims to identify regions on 3D objects that support human-object interaction (HOI), a capability essential to embodied visual reasoning. However, most existing approaches rely on static visual or textual cues, neglecting that affordances are inherently defined by dynamic actions. As a result, they often struggle to localize the true contact regions involved in real interactions. We take a different perspective. Humans learn how to use objects by observing and imitating actions, not just by examining shapes. Motivated by this intuition, we introduce video-guided 3D affordance grounding, which leverages dynamic interaction sequences to provide functional supervision. To achieve this, we propose VAGNet, a framework that aligns video-derived interaction cues with 3D structure to resolve ambiguities that static cues cannot address. To support this new setting, we introduce PVAD, the first HOI video-3D pairing affordance dataset, providing functional supervision unavailable in prior works. Extensive experiments on PVAD show that VAGNet achieves state-of-the-art performance, significantly outperforming static-based baselines. The code and dataset will be made publicly available.

