CV | AI | Apr 8, 2024

Self-Explainable Affordance Learning with Embodied Caption

arXiv:2404.05603v1 | 8 citations | h-index: 28
Originality: Incremental advance
AI Analysis

The paper addresses action ambiguity and error correction in robotic tasks, though it appears incremental because it builds on existing affordance learning methods.

The paper tackles action ambiguity in visual affordance learning by introducing Self-Explainable Affordance learning (SEA) with embodied caption, enabling robots to articulate intentions and bridging explainable vision-language caption with affordance learning, supported by a new dataset and model.

In visual affordance learning, previous methods mainly used abundant images or videos of human behavior patterns to identify action possibility regions for object manipulation, with a variety of applications in robotic tasks. However, they face a central challenge of action ambiguity, for example whether a drum should be beaten or carried, as well as the difficulty of processing intricate scenes. Moreover, humans need to be able to intervene in time to rectify robot errors. To address these issues, we introduce Self-Explainable Affordance learning (SEA) with embodied caption. This approach enables robots to articulate their intentions and bridges the gap between explainable vision-language caption and visual affordance learning. Because no appropriate dataset exists, we introduce a pioneering dataset that integrates images, heatmaps, and embodied captions, along with metrics tailored for this task. Furthermore, we propose a novel model that combines affordance grounding with self-explanation in a simple but efficient manner. Extensive quantitative and qualitative experiments demonstrate our method's effectiveness.

