Implicit Affordance Acquisition via Causal Action-Effect Modeling in the Video Domain
This work addresses the challenge of learning commonsense affordances for AI systems. Its contribution is incremental, since it builds on existing instructional video resources and pretraining methods.
The paper tackles the problem of acquiring affordance knowledge from visual data. It creates the Causal Action-Effect dataset and designs two pretraining tasks, Masked Action Modeling and Masked Effect Modeling. The resulting model outperforms strong visual-linguistic and linguistic models on a zero-shot physical reasoning task.
Affordance knowledge is a fundamental aspect of commonsense knowledge. Recent findings indicate that world knowledge emerges through large-scale self-supervised pretraining, motivating our exploration of acquiring affordance knowledge from the visual domain. To this end, we augment an existing instructional video resource to create the new Causal Action-Effect (CAE) dataset and design two novel pretraining tasks, Masked Action Modeling (MAM) and Masked Effect Modeling (MEM), which promote the acquisition of two affordance properties in models: behavior equivalence and entity equivalence, respectively. We empirically demonstrate the effectiveness of our proposed methods in learning affordance properties. Furthermore, we show that a model pretrained on both tasks outperforms a strong image-based visual-linguistic foundation model (FLAVA) as well as pure linguistic models on a zero-shot physical reasoning probing task.
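To make the idea of a masked-action objective concrete, the following is a minimal sketch of how a single training example might be built. It assumes that the action is expressed as a verb token in text accompanying a video clip, which the abstract does not state. The helper `mask_action`, the `[MASK]` token, and the example caption are all hypothetical and are not taken from the paper; the video side of the input and the MEM task are omitted entirely.

```python
from typing import List, Tuple

# Assumed mask token; the abstract does not name one.
MASK = "[MASK]"


def mask_action(tokens: List[str], action_index: int) -> Tuple[List[str], str]:
    """Hide the action verb at `action_index`.

    Returns the masked token list and the hidden verb, which would serve
    as the prediction target. Hypothetical helper for illustration only.
    """
    if not 0 <= action_index < len(tokens):
        raise IndexError("action_index out of range")
    masked = list(tokens)          # copy so the input list is left unchanged
    target = masked[action_index]  # the verb the model would have to recover
    masked[action_index] = MASK
    return masked, target


# Illustrative caption, not drawn from the CAE dataset.
caption = ["cut", "the", "apple", "in", "half"]
masked_caption, target_verb = mask_action(caption, 0)
```

In a setup of this kind, a model would receive the masked caption together with visual features and be trained to predict the hidden verb; how the paper actually constructs its inputs and targets should be taken from its method section, not from this sketch.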