CV · Sep 10, 2019

Reasoning About Human-Object Interactions Through Dual Attention Networks

Tete Xiao, Quanfu Fan, Dan Gutfreund, Mathew Monfort, Aude Oliva, Bolei Zhou

arXiv:1909.04743v1 12.036 citations

Originality: Incremental advance

AI Analysis

This work addresses the challenge of understanding human-object interactions in videos, which is important for applications in robotics and computer vision, though it appears incremental as it builds on existing attention mechanisms.

The authors propose a Dual Attention Network that weights the features most relevant to objects and to actions. The model achieves competitive classification performance on the Something-Something dataset and supports weak spatiotemporal localization and affordance segmentation while being trained only with video-level labels.

Objects are entities we act upon, where the functionality of an object is determined by how we interact with it. In this work we propose a Dual Attention Network model which reasons about human-object interactions. The dual-attentional framework weights the important features for objects and actions respectively. As a result, the recognition of objects and actions mutually benefit each other. The proposed model shows competitive classification performance on the human-object interaction dataset Something-Something. Besides, it can perform weak spatiotemporal localization and affordance segmentation, despite being trained only with video-level labels. The model not only finds when an action is happening and which object is being manipulated, but also identifies which part of the object is being interacted with. Project page: https://dual-attention-network.github.io/
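The idea of weighting features separately for objects and actions, with each stream informing the other, can be illustrated with a small NumPy sketch. This is not the authors' architecture: the function names (`attention_pool`, `dual_attention`), the dot-product scoring, and the way one stream's pooled feature is added to the other stream's query are all assumptions chosen for illustration. The input is assumed to be a video feature map flattened to one row per spatiotemporal location.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(features, query):
    """Pool (N, D) location features into one (D,) vector.

    Returns the pooled vector and the (N,) attention weights.
    Scaled dot-product scoring is an illustrative assumption.
    """
    scores = features @ query / np.sqrt(features.shape[1])
    weights = softmax(scores)
    return weights @ features, weights

def dual_attention(features, q_obj, q_act):
    """Hypothetical two-stream attention with cross-conditioning."""
    # First pass: each stream attends on its own.
    obj_ctx, _ = attention_pool(features, q_obj)
    act_ctx, _ = attention_pool(features, q_act)
    # Second pass: each query is shifted by the other stream's pooled
    # feature, so object attention depends on the action and vice versa.
    obj_feat, obj_w = attention_pool(features, q_obj + act_ctx)
    act_feat, act_w = attention_pool(features, q_act + obj_ctx)
    return obj_feat, act_feat, obj_w, act_w

# Toy input: T=4 frames, 3x3 spatial grid, D=8 channels (all assumed sizes).
T, H, W, D = 4, 3, 3, 8
rng = np.random.default_rng(0)
features = rng.normal(size=(T * H * W, D))
q_obj = rng.normal(size=D)
q_act = rng.normal(size=D)

obj_feat, act_feat, obj_w, act_w = dual_attention(features, q_obj, q_act)
# Reshaping the weights gives a coarse map over time and space.
obj_map = obj_w.reshape(T, H, W)
act_map = act_w.reshape(T, H, W)
```

In a sketch like this, the reshaped weight maps are what would give a rough indication of when and where the model is attending, even though training would only supervise the pooled features with video-level labels.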
