LG, CV, IV | Jan 7, 2020

Visual-Semantic Graph Attention Networks for Human-Object Interaction Detection

arXiv:2001.02302v6 | 14 citations | Has Code
AI Analysis

This work addresses scene understanding for robotics by improving HOI detection, but it is incremental as it builds on existing graph-based methods with specific enhancements.

The paper tackles Human-Object Interaction (HOI) detection by proposing a dual-graph attention network that aggregates contextual visual, spatial, and semantic information from primary and subsidiary relations, achieving comparable results on the V-COCO and HICO-DET benchmarks.

In scene understanding, robots benefit not only from detecting individual scene instances but also from learning their possible interactions. Human-Object Interaction (HOI) Detection infers the action predicate on a <human, predicate, object> triplet. Contextual information has been found critical in inferring interactions. However, most works only use local features from a single human-object pair for inference. Few works have studied the disambiguating contribution of subsidiary relations made available via graph networks. Similarly, few have learned to effectively leverage visual cues along with the intrinsic semantic regularities contained in HOIs. We contribute a dual-graph attention network that dynamically aggregates contextual visual, spatial, and semantic information from primary human-object relations as well as subsidiary relations through attention mechanisms for strong disambiguating power. We achieve comparable results on two benchmarks: V-COCO and HICO-DET. Code is available at https://github.com/birlrobotics/vs-gats.
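To make the aggregation idea in the abstract concrete, here is a minimal sketch of attention-weighted message passing over two fully connected graphs, one holding visual node features and one holding semantic node features. It is not the authors' implementation: the function names (`attend`, `dual_graph_node_features`), the additive update, the single scoring vector per graph, and the toy feature values are all assumptions made for illustration, and spatial edge features are omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def attend(features, attn_weights):
    """One round of attention-weighted aggregation on a fully connected graph.

    features: list of node feature vectors, each of length d.
    attn_weights: hypothetical scoring vector of length 2*d (assumption).
    Returns each node's features plus an attention-weighted sum of the
    features of all other nodes.
    """
    n = len(features)
    updated = []
    for i in range(n):
        neighbours = [j for j in range(n) if j != i]
        if not neighbours:
            # A lone node has no context to aggregate.
            updated.append(list(features[i]))
            continue
        # Score each edge (i, j) from the concatenated pair of features.
        scores = [
            leaky_relu(sum(w * x for w, x in zip(attn_weights, features[i] + features[j])))
            for j in neighbours
        ]
        alphas = softmax(scores)
        d = len(features[i])
        context = [
            sum(a * features[j][k] for a, j in zip(alphas, neighbours))
            for k in range(d)
        ]
        updated.append([x + c for x, c in zip(features[i], context)])
    return updated

def dual_graph_node_features(visual, semantic, w_vis, w_sem):
    """Run attention on each graph, then concatenate the results per node."""
    vis = attend(visual, w_vis)
    sem = attend(semantic, w_sem)
    return [v + s for v, s in zip(vis, sem)]

# Toy scene: node 0 is a human, nodes 1 and 2 are objects (made-up values).
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
semantic = [[0.5], [0.1], [0.9]]
combined = dual_graph_node_features(visual, semantic, [0.1, 0.2, 0.3, 0.4], [1.0, -1.0])
```

In this sketch, every node's updated feature mixes in information from all other nodes, so a human node can draw on objects other than the one in the pair being classified. That is one way to read the "subsidiary relations" the abstract refers to; a predicate classifier would then operate on the combined features of each human-object pair.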
