LIGHTEN: Learning Interactions with Graph and Hierarchical TEmporal Networks for HOI in videos
This work provides a more general method for human-object interaction detection in videos, which particularly benefits applications where depth maps or 3D human pose data are unavailable.
The paper addresses human-object interaction (HOI) detection and anticipation in videos by introducing LIGHTEN, a hierarchical approach that learns visual features to capture spatio-temporal cues at multiple granularities. It reports state-of-the-art results on the CAD-120 dataset for HOI detection (88.9% and 92.6%) and anticipation.
Analyzing human-object interactions in a video involves identifying the relationships between the humans and the objects present in it. The task can be viewed as a specialized version of Visual Relationship Detection in which one of the objects must be a human. While traditional methods formulate the problem as inference on a sequence of video segments, we present a hierarchical approach, LIGHTEN, that learns visual features to effectively capture spatio-temporal cues at multiple granularities in a video. Unlike current approaches, LIGHTEN avoids using ground-truth data such as depth maps or 3D human pose, which improves generalization to non-RGBD datasets as well. Furthermore, it does so using only visual features, instead of the commonly used hand-crafted spatial features. We achieve state-of-the-art results on the human-object interaction detection (88.9% and 92.6%) and anticipation tasks of CAD-120, and competitive results on image-based HOI detection on the V-COCO dataset, setting a new benchmark for approaches based on visual features. Code for LIGHTEN is available at https://github.com/praneeth11009/LIGHTEN-Learning-Interactions-with-Graphs-and-Hierarchical-TEmporal-Networks-for-HOI