HOKEM: Human and Object Keypoint-based Extension Module for Human-Object Interaction Detection
This work addresses the accurate detection of interactions between humans and objects in images, a task that is crucial for semantic understanding. The contribution appears incremental, however, because it builds on existing detection models.
The paper tackles human-object interaction detection by proposing HOKEM, an extension module that improves accuracy through a novel object keypoint extraction method and a human-object adaptive graph convolutional network, and it reports a large accuracy gain on the V-COCO dataset.
Human-object interaction (HOI) detection, which captures relationships between humans and objects, is an important task in the semantic understanding of images. When human and object keypoints extracted from an image are processed with a graph convolutional network (GCN) to detect HOI, it is crucial to extract appropriate object keypoints regardless of the object type and to design a GCN that accurately captures the spatial relationships between keypoints. This paper presents the human and object keypoint-based extension module (HOKEM), an easy-to-use extension module that improves the accuracy of conventional detection models. The proposed object keypoint extraction method is simple yet accurately represents the shapes of various objects. Moreover, the proposed human-object adaptive GCN (HO-AGCN), which introduces adaptive graph optimization and an attention mechanism, accurately captures the spatial relationships between keypoints. Experiments on the HOI dataset V-COCO showed that HOKEM boosted the accuracy of an appearance-based model by a large margin.
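To make the idea of a GCN with adaptive graph optimization and attention over keypoints more concrete, the following is a minimal NumPy sketch of one such layer. It is not the HO-AGCN described in the paper. It assumes a common formulation in which a learnable matrix is added to a fixed, normalized adjacency matrix and a softmax attention weight is computed per keypoint. All names (`normalize_adjacency`, `adaptive_gcn_layer`, `b_learned`, `attn_vec`) and the keypoint counts are hypothetical, and the weights are random rather than trained.

```python
import numpy as np


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a symmetric 0/1 adjacency matrix."""
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops so every degree >= 1
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def adaptive_gcn_layer(x, a_fixed, b_learned, w, attn_vec):
    """One illustrative graph convolution with an adaptive graph and attention.

    x         : (N, C_in)  keypoint features, e.g. 2D coordinates
    a_fixed   : (N, N)     normalized adjacency from the predefined graph
    b_learned : (N, N)     assumed learnable offset that adapts the graph
    w         : (C_in, C_out) feature projection
    attn_vec  : (C_out,)   assumed vector that scores each keypoint
    """
    h = (a_fixed + b_learned) @ x @ w        # aggregate neighbors, then project
    h = np.maximum(h, 0.0)                   # ReLU
    scores = h @ attn_vec                    # one scalar score per keypoint
    scores = scores - scores.max()           # stabilize the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    return h * weights[:, None], weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_human, n_object = 17, 4                # hypothetical keypoint counts
    n = n_human + n_object

    # Hypothetical graph: a chain over human keypoints, a ring over object
    # keypoints, and one edge linking the two groups.
    adj = np.zeros((n, n))
    for i in range(n_human - 1):
        adj[i, i + 1] = adj[i + 1, i] = 1.0
    for j in range(n_object):
        a, b = n_human + j, n_human + (j + 1) % n_object
        adj[a, b] = adj[b, a] = 1.0
    adj[n_human - 1, n_human] = adj[n_human, n_human - 1] = 1.0

    x = rng.normal(size=(n, 2))              # stand-in for (x, y) coordinates
    out, weights = adaptive_gcn_layer(
        x,
        normalize_adjacency(adj),
        0.01 * rng.normal(size=(n, n)),      # untrained adaptive offset
        rng.normal(size=(2, 8)),
        rng.normal(size=8),
    )
    print(out.shape, float(weights.sum()))
```

In this sketch the fixed adjacency encodes the predefined skeleton and object outline, while the added matrix lets training introduce or reweight connections, such as links between a hand keypoint and an object keypoint, that the predefined graph does not contain.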