CV | Sep 16, 2025

Explicit Multimodal Graph Modeling for Human-Object Interaction Detection

arXiv:2509.12554v2 | h-index: 13
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of recognizing interactions in HOI detection for computer vision applications. It is an incremental improvement over transformer-based methods.

The paper tackles the problem of Human-Object Interaction (HOI) detection by proposing a Multimodal Graph Network Modeling (MGNM) method that explicitly models relational structures using graph neural networks, achieving state-of-the-art performance on HICO-DET and V-COCO benchmarks.

Transformer-based methods have recently become the prevailing approach for Human-Object Interaction (HOI) detection. However, the Transformer architecture does not explicitly model the relational structures inherent in HOI detection, which impedes the recognition of interactions. In contrast, Graph Neural Networks (GNNs) are inherently better suited for this task, as they explicitly model the relationships between human-object pairs. Therefore, in this paper, we propose Multimodal Graph Network Modeling (MGNM) that leverages GNN-based relational structures to enhance HOI detection. Specifically, we design a multimodal graph network framework that explicitly models the HOI task in a four-stage graph structure. Furthermore, we introduce a multi-level feature interaction mechanism within our graph network. This mechanism leverages multi-level visual and language features to enhance information propagation across human-object pairs. Consequently, our proposed MGNM achieves state-of-the-art (SOTA) performance on two widely used benchmarks: HICO-DET and V-COCO. Moreover, when integrated with a more advanced object detector, our method demonstrates a significant performance gain and maintains an effective balance between rare and non-rare classes.
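
To make the idea of explicitly modeling human-object pairs as a graph more concrete, the following is a minimal sketch of one round of generic message passing on a bipartite human-object graph. It is not the paper's MGNM: the abstract does not specify the four stages or the multi-level feature interaction, so the function names (`message_pass`, `pair_features`), the mean aggregation, and the 0.5 averaging update are all illustrative assumptions.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def message_pass(
    humans: List[Vector], objects: List[Vector]
) -> Tuple[List[Vector], List[Vector]]:
    """One round of mean-aggregation message passing (hypothetical update rule).

    The graph is assumed to be complete bipartite: every human node is
    connected to every object node. Each node averages its own feature
    with the mean feature of the nodes on the other side.
    """
    object_message = mean(objects)  # message sent to every human node
    human_message = mean(humans)    # message sent to every object node
    new_humans = [
        [0.5 * (a + b) for a, b in zip(h, object_message)] for h in humans
    ]
    new_objects = [
        [0.5 * (a + b) for a, b in zip(o, human_message)] for o in objects
    ]
    return new_humans, new_objects


def pair_features(
    humans: List[Vector], objects: List[Vector]
) -> Dict[Tuple[int, int], Vector]:
    """Build one feature per human-object pair (edge) by concatenation."""
    return {
        (i, j): h + o
        for i, h in enumerate(humans)
        for j, o in enumerate(objects)
    }


# Toy features: one detected human and two detected objects.
humans = [[1.0, 0.0]]
objects = [[0.0, 1.0], [0.0, 3.0]]
new_humans, new_objects = message_pass(humans, objects)
pairs = pair_features(new_humans, new_objects)
```

In a real HOI detector, each entry of `pairs` would be passed to an interaction classifier, and the node features would come from a visual backbone and a language encoder rather than hand-written toy vectors.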

