Learning Mutual Excitation for Hand-to-Hand and Human-to-Human Interaction Recognition
This work addresses the need for better interactive action recognition in video analysis and human-robot interaction. Its contribution is incremental, since it builds on existing graph convolution methods.
The paper tackles the problem of recognizing interactive actions, such as hand-to-hand and human-to-human interactions, by proposing a mutual excitation graph convolutional network (me-GCN) that models the mutual semantic relationships between entities. It achieves state-of-the-art performance on the Assembly101, NTU60-Interaction, and NTU120-Interaction datasets.
Recognizing interactive actions, including hand-to-hand interaction and human-to-human interaction, has attracted increasing attention for various applications in the field of video analysis and human-robot interaction. Considering the success of graph convolution in modeling topology-aware features from skeleton data, recent methods commonly apply graph convolution to separate entities and use late fusion for interactive action recognition, which can barely model the mutual semantic relationships between pairwise entities. To this end, we propose a mutual excitation graph convolutional network (me-GCN) by stacking mutual excitation graph convolution (me-GC) layers. Specifically, me-GC uses a mutual topology excitation module to first extract adjacency matrices from individual entities and then adaptively model the mutual constraints between them. Moreover, me-GC extends this idea and uses a mutual feature excitation module to extract and merge deep features from pairwise entities. Compared with graph convolution, our proposed me-GC gradually learns mutual information in each layer and each stage of graph convolution operations. Extensive experiments on a challenging hand-to-hand interaction dataset, i.e., the Assembly101 dataset, and two large-scale human-to-human interaction datasets, i.e., NTU60-Interaction and NTU120-Interaction, consistently verify the superiority of our proposed method, which outperforms the state-of-the-art GCN-based and Transformer-based methods.
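The abstract does not give the equations behind me-GC, so the following is only a minimal NumPy sketch of the general idea for a single frame and two entities with the same number of joints. It is not the authors' implementation. The attention-style adjacency, the sigmoid gating, the shared weights, and every name in it (me_gc_layer, entity_adjacency, init_params, w_q, w_k, w_out, w_gate) are assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax_rows(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def entity_adjacency(x, w_q, w_k):
    # ASSUMPTION: a data-dependent adjacency from joint-to-joint similarity.
    # x: (V, C_in) joint features of one entity -> (V, V) row-stochastic matrix.
    q, k = x @ w_q, x @ w_k
    return softmax_rows(q @ k.T / np.sqrt(q.shape[-1]))


def me_gc_layer(x_a, x_b, p):
    # Step 1: extract an adjacency matrix from each entity separately.
    adj_a = entity_adjacency(x_a, p["w_q"], p["w_k"])
    adj_b = entity_adjacency(x_b, p["w_q"], p["w_k"])

    # Step 2 (ASSUMED form of mutual topology excitation): gate each
    # adjacency by the other entity's adjacency, then renormalize the rows.
    exc_a = adj_a * sigmoid(adj_b)
    exc_b = adj_b * sigmoid(adj_a)
    exc_a = exc_a / exc_a.sum(axis=-1, keepdims=True)
    exc_b = exc_b / exc_b.sum(axis=-1, keepdims=True)

    # Step 3: ordinary graph convolution with the excited topologies.
    h_a = exc_a @ x_a @ p["w_out"]
    h_b = exc_b @ x_b @ p["w_out"]

    # Step 4 (ASSUMED form of mutual feature excitation): a channel gate
    # computed from the other entity's joint-averaged features.
    gate_from_b = sigmoid(h_b.mean(axis=0) @ p["w_gate"])
    gate_from_a = sigmoid(h_a.mean(axis=0) @ p["w_gate"])
    out_a = np.maximum(h_a * gate_from_b, 0.0)  # ReLU
    out_b = np.maximum(h_b * gate_from_a, 0.0)
    return out_a, out_b, exc_a, exc_b


def init_params(rng, c_in, c_out, d=8):
    # Hypothetical parameter layout; weights are shared by both entities.
    return {
        "w_q": 0.1 * rng.standard_normal((c_in, d)),
        "w_k": 0.1 * rng.standard_normal((c_in, d)),
        "w_out": 0.1 * rng.standard_normal((c_in, c_out)),
        "w_gate": 0.1 * rng.standard_normal((c_out, c_out)),
    }


rng = np.random.default_rng(0)
params = init_params(rng, c_in=3, c_out=16)
left_hand = rng.standard_normal((21, 3))   # 21 joints with 3D coordinates
right_hand = rng.standard_normal((21, 3))
out_left, out_right, exc_left, exc_right = me_gc_layer(left_hand, right_hand, params)
print(out_left.shape, out_right.shape)
```

Because each entity's output depends on the other at both the topology and the feature step, stacking such layers exchanges information at every depth, which is the contrast with late fusion that the abstract draws. A real model would also handle the temporal dimension, batching, and learned parameters, all of which are omitted here.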