Skeleton-Based Action Recognition with Spatial-Structural Graph Convolution
This addresses fine-grained human activity recognition for computer vision applications, but appears incremental as it builds on existing GCN approaches.
The paper tackles the problems of limited information aggregation at edge nodes and over-smoothing in Graph Convolutional Networks for skeleton-based action recognition, proposing a two-stream Spatial-Structural GCN method that achieves good results on NTU RGB+D datasets.
Human Activity Recognition (HAR) is a field of study that focuses on identifying and classifying human activities. Skeleton-based Human Activity Recognition has received much attention in recent years, where Graph Convolutional Network (GCN) based method is widely used and has achieved remarkable results. However, the representation of skeleton data and the issue of over-smoothing in GCN still need to be studied. 1). Compared to central nodes, edge nodes can only aggregate limited neighbor information, and different edge nodes of the human body are always structurally related. However, the information from edge nodes is crucial for fine-grained activity recognition. 2). The Graph Convolutional Network suffers from a significant over-smoothing issue, causing nodes to become increasingly similar as the number of network layers increases. Based on these two ideas, we propose a two-stream graph convolution method called Spatial-Structural GCN (SpSt-GCN). Spatial GCN performs information aggregation based on the topological structure of the human body, and structural GCN performs differentiation based on the similarity of edge node sequences. The spatial connection is fixed, and the human skeleton naturally maintains this topology regardless of the actions performed by humans. However, the structural connection is dynamic and depends on the type of movement the human body is performing. Based on this idea, we also propose an entirely data-driven structural connection, which greatly increases flexibility. We evaluate our method on two large-scale datasets, i.e., NTU RGB+D and NTU RGB+D 120. The proposed method achieves good results while being efficient.