CV | GR | Dec 16, 2020

Temporal Graph Modeling for Skeleton-based Action Recognition

arXiv:2012.08804v1 | 5 citations
AI Analysis

This work provides an incremental improvement in temporal modeling for skeleton-based action recognition, which is relevant for applications like human-computer interaction and surveillance.

This paper addresses the limitation of existing Graph Convolutional Networks (GCNs) in fully exploring the temporal dynamics of skeleton sequences for action recognition. The proposed Temporal Enhanced Graph Convolutional Network (TE-GCN) constructs a temporal relation graph to capture complex temporal dynamics, including relations between non-adjacent time steps, and achieves state-of-the-art performance on NTU-60 RGB+D and NTU-120 RGB+D datasets.

Graph Convolutional Networks (GCNs), which model skeleton data as graphs, have achieved remarkable performance in skeleton-based action recognition. In particular, the temporal dynamics of a skeleton sequence convey significant information for the recognition task. To model temporal dynamics, GCN-based methods only stack multiple layers of 1D local convolutions, which extract temporal relations between adjacent time steps. After many repeated local convolutions, key temporal information linking non-adjacent time steps may be lost to information dilution. How to fully exploit the temporal dynamics of skeleton sequences therefore remains unclear for these methods. In this paper, we propose a Temporal Enhanced Graph Convolutional Network (TE-GCN) to tackle this limitation. The proposed TE-GCN constructs a temporal relation graph to capture complex temporal dynamics. Specifically, the temporal relation graph explicitly builds connections between semantically related temporal features, modeling temporal relations between both adjacent and non-adjacent time steps. In addition, to explore the temporal dynamics more fully, a multi-head mechanism is designed to capture multiple kinds of temporal relations. Extensive experiments are performed on two widely used large-scale datasets, NTU-60 RGB+D and NTU-120 RGB+D. The experimental results show that the proposed model achieves state-of-the-art performance through its contribution to temporal modeling for action recognition.
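
The idea of a temporal relation graph with several heads can be illustrated with a small sketch. This is not the authors' TE-GCN implementation. It assumes that "semantically related" is measured by a scaled dot product between projected per-frame features, that each graph is row-normalised with a softmax, and that random matrices stand in for learned weights. All function names (`temporal_relation_graph`, `multi_head_aggregate`, `random_matrix`) and sizes are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_relation_graph(feats, w_query, w_key):
    """Return a T x T row-normalised adjacency over time steps.

    Assumption (not from the paper): relatedness between two time steps
    is a scaled dot product of their projected features.
    """
    q = matmul(feats, w_query)              # T x D
    k = matmul(feats, w_key)                # T x D
    k_t = [list(col) for col in zip(*k)]    # D x T
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, k_t)                 # T x T, every pair of time steps
    return [softmax([s / scale for s in row]) for row in scores]

def multi_head_aggregate(feats, heads):
    """Build one temporal graph per head, aggregate, and concatenate."""
    graphs, per_head = [], []
    for w_query, w_key, w_value in heads:
        adj = temporal_relation_graph(feats, w_query, w_key)
        graphs.append(adj)
        per_head.append(matmul(adj, matmul(feats, w_value)))  # T x D
    # Concatenate the head outputs along the channel axis for each time step.
    out = [sum((head[t] for head in per_head), []) for t in range(len(feats))]
    return out, graphs

def random_matrix(rng, rows, cols, scale=1.0):
    """Hypothetical stand-in for learned weights or extracted features."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
T, C, D, H = 8, 16, 4, 3  # time steps, channels, head width, number of heads
feats = random_matrix(rng, T, C)
heads = [tuple(random_matrix(rng, C, D, 0.1) for _ in range(3)) for _ in range(H)]
out, graphs = multi_head_aggregate(feats, heads)
print(len(out), len(out[0]))  # 8 12
```

Unlike a local 1D convolution, each row of a graph here assigns a nonzero weight to every other time step, so the first and last frames are connected in a single layer. Each head uses its own projections and can therefore weight the same pair of frames differently.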

