LSTA-Net: Long short-term Spatio-Temporal Aggregation Network for Skeleton-based Action Recognition
The work addresses a key bottleneck in action recognition for applications such as surveillance and human-computer interaction, but it is incremental, as it builds on existing graph-based methods.
The paper tackles the problem of capturing long-range spatio-temporal dependencies in skeleton-based action recognition, proposing LSTA-Net, which achieves higher results than state-of-the-art methods on three benchmark datasets.
Modelling various spatio-temporal dependencies is the key to recognising human actions in skeleton sequences. Most existing methods rely excessively on the design of traversal rules or graph topologies to capture the dependencies between moving joints, which is inadequate for reflecting the relationships between distant yet important joints. Furthermore, because these methods apply local operations, important long-range temporal information is not well explored. To address these issues, we propose LSTA-Net, a novel Long short-term Spatio-Temporal Aggregation Network that can effectively capture long- and short-range dependencies in a spatio-temporal manner. We design the model as a purely factorised architecture that alternates between spatial feature aggregation and temporal feature aggregation. To improve feature aggregation, a channel-wise attention mechanism is also designed and employed. Extensive experiments were conducted on three public benchmark datasets, and the results suggest that our approach can capture both long- and short-range dependencies in the spatial and temporal domains, yielding higher results than other state-of-the-art methods. Code is available at https://github.com/tailin1009/LSTA-Net.
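
The factorised design described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (the linked repository holds that). It uses plain Python lists, fixed uniform weights in place of learned parameters, and hypothetical names (`normalised_adjacency`, `spatial_aggregate`, `temporal_aggregate`, `channel_attention`, `factorised_block`). It assumes a skeleton sequence stored as `x[t][v][c]` (frame, joint, channel). A squeeze-and-excitation-style sigmoid gate stands in for the channel-wise attention, whose exact form the abstract does not specify.

```python
import math


def normalised_adjacency(num_joints, edges):
    """Row-normalised adjacency with self-loops (assumed graph form)."""
    adj = [[1.0 if i == j else 0.0 for j in range(num_joints)]
           for i in range(num_joints)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1.0
    return [[w / sum(row) for w in row] for row in adj]


def spatial_aggregate(x, adj):
    """Within each frame, mix every joint's features with its neighbours'."""
    num_joints, channels = len(x[0]), len(x[0][0])
    return [
        [[sum(adj[v][u] * frame[u][c] for u in range(num_joints))
          for c in range(channels)]
         for v in range(num_joints)]
        for frame in x
    ]


def temporal_aggregate(x, window):
    """Per joint and channel, average over frames within +/- window of t."""
    frames, num_joints, channels = len(x), len(x[0]), len(x[0][0])
    out = []
    for t in range(frames):
        lo, hi = max(0, t - window), min(frames, t + window + 1)
        out.append([
            [sum(x[s][v][c] for s in range(lo, hi)) / (hi - lo)
             for c in range(channels)]
            for v in range(num_joints)
        ])
    return out


def channel_attention(x):
    """Scale each channel by a sigmoid gate of its global mean (no learned weights)."""
    frames, num_joints, channels = len(x), len(x[0]), len(x[0][0])
    means = [
        sum(x[t][v][c] for t in range(frames) for v in range(num_joints))
        / (frames * num_joints)
        for c in range(channels)
    ]
    gates = [1.0 / (1.0 + math.exp(-m)) for m in means]
    return [[[x[t][v][c] * gates[c] for c in range(channels)]
             for v in range(num_joints)]
            for t in range(frames)]


def factorised_block(x, adj, window):
    """One spatial step, then one temporal step, then channel gating."""
    return channel_attention(temporal_aggregate(spatial_aggregate(x, adj), window))


# Toy input: 4 frames, 3 joints in a chain (0-1-2), 2 channels.
adj = normalised_adjacency(3, [(0, 1), (1, 2)])
x = [[[float(t + v), 1.0] for v in range(3)] for t in range(4)]
y = factorised_block(x, adj, window=1)
print(len(y), len(y[0]), len(y[0][0]))  # 4 3 2
```

Stacking several such blocks, or widening `window`, is one simple way to let information travel between distant joints and distant frames. How LSTA-Net actually achieves long-range aggregation is defined in the paper and repository, not by this sketch.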