CV AI | Jul 15, 2021

STAR: Sparse Transformer-based Action Recognition

Feng Shi, Chonghan Lee, Liang Qiu, Yizhou Zhao, Tianyi Shen, Shivran Muralidhar, Tian Han, Song-Chun Zhu, Vijaykrishnan Narayanan

arXiv:2107.07089v18.736 citations | Has Code

Originality Highly original

AI Analysis

This work addresses the low training and inference efficiency of action recognition models, offering researchers and practitioners a more efficient alternative to existing models.

The paper tackles the inefficiency of dense graph convolution networks in skeleton-based human action recognition by proposing a sparse Transformer-based model, achieving a 4-18x speedup with a model 1/7 to 1/15 the size of the baselines while maintaining competitive accuracy.

Cognitive systems for human action and behavior have moved into the deep learning regime, and the advent of Graph Convolution Networks in particular has transformed the field in recent years. However, previous works have mainly focused on over-parameterized, complex models based on dense graph convolution networks, resulting in low training and inference efficiency. Meanwhile, Transformer-based models have not yet been well explored for cognitive applications in human action and behavior estimation. This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of the data. Our model can also process video clips of variable length grouped in a single batch. Experiments show that our model achieves comparable performance with far fewer trainable parameters and high speed in training and inference: a 4~18x speedup and 1/7~1/15 of the model size of the baseline models at competitive accuracy.
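The two attention ideas named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, and the paper's exact formulation may differ. It assumes that "sparse attention" means softmax attention computed only along skeleton edges (plus self-loops), and that "segmented linear attention" means kernelized linear attention with an elu(x)+1 feature map applied independently to each clip in a batch of concatenated frames. The function names, the edge-list format, and the `lengths` argument are all hypothetical.

```python
import numpy as np


def sparse_spatial_attention(x, edges, w_q, w_k, w_v):
    """Softmax attention over joints, restricted to skeleton edges (assumed form).

    x: (J, D) joint features; edges: list of undirected (i, j) joint pairs.
    """
    num_joints = x.shape[0]
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    loops = list(range(num_joints))
    # Directed edge list: both directions of every bone, plus self-loops.
    src = np.array([i for i, _ in edges] + [j for _, j in edges] + loops)
    dst = np.array([j for _, j in edges] + [i for i, _ in edges] + loops)
    # One score per edge, so the cost is O(E) instead of O(J^2).
    scores = np.sum(q[dst] * k[src], axis=-1) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())  # constant shift for stability
    # Normalize over the incoming edges of each destination joint.
    denom = np.zeros(num_joints)
    np.add.at(denom, dst, weights)
    weights = weights / denom[dst]
    # Scatter-add the weighted values into their destination joints.
    out = np.zeros_like(v)
    np.add.at(out, dst, weights[:, None] * v[src])
    return out


def feature_map(x):
    """elu(x) + 1: a positive feature map commonly used for linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def segmented_linear_attention(q, k, v, lengths):
    """Linear attention applied per clip (assumed form, non-causal).

    q, k, v: (T_total, D) frames of several clips concatenated along time;
    lengths: number of frames in each clip, summing to T_total.
    """
    out = np.empty_like(v)
    start = 0
    for n in lengths:
        seg = slice(start, start + n)
        qf, kf = feature_map(q[seg]), feature_map(k[seg])
        kv = kf.T @ v[seg]           # (D, D_v) summary, linear in clip length
        norm = qf @ kf.sum(axis=0)   # (n,) per-frame normalizer
        out[seg] = (qf @ kv) / norm[:, None]
        start += n
    return out
```

Under these assumptions, clips of different lengths share one concatenated tensor without padding, and each clip attends only to its own frames, which is one way to read the abstract's claim about variable-length clips in a single batch.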
