CV | AI | Jul 15, 2021

STAR: Sparse Transformer-based Action Recognition

arXiv:2107.07089v1 | 36 citations
Originality: Highly original
AI Analysis

This work addresses the low training and inference efficiency of current action recognition models, offering researchers and practitioners a more efficient alternative.

The paper tackles the inefficiency of dense graph convolution networks in skeleton-based human action recognition by proposing a sparse Transformer-based model. It reports a 4-18x speedup with a model 1/7 to 1/15 the size of the baselines, at competitive accuracy.

The cognitive system for human action and behavior has evolved into a deep learning regime, and the advent of Graph Convolution Networks in particular has transformed the field in recent years. However, previous works have mainly focused on over-parameterized and complex models based on dense graph convolution networks, resulting in low efficiency in training and inference. Meanwhile, Transformer-based models have not yet been well explored for cognitive applications in human action and behavior estimation. This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of the data. Our model can also process variable-length video clips grouped into a single batch. Experiments show that our model achieves comparable performance with far fewer trainable parameters and runs fast in both training and inference: a 4~18x speedup and 1/7~1/15 of the model size compared with the baseline models, at competitive accuracy.
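The two attention patterns named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the undirected edge list with self-loops, and the elu(x) + 1 feature map are assumptions chosen for illustration. The first function scores only joint pairs connected in the skeleton, and the second applies linear attention separately within each clip of a concatenated batch, which is one way variable-length clips could share a batch without padding.

```python
import numpy as np


def sparse_spatial_attention(x, edges, w_q, w_k, w_v):
    """Attention over skeleton edges only (hypothetical helper).

    x:     (num_joints, d_in) joint features for one frame
    edges: list of (i, j) undirected joint pairs; self-loops are added here
    """
    n = x.shape[0]
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Directed edge list: both directions of every bone, plus self-loops.
    src = np.array([i for i, _ in edges] + [j for _, j in edges] + list(range(n)))
    dst = np.array([j for _, j in edges] + [i for i, _ in edges] + list(range(n)))
    # One score per edge, so the cost grows with the edge count, not n^2.
    scores = np.sum(q[dst] * k[src], axis=1) / np.sqrt(q.shape[1])
    # Numerically stable softmax over the incoming edges of each joint.
    max_per_dst = np.full(n, -np.inf)
    np.maximum.at(max_per_dst, dst, scores)
    exp = np.exp(scores - max_per_dst[dst])
    denom = np.zeros(n)
    np.add.at(denom, dst, exp)
    weights = exp / denom[dst]
    # Weighted sum of neighbour values, scattered back to each joint.
    out = np.zeros((n, v.shape[1]))
    np.add.at(out, dst, weights[:, None] * v[src])
    return out


def segmented_linear_attention(q, k, v, seg_ids, num_segs):
    """Linear attention computed independently per clip (hypothetical helper).

    q, k:    (total_frames, d) for all clips concatenated along time
    v:       (total_frames, d_v)
    seg_ids: (total_frames,) integer clip index of every frame
    """
    def feature_map(t):
        # elu(t) + 1, an assumed positive feature map for linear attention.
        return np.where(t > 0, t + 1.0, np.exp(np.minimum(t, 0.0)))

    qf, kf = feature_map(q), feature_map(k)
    # Per-clip summaries: sum of outer products k v^T and sum of k.
    kv = np.zeros((num_segs, k.shape[1], v.shape[1]))
    np.add.at(kv, seg_ids, kf[:, :, None] * v[:, None, :])
    ksum = np.zeros((num_segs, k.shape[1]))
    np.add.at(ksum, seg_ids, kf)
    # Each frame reads only the summary of its own clip.
    num = np.einsum("td,tde->te", qf, kv[seg_ids])
    den = np.einsum("td,td->t", qf, ksum[seg_ids])
    return num / den[:, None]
```

In this sketch, two clips of 3 and 2 frames would be passed as five concatenated frames with seg_ids = [0, 0, 0, 1, 1]; the result matches running each clip on its own, because the per-clip summaries never mix frames from different clips.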

Code Implementations: 1 repo
