CV AI | Mar 10, 2022

End-to-End Semantic Video Transformer for Zero-Shot Action Recognition

arXiv:2203.05156v23.72 citations | h-index: 16 | Has Code

Originality: Highly original

AI Analysis

This work addresses the problem of recognizing unseen action classes in videos for computer vision applications, representing an incremental advance in zero-shot learning.

The authors tackled zero-shot action recognition by proposing an end-to-end transformer model that captures long-range spatiotemporal dependencies, outperforming state-of-the-art methods with top-1 accuracy improvements on UCF-101, HMDB-51, and ActivityNet datasets.

While video action recognition has been an active area of research for several years, zero-shot action recognition has only recently started gaining traction. In this work, we propose a novel end-to-end trained transformer model that efficiently captures long-range spatiotemporal dependencies, in contrast to existing approaches, which use 3D-CNNs. Moreover, to address a common ambiguity in existing works about which classes can be considered previously unseen, we propose a new experimentation setup that satisfies the zero-shot learning premise for action recognition by avoiding overlap between the training and testing classes. The proposed approach significantly outperforms the state of the art in zero-shot action recognition in terms of top-1 accuracy on the UCF-101, HMDB-51 and ActivityNet datasets. The code and proposed experimentation setup are available on GitHub: https://github.com/Secure-and-Intelligent-Systems-Lab/SemanticVideoTransformer
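The zero-shot premise described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the name-matching overlap check, and the cosine-similarity classifier are all assumptions made for illustration, and the paper's actual setup may use a stricter overlap criterion than matching class names.

```python
import math


def find_class_overlap(train_classes, test_classes):
    """Return class names that appear in both splits (hypothetical helper).

    Names are compared after lowercasing and replacing underscores with
    spaces. This is an assumed, simplistic criterion for "overlap".
    """
    def normalize(name):
        return name.strip().lower().replace("_", " ")

    return {normalize(n) for n in train_classes} & {normalize(n) for n in test_classes}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_predict(video_embedding, class_embeddings):
    """Pick the unseen class whose semantic embedding is closest to the video.

    class_embeddings maps a class name to its semantic vector. Nearest-neighbor
    matching in a shared embedding space is a common zero-shot recipe, assumed
    here rather than taken from the paper.
    """
    return max(
        class_embeddings,
        key=lambda name: cosine(video_embedding, class_embeddings[name]),
    )
```

In use, a split would be rejected whenever `find_class_overlap` returns a non-empty set, and `zero_shot_predict` would be called only with embeddings of classes absent from training.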
