CV · AI · May 9, 2024

A Survey on Backbones for Deep Video Action Recognition

arXiv:2405.05584v1 · 4 citations · 2024 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)
Originality: Synthesis-oriented
AI Analysis

It provides a survey for researchers in computer vision, but is incremental as it summarizes existing work without novel contributions.

This paper reviews deep learning methods for video action recognition, categorizing them into two-stream networks, 3D convolutional networks, and transformer-based approaches, without presenting new experimental results or concrete performance numbers.

Action recognition is a key technology for building interactive metaverses. With the rapid development of deep learning, action recognition methods have advanced considerably. Researchers design and implement backbones from multiple standpoints, which leads to a diversity of methods and to new challenges. This paper reviews several action recognition methods based on deep neural networks. We introduce these methods in three parts: 1) two-stream networks and their variants, which, in this paper specifically, take RGB video frames and optical flow as input; 2) 3D convolutional networks, which work on the RGB modality directly, so that separately extracting motion information is no longer necessary; 3) Transformer-based methods, which bring a model from natural language processing into computer vision and video understanding. We offer objective insights in this review and hope it provides a reference for future research.
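As a rough illustration of how the three families differ in the input they consume, the sketch below uses only NumPy. The function names, tensor shapes, and the choice of score averaging for fusion are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)


def two_stream_fusion(rgb_logits, flow_logits):
    """Two-stream idea: one network scores RGB frames, another scores
    optical flow; here the class probabilities are simply averaged
    (late fusion, an assumed fusion rule for this sketch)."""
    return (softmax(rgb_logits) + softmax(flow_logits)) / 2.0


def conv3d_valid(clip, kernel):
    """3D-convolution idea: a single kernel slides over time, height and
    width of a single-channel clip of shape (T, H, W), so motion is
    captured from RGB-like input without a separate flow stream.
    Naive loops, no padding, stride 1."""
    T, H, W = clip.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(clip[i:i + t, j:j + h, k:k + w] * kernel)
    return out


def tubelet_tokens(clip, t, p):
    """Transformer idea: cut a clip of shape (T, H, W, C) into
    non-overlapping t x p x p tubelets and flatten each into one token.
    T must be divisible by t, and H and W by p."""
    T, H, W, C = clip.shape
    x = clip.reshape(T // t, t, H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # group the tubelet grid axes first
    return x.reshape(-1, t * p * p * C)


# Hypothetical shapes, chosen only for the example.
fused = two_stream_fusion(np.array([2.0, 0.5, 0.1]), np.array([1.0, 1.5, 0.2]))
features = conv3d_valid(np.ones((4, 6, 6)), np.ones((3, 3, 3)))
tokens = tubelet_tokens(np.zeros((8, 32, 32, 3)), t=2, p=16)
```

In a real backbone each of these steps is learned and stacked many times; the sketch only shows what each family takes as input and what shape it produces.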
