CV | Nov 10, 2022

SWTF: Sparse Weighted Temporal Fusion for Drone-Based Activity Recognition

Santosh Kumar Yadav, Esha Pahwa, Achleshwar Luthra, Kamlesh Tiwari, Hari Mohan Pandey, Peter Corcoran

arXiv:2211.05531v11.44 citations | h-index: 34

Originality: Incremental advance

AI Analysis

This work addresses the challenge of recognizing human activities from drone footage, which is important for applications like surveillance and sports analysis, and it represents an incremental improvement by optimizing existing deep CNN architectures.

The paper tackles drone-based human activity recognition by proposing a Sparse Weighted Temporal Fusion (SWTF) module, which achieves accuracies of 72.76%, 92.56%, and 78.86% on the Okutama, MOD20, and Drone-Action benchmark datasets, respectively, surpassing previous state-of-the-art performance.

Drone-camera-based human activity recognition (HAR) has received significant attention from the computer vision research community in the past few years. A robust and efficient HAR system plays a pivotal role in fields such as video surveillance, crowd behavior analysis, sports analysis, and human-computer interaction. The task is challenging because of complex poses, varying viewpoints, and the environmental conditions in which the action takes place. To address these complexities, in this paper we propose a novel Sparse Weighted Temporal Fusion (SWTF) module that uses sparsely sampled video frames to obtain a global weighted temporal fusion outcome. The proposed SWTF has two components. The first is a temporal segment network that sparsely samples a given set of frames. The second is weighted temporal fusion, which fuses feature maps derived from optical flow with raw RGB images. This is followed by a base network, comprising a convolutional neural network module and fully connected layers, that performs the activity recognition. The SWTF network can be used as a plug-in module for existing deep CNN architectures, enabling them to learn temporal information without a separate temporal stream. It has been evaluated on three publicly available benchmark datasets, namely Okutama, MOD20, and Drone-Action. The proposed model achieves accuracies of 72.76%, 92.56%, and 78.86% on the respective datasets, thereby surpassing the previous state-of-the-art performance by a significant margin.
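The two SWTF components described in the abstract can be illustrated with a rough sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the sampling rule or the fusion weights, so the segment-based sampling follows the general temporal-segment idea (one frame per equal-length segment), and the function names `sparse_sample_indices` and `weighted_fusion` and the blending weight `alpha` are hypothetical.

```python
import random
from typing import List, Optional


def sparse_sample_indices(num_frames: int, num_segments: int,
                          rng: Optional[random.Random] = None) -> List[int]:
    """Pick one frame index from each of `num_segments` equal-length segments.

    Assumption: with rng=None the middle frame of each segment is used
    (deterministic); otherwise a random frame inside each segment is drawn.
    """
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need 1 <= num_segments <= num_frames")
    indices = []
    for k in range(num_segments):
        start = k * num_frames // num_segments        # inclusive
        end = (k + 1) * num_frames // num_segments    # exclusive
        if rng is None:
            indices.append((start + end - 1) // 2)
        else:
            indices.append(rng.randrange(start, end))
    return indices


def weighted_fusion(rgb: List[float], flow: List[float],
                    alpha: float = 0.5) -> List[float]:
    """Blend two equally sized feature vectors.

    Assumption: a simple convex combination alpha * rgb + (1 - alpha) * flow;
    the weighting actually used in the paper is not stated in the abstract.
    """
    if len(rgb) != len(flow):
        raise ValueError("feature vectors must have the same length")
    return [alpha * r + (1.0 - alpha) * f for r, f in zip(rgb, flow)]


if __name__ == "__main__":
    # 12 frames split into 3 segments: one index per segment.
    print(sparse_sample_indices(12, 3))
    # Toy RGB and optical-flow feature vectors blended with alpha = 0.25.
    print(weighted_fusion([1.0, 2.0], [3.0, 4.0], alpha=0.25))
```

In a real pipeline the fused features would be multi-dimensional feature maps passed on to the base CNN; flat lists are used here only to keep the sketch free of dependencies.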
