CV LG · Oct 14, 2022

MMTSA: Multimodal Temporal Segment Attention Network for Efficient Human Activity Recognition

Ziqi Gao, Yuntao Wang, Jianguo Chen, Junliang Xing, Shwetak Patel, Xin Liu, Yuanchun Shi

arXiv:2210.09222v23.711 citations · h-index: 19 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses efficiency challenges in multimodal human activity recognition for applications like wearable or edge computing, though it is incremental as it builds on existing multimodal and attention-based approaches.

The paper tackles the problem of high computational load in multimodal human activity recognition by proposing the Multimodal Temporal Segment Attention Network (MMTSA), which achieves an 11.13% improvement in cross-subject F1-score on the MMAct dataset compared to previous state-of-the-art methods while reducing computational load and inference latency on edge devices.

Multimodal sensors provide complementary information to develop accurate machine-learning methods for human activity recognition (HAR), but introduce significantly higher computational load, which reduces efficiency. This paper proposes an efficient multimodal neural architecture for HAR, called Multimodal Temporal Segment Attention Network (MMTSA), that uses an RGB camera and inertial measurement units (IMUs). MMTSA first transforms IMU sensor data into a temporal and structure-preserving gray-scale image using the Gramian Angular Field (GAF), representing the inherent properties of human activities. MMTSA then applies a multimodal sparse sampling method to reduce data redundancy. Lastly, MMTSA adopts an inter-segment attention module for efficient multimodal fusion. Using three well-established public datasets, we evaluated MMTSA's effectiveness and efficiency in HAR. Results show that our method improves the cross-subject F1-score on the MMAct dataset by 11.13% over the previous state-of-the-art (SOTA) methods. The ablation study and analysis suggest that MMTSA is effective at fusing multimodal data for accurate HAR. The efficiency evaluation on an edge device showed that MMTSA achieved significantly better accuracy, lower computational load, and lower inference latency than SOTA methods.
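The Gramian Angular Field step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the summation variant of GAF (entries cos(φ_i + φ_j) after min-max rescaling to [-1, 1]) applied to a single IMU channel; the abstract does not say which variant or which preprocessing MMTSA uses, and the function names `gramian_angular_field` and `to_grayscale` are hypothetical, chosen only for this example.

```python
import math


def gramian_angular_field(series):
    """Return the Gramian Angular Summation Field of a 1-D series.

    Assumption: summation variant, G[i][j] = cos(phi_i + phi_j),
    where phi = arccos(x) after rescaling x to [-1, 1].
    """
    lo, hi = min(series), max(series)
    if hi == lo:
        # A constant signal carries no range; map it to 0 by convention.
        scaled = [0.0 for _ in series]
    else:
        scaled = [2.0 * (x - lo) / (hi - lo) - 1.0 for x in series]
    # Clamp to guard against floating-point values just outside [-1, 1].
    scaled = [max(-1.0, min(1.0, x)) for x in scaled]
    phi = [math.acos(x) for x in scaled]
    return [[math.cos(a + b) for b in phi] for a in phi]


def to_grayscale(gaf):
    """Map GAF values in [-1, 1] to 8-bit gray levels in [0, 255]."""
    return [[int(round((v + 1.0) * 127.5)) for v in row] for row in gaf]


if __name__ == "__main__":
    # Hypothetical accelerometer samples from one IMU axis.
    window = [0.0, 0.4, 0.9, 0.3, -0.2, -0.8]
    image = to_grayscale(gramian_angular_field(window))
    for row in image:
        print(row)
```

Each time step pair (i, j) becomes one pixel, so a window of n samples yields an n × n image that keeps the temporal order along both axes. A real pipeline would likely build one such image per sensor axis and resize it before feeding a CNN; those details are not given in the abstract.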
