CV · Nov 4, 2024

ARN-LSTM: A Multi-Stream Fusion Model for Skeleton-based Action Recognition

arXiv:2411.01769v2 · h-index: 3
Originality: Incremental advance
AI Analysis

This work aims to improve action recognition accuracy for applications such as surveillance and human-computer interaction, but it appears incremental because it builds on existing multi-stream and LSTM-based approaches.

The paper tackles the challenge of capturing both spatial motion and temporal dynamics in skeleton-based action recognition. It proposes the ARN-LSTM architecture, which integrates joint, motion, and temporal information through a multi-stream fusion model, and reports superior performance on the NTU RGB+D datasets, especially for group activities.

This paper presents the ARN-LSTM architecture, a novel multi-stream action recognition model designed to address the challenge of simultaneously capturing spatial motion and temporal dynamics in action sequences. Traditional methods often focus solely on spatial or temporal features, limiting their ability to fully comprehend complex human activities. Our proposed model integrates joint, motion, and temporal information through a multi-stream fusion architecture. Specifically, it comprises a joint stream for extracting skeleton features, a temporal stream for capturing dynamic temporal features, and an ARN-LSTM block that utilizes Time-Distributed Long Short-Term Memory (TD-LSTM) layers followed by an Attention Relation Network (ARN) to model temporal relations. The outputs from these streams are fused in a fully connected layer to provide the final action prediction. Evaluations on the NTU RGB+D 60 and NTU RGB+D 120 datasets demonstrate the superior performance of our model, particularly in group activity recognition.
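
To make the fusion step in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that each stream has already produced a fixed-length feature vector, uses random numbers in place of real joint, temporal, and TD-LSTM outputs, and substitutes simple dot-product attention pooling for the Attention Relation Network, whose exact formulation the abstract does not give. All function names (`attention_pool`, `fuse_streams`) and sizes are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Weight per-frame vectors by softmax(frame . query) and sum them.

    Assumption: a stand-in for the ARN, not the paper's formulation.
    """
    scores = [sum(f_i * q_i for f_i, q_i in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    return [sum(w * frame[d] for w, frame in zip(weights, frames)) for d in range(dim)]


def linear(x, weight, bias):
    """Fully connected layer: one output per row of `weight`."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b for row, b in zip(weight, bias)]


def fuse_streams(joint_feat, temporal_feat, relation_feat, weight, bias):
    """Concatenate the three stream outputs and classify with one FC layer."""
    fused = joint_feat + temporal_feat + relation_feat  # list concatenation
    return softmax(linear(fused, weight, bias))


# Hypothetical sizes and random placeholder features (no real skeleton data).
rng = random.Random(0)
num_classes, dim, num_frames = 4, 3, 5

joint_feat = [rng.uniform(-1, 1) for _ in range(dim)]
temporal_feat = [rng.uniform(-1, 1) for _ in range(dim)]
frames = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_frames)]  # stand-in for TD-LSTM outputs
query = [rng.uniform(-1, 1) for _ in range(dim)]
relation_feat = attention_pool(frames, query)

weight = [[rng.uniform(-0.5, 0.5) for _ in range(3 * dim)] for _ in range(num_classes)]
bias = [0.0] * num_classes

probs = fuse_streams(joint_feat, temporal_feat, relation_feat, weight, bias)
predicted = max(range(num_classes), key=probs.__getitem__)
```

In this sketch the fully connected layer sees all three streams at once, so its weights can learn how much each stream contributes to each class. Concatenation followed by one linear layer is only one plausible reading of "fused in a fully connected layer".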
