CV, Jul 22, 2023

Two-stream Multi-level Dynamic Point Transformer for Two-person Interaction Recognition

arXiv:2307.11973v2, 8 citations, h-index: 72
Originality: Incremental advance
AI Analysis

This work addresses a challenging task in human action recognition with applications in privacy-sensitive smart systems, but it is incremental as it builds on existing point cloud and transformer methods.

The paper tackles the problem of recognizing two-person interactions from videos by proposing a point cloud-based network that incorporates spatial, appearance, and motion information, achieving state-of-the-art performance on NTU RGB+D datasets.

As a fundamental aspect of human life, two-person interactions contain meaningful information about people's activities, relationships, and social settings. Human action recognition serves as the foundation for many smart applications, with a strong focus on personal privacy. However, recognizing two-person interactions poses more challenges due to increased body occlusion and overlap compared to single-person actions. In this paper, we propose a point cloud-based network named Two-stream Multi-level Dynamic Point Transformer for two-person interaction recognition. Our model addresses the challenge of recognizing two-person interactions by incorporating local-region spatial information, appearance information, and motion information. To achieve this, we introduce a designed frame selection method named Interval Frame Sampling (IFS), which efficiently samples frames from videos, capturing more discriminative information in a relatively short processing time. Subsequently, a frame features learning module and a two-stream multi-level feature aggregation module extract global and partial features from the sampled frames, effectively representing the local-region spatial information, appearance information, and motion information related to the interactions. Finally, we apply a transformer to perform self-attention on the learned features for the final classification. Extensive experiments are conducted on two large-scale datasets, the interaction subsets of NTU RGB+D 60 and NTU RGB+D 120. The results show that our network outperforms state-of-the-art approaches in most standard evaluation settings.
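
The abstract names Interval Frame Sampling (IFS) but does not specify how it works. The sketch below shows one plausible reading of interval-based sampling: split the video into equal-width intervals and take one frame from each. The function name `interval_frame_sampling`, its parameters, and the choice of the middle frame of each interval are all assumptions made for illustration, not the paper's definition.

```python
from typing import List


def interval_frame_sampling(num_frames: int, num_samples: int) -> List[int]:
    """Return one frame index per equal-width interval of the video.

    Hypothetical illustration only: the paper's IFS may choose frames
    differently (for example, randomly within each interval).
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")

    indices = []
    for i in range(num_samples):
        # Integer boundaries of the i-th interval: [start, end).
        start = (i * num_frames) // num_samples
        end = ((i + 1) * num_frames) // num_samples
        # Last valid index in the interval; falls back to start if the
        # interval is empty (video shorter than num_samples).
        last = max(end - 1, start)
        # Assumption: take the middle frame of the interval.
        indices.append((start + last) // 2)
    return indices


if __name__ == "__main__":
    print(interval_frame_sampling(100, 4))
```

Under this reading, the sampled frames cover the whole clip evenly whatever its length, and a video with fewer frames than requested samples simply repeats some indices.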
