CV · Sep 10, 2023

Unified Contrastive Fusion Transformer for Multimodal Human Action Recognition

arXiv:2309.05032v11.51 citations · h-index: 8

Originality: Highly original

AI Analysis

This work addresses robust action recognition for applications such as surveillance and healthcare by fusing multimodal sensor data. It represents an incremental improvement achieved through a novel fusion architecture.

The paper tackles multimodal human action recognition by introducing the Unified Contrastive Fusion Transformer (UCFFormer) to integrate data from diverse sensors, achieving state-of-the-art performance on UTD-MHAD and NTU RGB+D datasets with considerable margins over competing methods.

Various types of sensors have been considered to develop human action recognition (HAR) models. Robust HAR performance can be achieved by fusing multimodal data acquired by different sensors. In this paper, we introduce a new multimodal fusion architecture, referred to as Unified Contrastive Fusion Transformer (UCFFormer), designed to integrate data with diverse distributions to enhance HAR performance. Based on the embedding features extracted from each modality, UCFFormer employs the Unified Transformer to capture the inter-dependency among embeddings in both time and modality domains. We present the Factorized Time-Modality Attention to perform self-attention efficiently for the Unified Transformer. UCFFormer also incorporates contrastive learning to reduce the discrepancy in feature distributions across various modalities, thus generating semantically aligned features for information fusion. Performance evaluation conducted on two popular datasets, UTD-MHAD and NTU RGB+D, demonstrates that UCFFormer achieves state-of-the-art performance, outperforming competing methods by considerable margins.
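The two mechanisms named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the exact formulation, so the sketch assumes that "factorized" means attending over the time axis within each modality and then over the modality axis at each time step, and it uses a symmetric InfoNCE loss as a stand-in for the contrastive objective. The function names, the identity query/key/value projections, the single attention head, and the temperature value are all hypothetical choices made for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Single-head scaled dot-product self-attention over the
    # second-to-last axis of x, shape (..., n, d).
    # Assumption: identity Q/K/V projections, to keep the sketch short.
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)  # (..., n, n)
    return softmax(scores, axis=-1) @ x

def factorized_time_modality_attention(x):
    # x: (M, T, d) embeddings for M modalities and T time steps.
    # Step 1 (assumed): attend across time within each modality.
    x = x + self_attention(x)
    # Step 2 (assumed): attend across modalities at each time step.
    xt = np.swapaxes(x, 0, 1)          # (T, M, d)
    xt = xt + self_attention(xt)
    return np.swapaxes(xt, 0, 1)       # back to (M, T, d)

def info_nce(a, b, temperature=0.1):
    # Symmetric InfoNCE between two modalities (a stand-in for the
    # paper's contrastive loss). a, b: (B, d); row i of a and row i
    # of b come from the same sample and form the positive pair.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature     # (B, B) cosine similarities
    idx = np.arange(len(a))
    log_p_ab = np.log(softmax(logits, axis=1))[idx, idx]
    log_p_ba = np.log(softmax(logits, axis=0))[idx, idx]
    return -(log_p_ab.mean() + log_p_ba.mean()) / 2

# Usage: 3 modalities, 5 time steps, 8-dimensional embeddings.
rng = np.random.default_rng(0)
fused = factorized_time_modality_attention(rng.normal(size=(3, 5, 8)))
print(fused.shape)
```

Under these assumptions, the efficiency gain comes from the size of the score matrices: joint attention over all M·T tokens needs (M·T)² pairwise scores, whereas the factorized form needs M·T² for the time pass plus T·M² for the modality pass.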
