CV AI LG MMAug 1, 2023

MAiVAR-T: Multimodal Audio-image and Video Action Recognizer using Transformers

Muhammad Bilal Shaikh, Douglas Chai, Syed Mohammed Shamsul Islam, Naveed Akhtar

arXiv:2308.03741v15.97 citationsh-index: 32

Originality Incremental advance

AI Analysis

This addresses action recognition for applications like surveillance or human-computer interaction, but appears incremental as it builds on existing multimodal approaches.

The paper tackles multimodal human action recognition by proposing MAiVAR-T, a model that integrates audio and video modalities, and reports superior performance compared to state-of-the-art methods on a benchmark dataset.

In line with the human capacity to perceive the world by simultaneously processing and integrating high-dimensional inputs from multiple modalities like vision and audio, we propose a novel model, MAiVAR-T (Multimodal Audio-Image to Video Action Recognition Transformer). This model employs an intuitive approach for the combination of audio-image and video modalities, with a primary aim to escalate the effectiveness of multimodal human action recognition (MHAR). At the core of MAiVAR-T lies the significance of distilling substantial representations from the audio modality and transmuting these into the image domain. Subsequently, this audio-image depiction is fused with the video modality to formulate a unified representation. This concerted approach strives to exploit the contextual richness inherent in both audio and video modalities, thereby promoting action recognition. In contrast to existing state-of-the-art strategies that focus solely on audio or video modalities, MAiVAR-T demonstrates superior performance. Our extensive empirical evaluations conducted on a benchmark action recognition dataset corroborate the model's remarkable performance. This underscores the potential enhancements derived from integrating audio and video modalities for action recognition purposes.

View on arXiv PDF

Similar