CV | Jan 13, 2025

Aligning First, Then Fusing: A Novel Weakly Supervised Multimodal Violence Detection Method

arXiv:2501.07496v2 | 10 citations | h-index: 4 | Has Code | Knowledge-Based Systems

Originality: Incremental advance

AI Analysis

This work addresses violence detection for video analysis applications, representing an incremental improvement over existing multimodal fusion methods.

The paper tackles weakly supervised violence detection in videos by proposing a novel multimodal semantic feature alignment method that maps less informative modalities into the RGB feature space, achieving an average precision of 86.07% on the XD-Violence dataset.
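For readers unfamiliar with the metric, the following is a minimal sketch of how average precision (AP) can be computed from binary labels and predicted scores. It is an illustration only: the function name `average_precision` and the toy data are assumptions, the paper's own evaluation script may differ (for example in how tied scores are handled), and the toy numbers are unrelated to the reported 86.07%.

```python
def average_precision(labels, scores):
    """Average precision for binary labels (1 = violent, 0 = normal).

    Items are ranked by descending score; the precision at each positive
    item is averaged over all positives. Ties keep their input order,
    which is a simplification.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")

    # Indices sorted from highest to lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / n_pos


# Toy example (made-up values): positives land at ranks 1 and 3.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(labels, scores)  # (1/1 + 2/3) / 2 = 5/6
```

A perfect ranking, with every positive scored above every negative, gives an AP of 1.0.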

Weakly supervised violence detection trains models to identify violent segments in videos using only video-level labels. Among these approaches, multimodal violence detection, which integrates modalities such as audio and optical flow, holds great potential. Existing methods in this domain primarily focus on designing multimodal fusion models to address modality discrepancies. In contrast, we take a different approach: we leverage the inherent discrepancies across modalities in violence event representation to propose a novel multimodal semantic feature alignment method. This method sparsely maps the semantic features of local, transient, and less informative modalities (such as audio and optical flow) into the more informative RGB semantic feature space. Through an iterative process, the method identifies a suitable non-zero feature matching subspace and aligns the modality-specific event representations based on this subspace, enabling the full exploitation of information from all modalities during the subsequent modality fusion stage. Building on this, we design a new weakly supervised violence detection framework that consists of unimodal multiple-instance learning for extracting unimodal semantic features, multimodal alignment, multimodal fusion, and final detection. Experimental results on benchmark datasets demonstrate the effectiveness of our method, which achieves an average precision (AP) of 86.07% on the XD-Violence dataset. Our code is available at https://github.com/xjpp2016/MAVD.
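To make the weak-supervision setting concrete, here is a minimal sketch of one common multiple-instance learning (MIL) formulation: per-snippet scores are pooled into a single video-level score, which is then compared with the video-level label. Top-k mean pooling and binary cross-entropy are assumptions chosen for illustration. The abstract does not specify the pooling or loss used in the paper, and the names `video_score` and `bce` are hypothetical rather than taken from the MAVD repository.

```python
import math


def video_score(snippet_scores, k):
    """Pool per-snippet violence scores into one video-level score.

    Uses the mean of the k highest scores (top-k MIL pooling), an
    assumed pooling rule that is common in weakly supervised detection.
    """
    if not snippet_scores:
        raise ValueError("need at least one snippet score")
    k = max(1, min(k, len(snippet_scores)))  # clamp k to a valid range
    top = sorted(snippet_scores, reverse=True)[:k]
    return sum(top) / k


def bce(p, label, eps=1e-7):
    """Binary cross-entropy between a probability and a 0/1 video label."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# Made-up snippet scores for one video; only the video-level label
# (1 = contains violence) is available during training.
scores = [0.05, 0.10, 0.90, 0.80, 0.07, 0.03]
p = video_score(scores, k=2)   # mean of 0.90 and 0.80
loss = bce(p, label=1)         # small, since the top snippets score high
```

Because the loss depends only on the highest-scoring snippets, training pushes a few snippets of a violent video toward high scores without requiring frame-level annotation.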

