CV
Dec 29, 2024

Cross-Modal Fusion and Attention Mechanism for Weakly Supervised Video Anomaly Detection

arXiv:2412.20455v1
21 citations
h-index: 6
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Originality: Incremental advance
AI Analysis

This work addresses video anomaly detection for safety and content moderation, but it appears incremental as it builds on existing weakly supervised methods with new fusion and attention mechanisms.

The paper tackles weakly supervised video anomaly detection for violence and nudity by proposing a multi-modal framework with a Cross-modal Fusion Adapter and Hyperbolic Lorentzian Graph Attention, achieving state-of-the-art results on benchmark datasets.

Weakly supervised video anomaly detection (WS-VAD) has recently emerged as a research direction for identifying anomalous events such as violence and nudity in videos using only video-level labels. However, the task poses substantial challenges, including imbalanced modality information and the difficulty of consistently distinguishing normal from abnormal features. In this paper, we address these challenges and propose a multi-modal WS-VAD framework to accurately detect anomalies such as violence and nudity. Within the proposed framework, we introduce a new fusion mechanism, the Cross-modal Fusion Adapter (CFA), which dynamically selects and enhances the audio-visual features most relevant to the visual modality. Additionally, we introduce Hyperbolic Lorentzian Graph Attention (HLGAtt) to capture the hierarchical relationships between normal and abnormal representations, thereby improving feature separation. Through extensive experiments, we demonstrate that the proposed model achieves state-of-the-art results on benchmark datasets for violence and nudity detection.
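
To make the geometry behind the name "Hyperbolic Lorentzian Graph Attention" more concrete, here is a minimal sketch of standard Lorentz-model operations: the Lorentzian inner product, the exponential map at the origin, and the geodesic distance on the unit hyperboloid. The abstract does not say how HLGAtt computes its attention, so the scoring rule below (a softmax over negative squared distances), the function names, and the toy snippet features are all assumptions made for illustration. It is not the authors' implementation.

```python
import math

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def exp_map_origin(v):
    """Lift a Euclidean feature v onto the unit hyperboloid (curvature -1)."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        return [1.0] + [0.0] * len(v)
    scale = math.sinh(norm) / norm
    return [math.cosh(norm)] + [scale * a for a in v]

def lorentz_distance(x, y):
    """Geodesic distance arccosh(-<x, y>_L), clamped for numerical safety."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

def attention_weights(query, keys):
    """Softmax over negative squared distances (assumed scoring rule)."""
    scores = [-lorentz_distance(query, k) ** 2 for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 2-D snippet features; real features would come from a backbone.
snippets = [[0.1, 0.0], [0.2, 0.1], [1.5, -1.0]]
points = [exp_map_origin(v) for v in snippets]
weights = attention_weights(points[0], points)
```

In this sketch, snippets that lie close together on the hyperboloid receive larger weights, while a distant snippet (the third one here) is down-weighted. This is one simple way hyperbolic distance could be used to separate dissimilar representations.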

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
