CV | Sep 21, 2024

Vision-Language Models Assisted Unsupervised Video Anomaly Detection

arXiv:2409.14109v2 | 6 citations | h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses video anomaly detection for industrial and academic applications, offering an incremental improvement through cross-modal integration.

The paper tackles the challenge of detecting unpredictable anomalies in videos with scarce samples by proposing VLAVAD, which uses vision-language models and a Sequence State Space Module to map visual features to semantic ones, achieving state-of-the-art results on the ShanghaiTech dataset.

Video anomaly detection is a subject of great interest across industrial and academic domains due to its crucial role in computer vision applications. However, the inherent unpredictability of anomalies and the scarcity of anomaly samples present significant challenges for unsupervised learning methods. To overcome the limitations of unsupervised learning, which stem from a lack of comprehensive prior knowledge about anomalies, we propose VLAVAD (Video-Language Models Assisted Anomaly Detection). Our method employs a cross-modal pre-trained model that leverages the inferential capabilities of large language models (LLMs) in conjunction with a Selective-Prompt Adapter (SPA) for selecting the semantic space. Additionally, we introduce a Sequence State Space Module (S3M) that detects temporal inconsistencies in semantic features. By mapping high-dimensional visual features to low-dimensional semantic ones, our method significantly enhances the interpretability of unsupervised anomaly detection. Our proposed approach effectively tackles the challenge of detecting elusive anomalies that are hard to discern over time, achieving state-of-the-art (SOTA) results on the challenging ShanghaiTech dataset.
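
The abstract describes two ideas that a small example can make concrete: mapping a high-dimensional visual feature to a low-dimensional semantic vector, and flagging frames whose semantic vector is inconsistent with the recent past. The sketch below is only an illustration under stated assumptions and is not the VLAVAD implementation. The abstract does not specify how SPA or S3M work, so the function names (`to_semantic`, `inconsistency_scores`), the use of cosine similarity against prompt embeddings, and the exponentially smoothed running state (a simple stand-in for a state space model) are all hypothetical choices, and the feature vectors are toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def to_semantic(frame_feature, prompt_embeddings):
    """Map one visual feature to a semantic vector with one similarity per prompt.

    Assumption: prompts and frames live in a shared embedding space, as in
    cross-modal pre-trained models. The output has len(prompt_embeddings) dims.
    """
    return [cosine(frame_feature, p) for p in prompt_embeddings]


def inconsistency_scores(semantic_seq, decay=0.8):
    """Score each frame by its distance from an exponentially smoothed state.

    Assumption: this running average is a simplified stand-in for a learned
    sequence model; it is not the S3M module from the paper.
    """
    if not semantic_seq:
        return []
    state = list(semantic_seq[0])
    scores = [0.0]  # the first frame has no history to disagree with
    for s in semantic_seq[1:]:
        scores.append(math.dist(s, state))
        # Update the state after scoring so the current frame cannot mask itself.
        state = [decay * h + (1.0 - decay) * x for h, x in zip(state, s)]
    return scores


# Toy data: two hypothetical prompt embeddings and six frame features.
prompts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
normal = [1.0, 0.1, 0.0]
unusual = [0.1, 1.0, 0.0]
frames = [normal, normal, normal, unusual, normal, normal]

semantic_seq = [to_semantic(f, prompts) for f in frames]
scores = inconsistency_scores(semantic_seq)
most_anomalous = scores.index(max(scores))
print("per-frame scores:", [round(s, 3) for s in scores])
print("most inconsistent frame index:", most_anomalous)
```

Because each semantic dimension corresponds to a named prompt, a high score can be traced back to the prompts whose similarity changed, which is one way to read the interpretability claim in the abstract.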

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
