CV | Dec 9, 2022

CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection

Hyekang Kevin Joo, Khoa Vo, Kashu Yamazaki, Ngan Le

CMU

arXiv:2212.05136v324.3131 citations | h-index: 24 | Has Code

Originality: Highly original

AI Analysis

This work addresses the problem of localizing anomalies in surveillance videos under weak supervision and reports large gains over existing state-of-the-art methods on three benchmark datasets.

The paper tackles weakly-supervised video anomaly detection by using CLIP's ViT features and a novel Temporal Self-Attention method, achieving state-of-the-art performance by a large margin on three benchmark datasets.

Video anomaly detection (VAD), commonly formulated as a weakly-supervised multiple-instance learning problem because annotating anomalies is labor-intensive, is a challenging problem in video surveillance in which the anomalous frames must be localized in an untrimmed video. In this paper, we first propose to use the ViT-encoded visual features from CLIP, in contrast with the C3D or I3D features conventional in the domain, to efficiently extract discriminative representations. We then model temporal dependencies and nominate the snippets of interest with our proposed Temporal Self-Attention (TSA). The ablation study confirms the effectiveness of TSA and the ViT features. Extensive experiments show that the proposed CLIP-TSA outperforms existing state-of-the-art (SOTA) methods by a large margin on three commonly used VAD benchmark datasets (UCF-Crime, ShanghaiTech Campus, and XD-Violence). Our source code is available at https://github.com/joos2010kj/CLIP-TSA.
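The multiple-instance learning formulation mentioned in the abstract can be illustrated with a minimal sketch. It is not the CLIP-TSA implementation: it assumes a generic top-k pooling scheme, a common choice in weakly-supervised VAD, in which a video is a bag of per-snippet anomaly scores and only a video-level label is available. The names `video_score` and `bce_loss`, the value of `k`, and the example scores are all hypothetical.

```python
import math


def video_score(snippet_scores, k=3):
    """Bag-level score: mean of the k highest snippet anomaly scores.

    Assumption: top-k mean pooling stands in for whatever snippet
    selection a real model uses (CLIP-TSA uses its own TSA module).
    """
    if not snippet_scores:
        raise ValueError("a video needs at least one snippet score")
    k = max(1, min(k, len(snippet_scores)))
    top_k = sorted(snippet_scores, reverse=True)[:k]
    return sum(top_k) / k


def bce_loss(pred, label, eps=1e-7):
    """Binary cross-entropy between a bag score and a video-level label."""
    pred = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(pred) + (1 - label) * math.log(1.0 - pred))


# Hypothetical per-snippet scores in [0, 1]; only the video label is known.
normal_video = [0.05, 0.10, 0.08, 0.12, 0.07]
anomalous_video = [0.05, 0.10, 0.90, 0.85, 0.07]

normal_score = video_score(normal_video, k=2)
anomalous_score = video_score(anomalous_video, k=2)

# Video-level labels: 0 = normal, 1 = contains an anomaly somewhere.
normal_loss = bce_loss(normal_score, 0)
anomalous_loss = bce_loss(anomalous_score, 1)
```

Because the loss is computed only on the pooled score, the snippet scores themselves never need frame-level labels. At test time the per-snippet scores can be read out directly to localize the anomaly.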
