CV · Oct 4, 2023

Delving into CLIP latent space for Video Anomaly Recognition

Luca Zanella, Benedetta Liberatori, Willi Menapace, Fabio Poiesi, Yiming Wang, Elisa Ricci

arXiv:2310.02835v1 · 16.863 citations · h-index: 20 · Has Code

Originality: Incremental advance

AI Analysis

The paper addresses automated analysis of surveillance video for security applications. It is an incremental advance in that it applies existing models to a new task.

The paper tackles frame-level video anomaly detection and classification using only video-level supervision. It introduces AnomalyCLIP, which combines CLIP with multiple instance learning and manipulates CLIP's latent features to identify abnormal events. It reports state-of-the-art performance on the ShanghaiTech, UCF-Crime, and XD-Violence benchmarks.

We tackle the complex problem of detecting and recognising anomalies in surveillance videos at the frame level, utilising only video-level supervision. We introduce the novel method AnomalyCLIP, the first to combine Large Language and Vision (LLV) models, such as CLIP, with multiple instance learning for joint video anomaly detection and classification. Our approach specifically involves manipulating the latent CLIP feature space to identify the normal event subspace, which in turn allows us to effectively learn text-driven directions for abnormal events. When anomalous frames are projected onto these directions, they exhibit a large feature magnitude if they belong to a particular class. We also introduce a computationally efficient Transformer architecture to model short- and long-term temporal dependencies between frames, ultimately producing the final anomaly score and class prediction probabilities. We compare AnomalyCLIP against state-of-the-art methods considering three major anomaly detection benchmarks, i.e. ShanghaiTech, UCF-Crime, and XD-Violence, and empirically show that it outperforms baselines in recognising video anomalies.
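
The projection idea in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation. It assumes that the normal event subspace can be approximated by the mean feature of normal frames, that each class direction points from that mean toward a class text embedding, and that a video-level score is the mean of the top-k frame scores (a common multiple instance learning choice). All function names and the toy 2-D vectors are hypothetical, and the Transformer temporal model is omitted.

```python
import math

def normal_prototype(normal_feats):
    """Mean feature of frames assumed normal (assumed stand-in for the normal subspace)."""
    n, dim = len(normal_feats), len(normal_feats[0])
    return [sum(f[d] for f in normal_feats) / n for d in range(dim)]

def unit_direction(text_feat, prototype):
    """Unit vector from the normal prototype toward a class text embedding."""
    diff = [t - p for t, p in zip(text_feat, prototype)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def project(frame_feat, prototype, direction):
    """Signed magnitude of the re-centred frame feature along a class direction."""
    return sum((f - p) * d for f, p, d in zip(frame_feat, prototype, direction))

def topk_video_score(frame_scores, k):
    """MIL-style video-level score: mean of the k largest frame scores (assumption)."""
    top = sorted(frame_scores, reverse=True)[:k]
    return sum(top) / len(top)

# Toy 2-D "embeddings" (hypothetical values, not real CLIP outputs).
normal_feats = [[1.0, 0.0], [1.0, 0.2], [1.0, -0.2]]
prototype = normal_prototype(normal_feats)
class_text = {"fighting": [1.0, 2.0], "explosion": [3.0, 0.0]}
directions = {c: unit_direction(t, prototype) for c, t in class_text.items()}

# Four frames of one video: frames 1 and 3 deviate along the "fighting" direction.
video = [[1.0, 0.1], [1.0, 1.5], [1.1, 0.0], [1.0, 1.8]]
magnitudes = {c: [project(f, prototype, d) for f in video]
              for c, d in directions.items()}

# Frame-level anomaly score: largest projection over all classes.
frame_scores = [max(magnitudes[c][i] for c in directions) for i in range(len(video))]
# Predicted class per frame: the class whose direction gives the largest magnitude.
frame_classes = [max(directions, key=lambda c: magnitudes[c][i])
                 for i in range(len(video))]
video_score = topk_video_score(frame_scores, k=2)
```

In this toy setup, frames that sit close to the normal prototype have a near-zero magnitude along every class direction, while frames that deviate toward a class text embedding have a large magnitude along that class only. Training with video-level labels would act on `video_score` rather than on individual frames.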
