CV · Jan 6, 2025

Large Language Models for Video Surveillance Applications

Ulindu De Silva, Leon Fernando, Billy Lau Pik Lik, Zann Koh, Sam Conrad Joyce, Belinda Yuen, Chau Yuen

arXiv:2501.02850v1 · 8.45 citations · h-index: 11 · TENCON

Originality: Incremental advance

AI Analysis

The paper addresses the problem of efficient video analysis for surveillance operators, though its contribution is incremental because it builds on existing Vision Language Models.

The paper tackles the challenge of analyzing large volumes of video surveillance data by developing a tool that uses Vision Language Models to generate customized textual summaries from CCTV footage, reporting 80% and 70% accuracy in qualitative evaluations of temporal and spatial quality, respectively.

The rapid increase in video content production has resulted in enormous data volumes, creating significant challenges for efficient analysis and resource management. To address this, robust video analysis tools are essential. This paper presents an innovative proof of concept using Generative Artificial Intelligence (GenAI) in the form of Vision Language Models to enhance the downstream video analysis process. Our tool generates customized textual summaries based on user-defined queries, providing focused insights within extensive video datasets. Unlike traditional methods that offer generic summaries or limited action recognition, our approach utilizes Vision Language Models to extract relevant information, improving analysis precision and efficiency. The proposed method produces textual summaries from extensive CCTV footage, which can then be stored for an indefinite time in a very small storage space compared to videos, allowing users to quickly navigate and verify significant events without exhaustive manual review. Qualitative evaluations result in 80% and 70% accuracy in temporal and spatial quality and consistency of the pipeline respectively.
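The query-driven summarization idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the fixed-length segmentation, the `SegmentSummary` record, the `summarize_footage` and `search` helpers, and the `fake_describe` stand-in for a Vision Language Model call are all hypothetical names and assumptions introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SegmentSummary:
    start_s: float  # segment start time in seconds
    end_s: float    # segment end time in seconds
    text: str       # textual summary produced for this segment


def summarize_footage(
    duration_s: float,
    segment_s: float,
    query: str,
    describe: Callable[[float, float, str], str],
) -> List[SegmentSummary]:
    """Split footage into fixed windows and ask `describe` about each one.

    `describe` is a placeholder for a Vision Language Model call (assumption):
    it receives the window bounds and the user-defined query, returns text.
    """
    if segment_s <= 0:
        raise ValueError("segment_s must be positive")
    summaries: List[SegmentSummary] = []
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        summaries.append(SegmentSummary(start, end, describe(start, end, query)))
        start = end
    return summaries


def search(summaries: List[SegmentSummary], keyword: str) -> List[SegmentSummary]:
    """Return the stored summaries whose text mentions `keyword`."""
    needle = keyword.lower()
    return [s for s in summaries if needle in s.text.lower()]


def fake_describe(start_s: float, end_s: float, query: str) -> str:
    # Hypothetical stand-in for a real model: flags one window as eventful.
    return "a person enters the lobby" if start_s == 60.0 else "no activity"
```

Calling `summarize_footage(150.0, 60.0, "who enters?", fake_describe)` would yield three timestamped text records, and `search(..., "person")` would return only the window starting at 60 seconds. Only these short text records need to be retained, which is the storage and navigation benefit the abstract describes.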
