CV | Oct 30, 2024

PV-VTT: A Privacy-Centric Dataset for Mission-Specific Anomaly Detection and Natural Language Interpretation

Ryozo Masukawa, Sanggeon Yun, Yoshiki Yamaguchi, Mohsen Imani

arXiv:2410.22623v2 | 7.65 citations | h-index: 9 | WACV

Originality: Incremental advance

AI Analysis

This paper addresses a privacy-centric anomaly detection problem for computer vision researchers. The advance is incremental, since it builds on existing video-text datasets and methods.

The authors tackled the lack of datasets for detecting privacy violations as precursors to crimes by introducing PV-VTT, a multimodal dataset with video feature vectors and text annotations, and proposed a GNN-based model that reduces LLM input tokens by 50% while maintaining descriptive quality.

Video crime detection is a significant application of computer vision and artificial intelligence. However, existing datasets primarily focus on detecting severe crimes by analyzing entire video clips, often neglecting the precursor activities (i.e., privacy violations) whose detection could help prevent these crimes. To address this limitation, we present PV-VTT (Privacy Violation Video To Text), a unique multimodal dataset aimed at identifying privacy violations. PV-VTT provides detailed video and text annotations for each scenario. To ensure the privacy of individuals in the videos, we provide only video feature vectors and do not release any raw video data. This privacy-focused approach allows researchers to use the dataset while protecting participant confidentiality. Recognizing that privacy violations are often ambiguous and context-dependent, we propose a Graph Neural Network (GNN)-based video description model. Our model generates a GNN-based prompt, paired with an image, for a Large Language Model (LLM), which delivers cost-effective and high-quality video descriptions. By leveraging a single video frame along with relevant text, our method reduces the number of input tokens required, maintaining descriptive quality while optimizing LLM API usage. Extensive experiments validate the effectiveness and interpretability of our approach in video description tasks and the flexibility of our PV-VTT dataset.
