Organizing Multimedia Data in Video Surveillance Systems Based on Face Verification with Convolutional Neural Networks
This work addresses the need for efficient organization of multimedia data in video surveillance systems, though it is incremental as it builds on existing face verification and clustering methods.
The paper tackles the problem of organizing video surveillance data by grouping face sequences from video frames using face verification and clustering. The most accurate and fastest solution is obtained by matching normalized average feature vectors extracted with deep convolutional neural networks.
In this paper we propose a two-stage approach to organizing information in video surveillance systems. First, faces are detected in each frame, and the video stream is split into sequences of frames containing the face region of a single person. Second, the sequences (tracks) that contain identical faces are grouped using face verification algorithms and hierarchical agglomerative clustering. Gender and age are estimated for each cluster (person) to facilitate use of the organized video collection. Particular attention is paid to the aggregation of features extracted from each frame with deep convolutional neural networks. Experimental results on the YTF and IJB-A datasets demonstrate that the most accurate and fastest solution is obtained by matching the normalized average of the feature vectors of all frames in a track.
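
The second stage described above can be illustrated with a minimal sketch. It assumes that per-frame feature vectors have already been extracted by some face CNN, and it is not the authors' implementation. The function names (`track_descriptor`, `cluster_tracks`), the Euclidean distance, the average linkage rule, and the `threshold` parameter are all assumptions made for illustration; the abstract specifies only that tracks are matched by the normalized average of their frame features and grouped by hierarchical agglomerative clustering.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def track_descriptor(frame_features):
    """Aggregate one track: average the per-frame feature vectors, then L2-normalize."""
    n = len(frame_features)
    dim = len(frame_features[0])
    mean = [sum(f[i] for f in frame_features) / n for i in range(dim)]
    return l2_normalize(mean)


def distance(a, b):
    """Euclidean distance between two descriptors (an assumed dissimilarity measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_tracks(descriptors, threshold):
    """Agglomerative clustering with average linkage (assumed linkage rule).

    Repeatedly merges the two closest clusters and stops once the closest
    pair is farther apart than `threshold` (a hypothetical tuning parameter).
    Returns a list of clusters, each a list of track indices.
    """
    clusters = [[i] for i in range(len(descriptors))]

    def linkage(c1, c2):
        total = sum(distance(descriptors[i], descriptors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        d, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if d > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy 2-D "features": tracks 0 and 1 point in a similar direction, track 2 does not.
tracks = [
    [[1.0, 0.1], [0.9, 0.0]],
    [[2.0, 0.1], [1.8, 0.0]],
    [[0.0, 1.0], [0.1, 0.9]],
]
descriptors = [track_descriptor(t) for t in tracks]
groups = cluster_tracks(descriptors, threshold=0.5)
```

Because each track is reduced to a single unit-length vector, comparing two tracks costs one distance computation regardless of how many frames they contain, which is consistent with the speed advantage the abstract reports for this aggregation.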