Counting Grid Aggregation for Event Retrieval and Recognition
This work addresses the need for efficient video analysis in applications such as content search, though it appears incremental because it builds on existing aggregation techniques.
The paper tackles event retrieval and recognition in videos by proposing a spatially consistent counting grid model that aggregates deep features across frames, and it reports significantly better accuracy with more compact representations than existing methods.
Event retrieval and recognition in a large corpus of videos necessitates a holistic fixed-size visual representation at the video clip level that is comprehensive, compact, and yet discriminative. It should comprehensively aggregate information across relevant video frames while suppressing redundant information, leading to a compact representation that can effectively differentiate among visual events. In search of such a representation, we propose to build a spatially consistent counting grid model to aggregate deep features extracted from different video frames. The spatial consistency of the counting grid model is achieved by introducing a prior model estimated from a large corpus of video data. The counting grid model produces an intermediate tensor representation for each video, which automatically identifies and removes feature redundancy across frames. The tensor representation is subsequently reduced to a fixed-size vector representation by averaging over the counting grid. When compared to existing methods on both event retrieval and event classification benchmarks, we achieve significantly better accuracy with a much more compact representation.
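The pipeline described above (per-frame deep features, placement on a grid guided by a prior, an intermediate tensor, then averaging to a fixed-size vector) can be illustrated with a heavily simplified sketch. This is not the paper's counting grid model: it replaces window-based inference with a hard assignment of each frame to its single best-scoring grid cell, and it averages only over cells that received at least one frame. The function name `aggregate_video`, the argument `grid_prior`, and its layout (H, W, D) are assumptions made for illustration.

```python
import numpy as np

def aggregate_video(frame_features, grid_prior, eps=1e-8):
    """Simplified, hypothetical stand-in for grid-based aggregation.

    frame_features: (T, D) non-negative features, one row per frame.
    grid_prior:     (H, W, D) assumed pre-estimated prior; each cell holds
                    a distribution over the D feature dimensions.
    Returns the (H, W, D) intermediate tensor and a (D,) video vector.
    """
    frame_features = np.asarray(frame_features, dtype=float)
    H, W, D = grid_prior.shape
    num_frames = frame_features.shape[0]

    # Score every frame against every cell: sum_d f_d * log(pi_cell,d).
    log_prior = np.log(grid_prior + eps)
    scores = np.einsum("td,hwd->thw", frame_features, log_prior)

    # Hard-assign each frame to its best cell (an assumption of this sketch).
    best_cell = scores.reshape(num_frames, H * W).argmax(axis=1)

    # Accumulate frames per cell; frames landing in the same cell are
    # collapsed to their mean, so repeated content is counted once.
    tensor = np.zeros((H * W, D))
    counts = np.zeros(H * W)
    np.add.at(tensor, best_cell, frame_features)
    np.add.at(counts, best_cell, 1)
    occupied = counts > 0
    tensor[occupied] /= counts[occupied, None]

    # Reduce the tensor to a fixed-size vector by averaging occupied cells.
    video_vector = tensor[occupied].mean(axis=0)
    return tensor.reshape(H, W, D), video_vector
```

In this sketch, a video with three near-identical frames and one distinct frame yields a vector that weights the two kinds of content equally, whereas plain frame averaging would weight the repeated content three times as heavily. That is the intuition behind removing redundancy before pooling; the actual model and its prior estimation are more involved than shown here.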