CVMay 28, 2019

Progressive Cross-Stream Cooperation in Spatial and Temporal Domain for Action Localization

arXiv:1905.11575v232 citations
Originality Incremental advance
AI Analysis

This work addresses action localization in videos, which is important for applications like surveillance and human-computer interaction, but it appears incremental as it builds on existing multi-stream methods.

The paper tackles spatio-temporal action localization by proposing a Progressive Cross-Stream Cooperation (PCSC) framework that iteratively improves spatial and temporal localization and action classification using RGB and Flow streams, achieving enhanced results on UCF-101-24 and J-HMDB datasets.

Spatio-temporal action localization consists of three levels of tasks: spatial localization, action classification, and temporal localization. In this work, we propose a new progressive cross-stream cooperation (PCSC) framework that improves all three tasks above. The basic idea is to utilize both spatial region (resp., temporal segment proposals) and features from one stream (i.e., the Flow/RGB stream) to help another stream (i.e., the RGB/Flow stream) to iteratively generate better bounding boxes in the spatial domain (resp., temporal segments in the temporal domain). In this way, not only the actions could be more accurately localized both spatially and temporally, but also the action classes could be predicted more precisely. Specifically, we first combine the latest region proposals (for spatial detection) or segment proposals (for temporal localization) from both streams to form a larger set of labelled training samples to help learn better action detection or segment detection models. Second, to learn better representations, we also propose a new message passing approach to pass information from one stream to another stream, which also leads to better action detection and segment detection models. By first using our newly proposed PCSC framework for spatial localization at the frame-level and then applying our temporal PCSC framework for temporal localization at the tube-level, the action localization results are progressively improved at both the frame level and the video level. Comprehensive experiments on two benchmark datasets UCF-101-24 and J-HMDB demonstrate the effectiveness of our newly proposed approaches for spatio-temporal action localization in realistic scenarios.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes