Temporal Context Network for Activity Localization in Videos
This work addresses the challenge of accurately detecting activity boundaries in videos for applications such as video analysis and surveillance, and represents an incremental improvement over existing methods.
The paper tackles the problem of precise temporal localization of human activities in videos by introducing a Temporal Context Network (TCN) that uses a novel representation to rank proposals, outperforming state-of-the-art methods on ActivityNet and THUMOS14 datasets.
We present a Temporal Context Network (TCN) for precise temporal localization of human activities. Similar to the Faster-RCNN architecture, proposals that span multiple temporal scales are placed at equal intervals in a video. We propose a novel representation for ranking these proposals. Since pooling features only inside a segment is not sufficient to predict activity boundaries, we construct a representation that explicitly captures the context around a proposal in order to rank it. For each temporal segment inside a proposal, features are uniformly sampled at a pair of scales and are input to a temporal convolutional neural network for classification. After the proposals are ranked, non-maximum suppression is applied and classification is performed to obtain the final detections. TCN outperforms state-of-the-art methods on the ActivityNet and THUMOS14 datasets.
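The proposal pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `stride` and `context_factor` parameters, and the reading of "a pair of scales" as the proposal itself plus an enlarged window around it are all assumptions made for illustration. The ranking network is omitted, so the scores passed to the suppression step are placeholders.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def make_proposals(duration: float, scales: List[float], stride: float) -> List[Segment]:
    """Place proposals of every length in `scales` at equal intervals of `stride`.

    Proposals that would extend past the video boundaries are dropped
    (an assumption; the source does not say how boundaries are handled).
    """
    proposals = []
    n_anchors = int(duration // stride) + 1
    for i in range(n_anchors):
        center = i * stride
        for length in scales:
            start, end = center - length / 2, center + length / 2
            if start >= 0 and end <= duration:
                proposals.append((start, end))
    return proposals


def _linspace(a: float, b: float, n: int) -> List[float]:
    """Return n evenly spaced values from a to b inclusive."""
    if n == 1:
        return [(a + b) / 2]
    step = (b - a) / (n - 1)
    return [a + k * step for k in range(n)]


def sample_points(segment: Segment, n: int, context_factor: float) -> Tuple[List[float], List[float]]:
    """Uniformly sample n timestamps at two scales around one segment.

    The first list covers the segment itself; the second covers a window
    `context_factor` times longer with the same center (hypothetical choice).
    """
    start, end = segment
    center, length = (start + end) / 2, end - start
    half = context_factor * length / 2
    return _linspace(start, end, n), _linspace(center - half, center + half, n)


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two 1-D temporal segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(segments: List[Segment], scores: List[float], iou_threshold: float) -> List[int]:
    """Greedy non-maximum suppression; returns indices of kept segments."""
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(temporal_iou(segments[i], segments[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

In this sketch, the timestamps returned by `sample_points` would index into per-frame features that are then fed to the ranking network, and `temporal_nms` would be run on the ranked proposals before the final classification step.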