Complementary Boundary Generator with Scale-Invariant Relation Modeling for Temporal Action Localization: Submission to ActivityNet Challenge 2020
This work addresses the problem of accurately localizing and classifying actions in untrimmed videos for video analysis applications. It represents an incremental improvement over existing methods.
The paper tackles temporal action localization by decoupling it into proposal generation and classification stages and by enriching proposal diversity through multiple complementary components. It achieves state-of-the-art performance with 42.26 average mAP on the ActivityNet Challenge 2020 testing set.
This technical report presents an overview of our solution submitted to ActivityNet Challenge 2020 Task 1 (\textbf{temporal action localization/detection}). Temporal action localization requires not only precisely locating the temporal boundaries of action instances, but also accurately classifying the untrimmed videos into specific categories. In this report, we decouple the temporal action localization task into two stages (i.e., proposal generation and classification) and enrich the proposal diversity by exhaustively exploring the influence of multiple components from different but complementary perspectives. Specifically, to generate high-quality proposals, we consider several factors: the video feature encoder, the proposal generator, the proposal-proposal relations, the scale imbalance, and the ensemble strategy. Finally, to obtain accurate detections, we further train a video classifier to recognize the generated proposals. Our proposed scheme achieves state-of-the-art performance on the temporal action localization task with \textbf{42.26} average mAP on the challenge testing set.
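The decoupled two-stage scheme described above can be illustrated with a minimal Python sketch. It is not the submission's implementation: the names `temporal_iou` and `label_proposals`, the `top_k` parameter, and the rule of multiplying a proposal's confidence by a video-level class probability are all assumptions made for illustration. Temporal IoU is included because it is the overlap measure commonly used to judge how well a predicted segment matches an annotated one.

```python
from typing import Dict, List, Tuple

# Hypothetical types for illustration only.
Proposal = Tuple[float, float, float]        # (start_sec, end_sec, confidence)
Detection = Tuple[float, float, str, float]  # (start_sec, end_sec, label, score)


def temporal_iou(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Intersection over union of two temporal segments (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def label_proposals(
    proposals: List[Proposal],
    class_scores: Dict[str, float],
    top_k: int = 2,
) -> List[Detection]:
    """Stage 2 sketch: attach the top-k video-level classes to every proposal.

    Assumption: the detection score is proposal confidence times class
    probability. The report itself does not specify the fusion rule.
    """
    top = sorted(class_scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    detections = [
        (start, end, label, conf * prob)
        for (start, end, conf) in proposals
        for (label, prob) in top
    ]
    # Highest-scoring detections first.
    return sorted(detections, key=lambda d: d[3], reverse=True)
```

In this sketch, stage 1 (any proposal generator) supplies `proposals` and stage 2 (any video classifier) supplies `class_scores`, so either stage can be swapped or ensembled without touching the other.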