CV · Aug 22, 2019

3C-Net: Category Count and Center Loss for Weakly-Supervised Action Localization

Sanath Narayan, Hisham Cholakkal, Fahad Shahbaz Khan, Ling Shao

arXiv:1908.08216v2 · 21.0 · 174 citations · Has Code

Originality: Highly original

AI Analysis

The paper addresses the problem of reducing annotation effort for action localization in videos, though its contribution is incremental, since it builds on existing weakly-supervised approaches.

The paper tackles weakly-supervised temporal action localization by proposing 3C-Net, which uses only video-level supervision with action category labels and counts, achieving a 4.6% absolute gain in mean average precision on THUMOS14 compared to state-of-the-art methods.

Temporal action localization is a challenging computer vision problem with numerous real-world applications. Most existing methods require laborious frame-level supervision to train action localization models. In this work, we propose a framework, called 3C-Net, which only requires video-level supervision (weak supervision) in the form of action category labels and the corresponding count. We introduce a novel formulation to learn discriminative action features with enhanced localization capabilities. Our joint formulation has three terms: a classification term to ensure the separability of learned action features, an adapted multi-label center loss term to enhance the action feature discriminability and a counting loss term to delineate adjacent action sequences, leading to improved localization. Comprehensive experiments are performed on two challenging benchmarks: THUMOS14 and ActivityNet 1.2. Our approach sets a new state-of-the-art for weakly-supervised temporal action localization on both datasets. On the THUMOS14 dataset, the proposed method achieves an absolute gain of 4.6% in terms of mean average precision (mAP), compared to the state-of-the-art. Source code is available at https://github.com/naraysa/3c-net.
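The three-term joint formulation described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact equations, so everything below is an assumption made for illustration: the softmax cross-entropy over normalized labels, the squared-distance center term, the absolute-error counting term, the hypothetical weights `alpha` and `beta`, and all function names. It is not the authors' implementation, which is available in the linked repository.

```python
import math


def classification_loss(logits, labels):
    """Cross-entropy between softmax(logits) and the normalized multi-hot labels."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    total = sum(labels)
    return -sum((y / total) * (x - log_z) for x, y in zip(logits, labels) if y > 0)


def center_loss(features, centers, labels):
    """Mean squared distance between each present class's feature and its center."""
    present = [c for c, y in enumerate(labels) if y > 0]
    dists = [
        sum((f - k) ** 2 for f, k in zip(features[c], centers[c]))
        for c in present
    ]
    return sum(dists) / len(present)


def counting_loss(pred_counts, true_counts):
    """Mean absolute error between predicted and annotated counts (present classes)."""
    pairs = [(p, t) for p, t in zip(pred_counts, true_counts) if t > 0]
    return sum(abs(p - t) for p, t in pairs) / len(pairs)


def joint_loss(logits, features, centers, pred_counts, true_counts,
               alpha=1.0, beta=1.0):
    """Weighted sum of the three terms; alpha and beta are hypothetical weights.

    Assumes the video contains at least one action instance, so that
    true_counts has at least one positive entry.
    """
    labels = [1.0 if t > 0 else 0.0 for t in true_counts]
    return (
        classification_loss(logits, labels)
        + alpha * center_loss(features, centers, labels)
        + beta * counting_loss(pred_counts, true_counts)
    )
```

Here `true_counts` plays the role of the video-level supervision: a positive entry marks the category as present and gives the number of instances, so the category labels are derived from the counts and no frame-level annotation is needed.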

