A flexible model for training action localization with varying levels of supervision
This work makes video action localization more scalable for researchers and practitioners by enabling training with less tedious annotation. The contribution is incremental in that it builds on existing weak-supervision methods.
The authors reduce the manual annotation effort required for spatio-temporal action detection by proposing a flexible framework that integrates varying levels of weak supervision, such as video-level labels. The framework achieves competitive performance on the UCF101-24 and DALY datasets with significantly less supervision than previous methods.
Spatio-temporal action detection in videos is typically addressed in a fully supervised setup that requires manual annotation of every frame of the training videos. Since such annotation is extremely tedious and prohibits scalability, there is a clear need to minimize the amount of manual supervision. In this work we propose a unifying framework that can handle and combine varying types of less demanding weak supervision. Our model is based on discriminative clustering and integrates different types of supervision as constraints on the optimization. We investigate applications of this model to training setups with alternative supervisory signals, ranging from video-level class labels to full per-frame annotation of action bounding boxes. Experiments on the challenging UCF101-24 and DALY datasets demonstrate competitive performance of our method with a fraction of the supervision used by previous methods. The flexibility of our model enables joint learning from data with different levels of annotation. Experimental results demonstrate a significant gain from adding a few fully supervised examples to otherwise weakly labeled videos.
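The idea of expressing supervision as constraints on the optimization can be illustrated with a toy example. The sketch below is not the model described above: it only shows, under simplifying assumptions, how a video-level class label can restrict the set of feasible label assignments for the candidate tracks of one video. The function name `assign_tracks`, the `BACKGROUND` index, and the cost matrix are all hypothetical, and the video is assumed to contain at least one track.

```python
from typing import List, Optional

BACKGROUND = 0  # hypothetical index of the "no action" class


def assign_tracks(costs: List[List[float]],
                  video_label: Optional[int] = None) -> List[int]:
    """Pick one class per track so that the total cost is minimal.

    costs[i][k] is the cost of giving track i the class k (lower is better).
    With a video-level label, every track must be either background or the
    labeled class, and at least one track must take the labeled class.
    """
    if video_label is None:
        # No supervision: each track independently takes its cheapest class.
        return [min(range(len(row)), key=row.__getitem__) for row in costs]

    # Supervision as a constraint: only two classes remain feasible.
    allowed = (BACKGROUND, video_label)
    labels = [min(allowed, key=row.__getitem__) for row in costs]

    if video_label not in labels:
        # "At least one" constraint is violated: switch the track for which
        # moving from background to the labeled class costs the least.
        flip = min(
            range(len(costs)),
            key=lambda i: costs[i][video_label] - costs[i][BACKGROUND],
        )
        labels[flip] = video_label
    return labels


# Three tracks, three classes (0 = background); the numbers are made up.
costs = [[0.1, 0.9, 0.5],
         [0.2, 0.8, 0.3],
         [0.0, 1.0, 0.9]]
unconstrained = assign_tracks(costs)
with_video_label = assign_tracks(costs, video_label=2)
```

Without the constraint every track falls back to background, while the video-level label forces the cheapest track to carry the action. In a discriminative clustering setting the costs would presumably come from the clustering objective rather than being fixed by hand, and stronger annotation (for example, boxes on some frames) would shrink the feasible set further.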