Learning Temporal Action Proposals With Fewer Labels
This work addresses the high annotation cost of action detection in video, offering an incremental improvement: it reduces label requirements while matching or outperforming fully supervised state-of-the-art approaches.
The paper tackles the problem of training temporal action proposal networks with limited labeled data. It proposes a semi-supervised learning algorithm that, when only a small number of labels are available, generates significantly better proposals than its fully supervised counterpart and other strong semi-supervised baselines, as validated on the ActivityNet v1.3 and THUMOS14 datasets.
Temporal action proposal generation is a common module in action detection pipelines today. Most current methods for training action proposal modules rely on fully supervised approaches that require large amounts of annotated temporal action intervals in long video sequences. The high annotation cost and effort that this entails motivate us to study the problem of training proposal modules with less supervision. In this work, we propose a semi-supervised learning algorithm specifically designed for training temporal action proposal networks. When only a small number of labels are available, our semi-supervised method generates significantly better proposals than its fully supervised counterpart and other strong semi-supervised baselines. We validate our method on two challenging action detection video datasets, ActivityNet v1.3 and THUMOS14. We show that our semi-supervised approach consistently matches or outperforms fully supervised state-of-the-art approaches.