Submission to ActivityNet Challenge 2019: Task B Spatio-temporal Action Localization
This work addresses spatio-temporal action localization in videos, a computer vision task, and targets the class imbalance and overfitting that limit existing methods.
The authors tackled spatio-temporal action localization by proposing an end-to-end trainable architecture using only RGB sequential images, achieving improved performance through data augmentation and label subsampling methods.
This technical report presents an overview of our system for the spatio-temporal action localization (SAL) task in the ActivityNet Challenge 2019. Unlike previous two-stream works, we focus on an end-to-end trainable architecture that uses only RGB image sequences. To this end, we employ SlowFast Networks, a previously proposed, simple yet effective two-branch network that captures both short- and long-term spatiotemporal features. Moreover, to handle severe class imbalance and overfitting, we propose a correlation-preserving data augmentation method and a random label subsampling method, both of which have been shown to reduce overfitting and improve performance.
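To make the two-branch idea concrete, the sketch below shows how a single RGB clip might be split into the inputs of a slow and a fast pathway. It is only an illustration under stated assumptions, not the system described in this report: the function names, the clip layout (frames, height, width, channels), and the default values alpha=8 and tau=16 are hypothetical choices taken from the general SlowFast design, in which the slow pathway samples frames with a large temporal stride and the fast pathway samples alpha times more densely.

```python
import numpy as np


def slowfast_frame_indices(num_frames, alpha=8, tau=16):
    """Return (slow_idx, fast_idx) frame indices for one clip.

    Assumed convention: the slow pathway uses temporal stride `tau`,
    the fast pathway uses stride `tau // alpha` (alpha times denser).
    """
    if tau % alpha != 0:
        raise ValueError("tau must be divisible by alpha")
    fast_stride = tau // alpha
    slow_idx = np.arange(0, num_frames, tau)
    fast_idx = np.arange(0, num_frames, fast_stride)
    return slow_idx, fast_idx


def split_pathways(clip, alpha=8, tau=16):
    """Split an RGB clip of shape (T, H, W, C) into slow and fast inputs."""
    slow_idx, fast_idx = slowfast_frame_indices(clip.shape[0], alpha, tau)
    return clip[slow_idx], clip[fast_idx]


if __name__ == "__main__":
    # Hypothetical 64-frame RGB clip; real inputs would be decoded video frames.
    clip = np.zeros((64, 224, 224, 3), dtype=np.uint8)
    slow, fast = split_pathways(clip)
    print(slow.shape, fast.shape)
```

With these assumed defaults, a 64-frame clip yields 4 frames for the slow pathway and 32 frames for the fast pathway, so both branches see the same time span at different temporal resolutions and no optical-flow stream is needed.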