Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks
This work addresses activity detection in videos, which is important for applications such as surveillance and video analysis. Its contribution is incremental, since it builds on existing neural network methods.
This thesis tackled the problem of detecting and temporally localizing activities in untrimmed videos by using a combination of 3D Convolutional Neural Networks for feature extraction and Recurrent Neural Networks for classification and localization, achieving competitive results in the ActivityNet Challenge 2016.
This thesis explores different approaches using Convolutional and Recurrent Neural Networks to classify and temporally localize activities in videos, and proposes an implementation to achieve this. As a first step, features are extracted from video frames using a state-of-the-art 3D Convolutional Neural Network. These features are fed into a recurrent neural network that solves the activity classification and temporal localization tasks in a simple and flexible way. Different architectures and configurations have been tested in order to achieve the best performance and learning on the provided video dataset. In addition, different kinds of post-processing of the trained network's output have been studied to improve the temporal localization of activities in the videos. The results produced by the neural network developed in this thesis were submitted to the ActivityNet Challenge 2016 at CVPR, achieving competitive results with a simple and flexible architecture.
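The abstract does not say which post-processing steps were studied. As an illustration only, the sketch below shows one common way to turn per-clip activity probabilities, such as a recurrent network might output, into temporal segments: smooth the scores with a moving average, then merge consecutive clips above a threshold. The function names `smooth` and `segments`, the window size `k`, the `threshold`, and the `clip_seconds` duration are all hypothetical choices made for this example and are not taken from the thesis.

```python
from typing import List, Tuple


def smooth(probs: List[float], k: int = 5) -> List[float]:
    """Centered moving average of window k (assumed value), truncated at the edges."""
    half = k // 2
    out = []
    for i in range(len(probs)):
        window = probs[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def segments(probs: List[float], threshold: float = 0.5,
             clip_seconds: float = 1.0) -> List[Tuple[float, float]]:
    """Merge consecutive clips with score >= threshold into (start, end) times.

    threshold and clip_seconds are illustrative assumptions, not thesis values.
    """
    result = []
    start = None  # index of the first clip of the currently open segment
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            result.append((start * clip_seconds, i * clip_seconds))
            start = None
    if start is not None:
        # The activity runs until the end of the video.
        result.append((start * clip_seconds, len(probs) * clip_seconds))
    return result
```

For example, `segments(smooth(scores, k=3))` would suppress an isolated single-clip spike while keeping longer runs of high scores. In practice the window size and threshold would need to be tuned on validation data.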