Incorporating Scalability in Unsupervised Spatio-Temporal Feature Learning
This work addresses the tedious task of acquiring labeled video data for computer vision and offers an incremental improvement in unsupervised feature learning.
The paper tackles the problem of learning spatio-temporal features from unlabeled videos to reduce reliance on manual labeling, proposing a simple Convolutional 3D Siamese network trained with mined pairs, and shows it learns weights transferable across datasets and tasks.
Deep neural networks are efficient learning machines that leverage large amounts of manually labeled data to learn discriminative features. However, acquiring a substantial amount of labeled data, especially for videos, can be tedious across various computer vision tasks. This motivates learning visual features from videos in an unsupervised setting. In this paper, we propose a computationally simple yet effective framework to learn spatio-temporal feature embeddings from unlabeled videos. We train a Convolutional 3D Siamese network using positive and negative pairs mined from videos under certain probabilistic assumptions. Experimental results on three datasets demonstrate that the proposed framework learns weights that can be used within the same dataset as well as across datasets and tasks.
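To make the pair-mining idea concrete, the following is a minimal sketch and not the paper's actual procedure. The abstract does not state the probabilistic assumptions or the training loss, so the sketch assumes that two clips close in time within one video are likely to be similar (positive) and that clips from two different videos are likely to be dissimilar (negative). It pairs this with a standard margin-based contrastive loss, which is a common choice for Siamese networks but is an assumption here. The names `mine_pairs`, `contrastive_loss`, `clip_len`, `max_gap`, and `margin` are hypothetical.

```python
import math
import random


def mine_pairs(video_lengths, clip_len=16, max_gap=8, n_pairs=4, seed=0):
    """Sample clip pairs; each clip is a (video_index, start_frame) tuple.

    Assumptions (not from the paper): temporally close clips of one video
    form a positive pair; clips from two different videos form a negative pair.
    """
    rng = random.Random(seed)
    # Only videos long enough to contain at least one full clip are usable.
    eligible = [v for v, n in enumerate(video_lengths) if n >= clip_len]
    if len(eligible) < 2:
        raise ValueError("need at least two videos of length >= clip_len")

    positives, negatives = [], []
    for _ in range(n_pairs):
        # Positive pair: same video, start frames at most max_gap apart.
        v = rng.choice(eligible)
        last_start = video_lengths[v] - clip_len
        s1 = rng.randint(0, last_start)
        s2 = min(last_start, s1 + rng.randint(0, max_gap))
        positives.append(((v, s1), (v, s2)))

        # Negative pair: two distinct videos, arbitrary start frames.
        a, b = rng.sample(eligible, 2)
        clip_a = (a, rng.randint(0, video_lengths[a] - clip_len))
        clip_b = (b, rng.randint(0, video_lengths[b] - clip_len))
        negatives.append((clip_a, clip_b))
    return positives, negatives


def contrastive_loss(emb1, emb2, is_positive, margin=1.0):
    """Margin-based contrastive loss on two embedding vectors.

    Positive pairs are pulled together (squared distance); negative pairs
    are pushed apart until they are at least `margin` away.
    """
    d = math.dist(emb1, emb2)  # Euclidean distance between the embeddings
    if is_positive:
        return d ** 2
    return max(0.0, margin - d) ** 2
```

In a full pipeline, each mined clip would be passed through the two weight-sharing branches of the 3D convolutional network, and the loss would be computed on the resulting embeddings. The sketch covers only the sampling and loss steps, since the abstract gives no architectural details.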