t-EVA: Time-Efficient t-SNE Video Annotation
This work addresses the high cost of annotating large-scale video datasets, a problem faced by researchers and practitioners in video understanding.
The paper proposes t-EVA, a time-efficient video annotation method that uses spatio-temporal feature similarity and t-SNE dimensionality reduction. The method places clips of similar actions from different videos near each other in a 2D space, so annotators can label whole groups of clips at once. This significantly speeds up annotation while maintaining test accuracy on video classification.
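To make the projection step concrete, here is a minimal exact t-SNE sketch in pure Python. It is not the authors' implementation: the per-clip feature vectors are hypothetical stand-ins for spatio-temporal clip descriptors, similarity is assumed to be squared Euclidean distance, and the optimizer is plain gradient descent without the early exaggeration and momentum of the original t-SNE recipe. Real pipelines would typically call a library implementation such as scikit-learn's `sklearn.manifold.TSNE` instead.

```python
import math
import random


def pairwise_sq_dists(points):
    """Squared Euclidean distance between every pair of feature vectors."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points] for p in points]


def conditional_affinities(dist_row, i, perplexity, tol=1e-5, max_iter=100):
    """p_{j|i}: Gaussian affinities whose entropy matches log(perplexity).

    Binary search over beta = 1 / (2 * sigma_i^2)."""
    target = math.log(perplexity)
    d_min = min(d for j, d in enumerate(dist_row) if j != i)
    beta, lo, hi = 1.0, 0.0, math.inf
    for _ in range(max_iter):
        # Shift by d_min so at least one weight is exp(0) = 1 (avoids underflow).
        w = [0.0 if j == i else math.exp(-(d - d_min) * beta) for j, d in enumerate(dist_row)]
        total = sum(w)
        p = [x / total for x in w]
        entropy = -sum(x * math.log(x) for x in p if x > 0)
        if abs(entropy - target) < tol:
            break
        if entropy > target:  # too flat: narrow the Gaussian
            lo = beta
            beta = beta * 2 if hi == math.inf else (beta + hi) / 2
        else:  # too peaked: widen the Gaussian
            hi = beta
            beta = (beta + lo) / 2
    return p


def tsne_2d(features, perplexity=3.0, n_iter=500, lr=1.0, seed=0):
    """Project clip features to 2D by minimizing KL(P || Q) with gradient descent."""
    n = len(features)
    dists = pairwise_sq_dists(features)
    cond = [conditional_affinities(dists[i], i, perplexity) for i in range(n)]
    # Symmetrized joint affinities: P_ij = (p_{j|i} + p_{i|j}) / (2n).
    P = [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    Y = [[rng.gauss(0.0, 1e-4), rng.gauss(0.0, 1e-4)] for _ in range(n)]
    for _ in range(n_iter):
        # Student-t kernel (1 + ||y_i - y_j||^2)^-1 in the 2D map.
        num = [[0.0 if i == j else
                1.0 / (1.0 + (Y[i][0] - Y[j][0]) ** 2 + (Y[i][1] - Y[j][1]) ** 2)
                for j in range(n)] for i in range(n)]
        z = sum(map(sum, num))
        grads = []
        for i in range(n):
            gx = gy = 0.0
            for j in range(n):
                # dC/dy_i = 4 * sum_j (P_ij - Q_ij) * num_ij * (y_i - y_j)
                coeff = 4.0 * (P[i][j] - num[i][j] / z) * num[i][j]
                gx += coeff * (Y[i][0] - Y[j][0])
                gy += coeff * (Y[i][1] - Y[j][1])
            grads.append((gx, gy))
        Y = [[y[0] - lr * g[0], y[1] - lr * g[1]] for y, g in zip(Y, grads)]
    return Y
```

Clips whose features are similar end up close together in the returned 2D coordinates, which is the property a group-labeling interface relies on. The exact O(n²) loops keep the sketch readable but only suit small demos.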
Video understanding has received more attention in the past few years due to the availability of several large-scale video datasets. However, annotating large-scale video datasets is cost-intensive. In this work, we propose a time-efficient video annotation method that uses spatio-temporal feature similarity and t-SNE dimensionality reduction to greatly speed up the annotation process. Placing the same actions from different videos near each other in the two-dimensional space, based on feature similarity, lets the annotator label groups of video clips at once. We evaluate our method on two subsets of ActivityNet (v1.3) and a subset of the Sports-1M dataset. We show that t-EVA can outperform other video annotation tools while maintaining test accuracy on video classification.
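The group-labeling step can be sketched as follows, assuming each clip already has precomputed 2D t-SNE coordinates (hard-coded here) and that the annotator selects a rectangular region of the map. The `Clip` record, the `group_label` helper, the rectangle-based selection, and the label names are all hypothetical illustrations, not t-EVA's actual interface.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Clip:
    clip_id: str
    x: float  # 2D t-SNE coordinate (assumed precomputed)
    y: float
    label: Optional[str] = None


def group_label(clips: List[Clip], x_min: float, x_max: float,
                y_min: float, y_max: float, label: str) -> int:
    """Assign `label` to every still-unlabeled clip inside the selected rectangle.

    Returns the number of clips labeled by this single annotator action."""
    count = 0
    for clip in clips:
        inside = x_min <= clip.x <= x_max and y_min <= clip.y <= y_max
        if inside and clip.label is None:  # never overwrite an earlier label
            clip.label = label
            count += 1
    return count
```

For example, with clips at (1.0, 1.2), (1.1, 0.9), and (5.0, 5.2), selecting the box from (0, 0) to (2, 2) labels the first two clips in one action and leaves the third untouched. Skipping already-labeled clips is a design choice so that an overlapping later selection cannot silently overwrite earlier work.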