Video alignment using unsupervised learning of local and global features
This work addresses the problem of aligning videos that show similar actions, a task of interest to researchers and practitioners in computer vision. It offers an unsupervised approach that needs no training data, although the contribution is incremental, since it builds on existing feature extraction and dynamic time warping techniques.
The paper tackles video alignment, that is, matching the frames of videos that contain similar actions. Its unsupervised method combines global and local frame features obtained from person detection, pose estimation, and a VGG network, and aligns the resulting time series with a novel Diagonalized Dynamic Time Warping (DDTW) algorithm. The authors report that it outperforms state-of-the-art methods such as TCC on the Penn Action dataset and a subset of UCF101.
In this paper, we tackle the problem of video alignment, the process of matching the frames of a pair of videos containing similar actions. The main challenge in video alignment is that accurate correspondence must be established despite differences in execution and appearance between the two videos. We introduce an unsupervised alignment method that uses global and local features of the frames. In particular, we extract effective features for each video frame by means of three machine vision tools: person detection, pose estimation, and a VGG network. The features are then processed and combined to construct a multidimensional time series that represents the video. The resulting time series are used to align videos of the same action using a novel version of dynamic time warping named Diagonalized Dynamic Time Warping (DDTW). The main advantage of our approach is that no training is required, which makes it applicable to any new type of action without the need to collect training samples for it. Additionally, our approach can be used for framewise labeling of action phases in a dataset with only a few labeled videos. For evaluation, we consider the video synchronization and phase classification tasks on the Penn Action dataset and a subset of the UCF101 dataset. Also, for an effective evaluation of the video synchronization task, we present a new metric called Enclosed Area Error (EAE). The results show that our method outperforms previous state-of-the-art methods, such as TCC, as well as other self-supervised and weakly supervised methods.
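To make the alignment step concrete, the following is a minimal sketch of classic dynamic time warping applied to two multidimensional time series, one row per frame. It is not the DDTW variant introduced in the paper, whose diagonal modification is not specified in this abstract. The function name `dtw_align`, the Euclidean frame distance, and the toy inputs are assumptions made purely for illustration.

```python
import numpy as np


def dtw_align(x, y):
    """Classic DTW (not the paper's DDTW) between two frame-feature series.

    x: array of shape (n, d), y: array of shape (m, d).
    Returns (total_cost, path), where path is a list of (i, j) frame pairs.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)

    # Assumed frame distance: Euclidean distance between feature vectors.
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)

    # Accumulated cost with a padded border of infinities.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # advance in both videos
                acc[i - 1, j],      # advance in x only
                acc[i, j - 1],      # advance in y only
            )

    # Backtrack from the last frame pair to the first.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min(
            (acc[i - 1, j - 1], i - 1, j - 1),
            (acc[i - 1, j], i - 1, j),
            (acc[i, j - 1], i, j - 1),
        )
        path.append((i - 1, j - 1))
    path.reverse()
    return float(acc[n, m]), path


# Toy example: the second "video" holds the middle pose for one extra frame.
video_a = [[0.0], [1.0], [2.0]]
video_b = [[0.0], [1.0], [1.0], [2.0]]
cost, path = dtw_align(video_a, video_b)
```

In this toy case the returned path pairs frame 1 of the first series with both frames 1 and 2 of the second, which is the kind of frame correspondence that video synchronization relies on. A real pipeline would replace the one-dimensional toy features with the combined global and local frame features described above.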