Unsupervised Video Representation Learning by Bidirectional Feature Prediction
This work addresses the challenge of learning effective video representations without labeled data, a problem of interest to computer vision researchers. The contribution is incremental, since it builds on existing contrastive learning frameworks.
The paper tackles the problem of self-supervised video representation learning by proposing a method that uses both future and past feature prediction, rather than just future prediction, to better capture temporal structure. The result is improved performance for action recognition, outperforming methods that predict only future or past features independently.
This paper introduces a novel method for self-supervised video representation learning via feature prediction. In contrast to previous methods that focus on future feature prediction, we argue that a supervisory signal arising from unobserved past frames is complementary to one that originates from future frames. The rationale behind our method is to encourage the network to explore the temporal structure of videos by distinguishing between future and past given present observations. We train our model in a contrastive learning framework, where joint encoding of future and past provides us with a comprehensive set of temporal hard negatives via swapping. We empirically show that utilizing both signals enriches the learned representations for the downstream task of action recognition. Our method outperforms independent prediction of future and past.
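To make the idea of temporal hard negatives via swapping concrete, the following is a minimal sketch and not the authors' implementation. It assumes an InfoNCE-style loss, a joint encoding formed by simply concatenating past and future features, and a prediction vector computed from the present clip. The function names, the temperature value, and the tensor shapes are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def bidirectional_info_nce(pred, past, future, temperature=0.1):
    """InfoNCE-style loss with swapped (future, past) keys as hard negatives.

    pred:   (B, 2D) prediction of the joint encoding, made from the present.
    past:   (B, D)  features of the unobserved past clip (assumed given).
    future: (B, D)  features of the unobserved future clip (assumed given).
    """
    # Positive key: joint encoding in the correct temporal order.
    positives = np.concatenate([past, future], axis=1)
    # Hard negative key: same content with past and future swapped.
    swapped = np.concatenate([future, past], axis=1)

    keys = l2_normalize(np.concatenate([positives, swapped], axis=0))  # (2B, 2D)
    queries = l2_normalize(pred)                                       # (B, 2D)

    logits = queries @ keys.T / temperature                            # (B, 2B)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Query i must pick key i (its own correctly ordered encoding).
    idx = np.arange(pred.shape[0])
    return float(-log_prob[idx, idx].mean())


# Toy check with random features (hypothetical data, not real video features).
rng = np.random.default_rng(0)
past = rng.normal(size=(8, 16))
future = rng.normal(size=(8, 16))

loss_correct_order = bidirectional_info_nce(
    np.concatenate([past, future], axis=1), past, future)
loss_swapped_order = bidirectional_info_nce(
    np.concatenate([future, past], axis=1), past, future)
```

In this toy setup, a prediction that matches the correct temporal order yields a lower loss than one that matches the swapped order, which is the sense in which the swapped keys force the model to distinguish future from past. Encodings of other videos in the batch serve as additional, easier negatives.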