CV

May 4, 2022

TransRank: Self-supervised Video Representation Learning via Ranking-based Transformation Recognition

Haodong Duan, Nanxuan Zhao, Kai Chen, Dahua Lin

Peking U

arXiv:2205.02028v1

11.727 citations

h-index: 87

Has Code

Originality: Highly original

AI Analysis

This work addresses noisy supervision in transformation-recognition pre-training, a bottleneck in video self-supervised learning. It offers a ranking-based method that improves performance on action recognition and video retrieval, though it builds on an existing paradigm rather than introducing a new one.

The paper tackles noisy supervision in self-supervised video representation learning by proposing TransRank, a ranking-based framework for transformation recognition. Under the same setting, it surpasses the previous state-of-the-art method by 6.4% on UCF101 and 8.3% on HMDB51 for action recognition.

Recognizing the transformation types applied to a video clip (RecogTrans) is a long-established paradigm for self-supervised video representation learning, but in recent works it performs much worse than instance discrimination approaches (InstDisc). However, based on a thorough comparison of representative RecogTrans and InstDisc methods, we observe the great potential of RecogTrans on both semantic-related and temporal-related downstream tasks. Because they rely on hard-label classification, existing RecogTrans approaches suffer from noisy supervision signals in pre-training. To mitigate this problem, we develop TransRank, a unified framework for recognizing Transformations in a Ranking formulation. TransRank provides accurate supervision signals by recognizing transformations relatively, and it consistently outperforms the classification-based formulation. Meanwhile, the unified framework can be instantiated with an arbitrary set of temporal or spatial transformations, demonstrating good generality. With a ranking-based formulation and several empirical practices, we achieve competitive performance on video retrieval and action recognition. Under the same setting, TransRank surpasses the previous state-of-the-art method by 6.4% on UCF101 and 8.3% on HMDB51 for action recognition (Top-1 accuracy), and improves video retrieval on UCF101 by 20.4% (R@1). These promising results validate that RecogTrans is still a paradigm worth exploring for video self-supervised learning. Code will be released at https://github.com/kennymckormick/TransRank.
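The contrast between hard-label classification and "recognizing transformations relatively" can be illustrated with a small sketch. The abstract does not give the loss, so everything below is an assumption made for illustration: the function names `cross_entropy` and `ranking_loss`, the margin value, and the pairwise hinge form are hypothetical and are not taken from the TransRank code. The idea shown is that a clip is only asked to score higher on its own transformation than other clips from the same video do, rather than to predict an absolute label.

```python
import math


def cross_entropy(scores, label):
    """Hard-label softmax cross-entropy for one clip (classification baseline)."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[label]


def ranking_loss(score_matrix, margin=1.0):
    """Pairwise hinge ranking loss over clips drawn from the SAME video.

    score_matrix[i][k] is the model's score for transformation k on the clip
    that actually received transformation i. For every transformation k, the
    clip that received k should score at least `margin` higher on k than each
    other clip does. (Assumed form, for illustration only.)
    """
    n = len(score_matrix)
    total, pairs = 0.0, 0
    for k in range(n):
        for j in range(n):
            if j == k:
                continue
            gap = score_matrix[k][k] - score_matrix[j][k]
            total += max(0.0, margin - gap)
            pairs += 1
    return total / pairs


# Hypothetical scores for an intrinsically fast video with two playback
# speeds (index 0 = 1x, index 1 = 2x). Both clips look "fast" in absolute terms.
scores = [
    [2.0, 3.0],  # clip played at 1x
    [0.5, 4.5],  # clip played at 2x
]

# The hard label penalizes the 1x clip because it looks fast in absolute terms.
print(cross_entropy(scores[0], 0))
# The relative comparison is already satisfied, so the ranking loss is zero.
print(ranking_loss(scores))
```

In this toy case the classification loss on the 1x clip is large even though the model orders the two clips correctly, which is one way to read the "noisy supervision signals" the abstract attributes to hard labels.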
