CV | May 25, 2025

Advancing Video Self-Supervised Learning via Image Foundation Models

arXiv:2505.19218v1 | 3.6 | h-index: 1 | Has Code | Pattern Recognition Letters

Originality: Incremental advance

AI Analysis

This work addresses the computational cost that researchers and practitioners face in video representation learning. It is an incremental advance, since it builds on existing image foundation models and self-supervised techniques.

The paper tackles the problem of high training overhead in video self-supervised learning by leveraging pre-trained image foundation models, achieving performance comparable to state-of-the-art methods while reducing training time by 3.4× and GPU memory usage by 8.2× on UCF101.

In the past decade, image foundation models (IFMs) have achieved unprecedented progress. However, the potential of directly using IFMs for video self-supervised representation learning has largely been overlooked. In this study, we propose an advancing video self-supervised learning (AdViSe) approach, aimed at significantly reducing the training overhead of video representation models using pre-trained IFMs. Specifically, we first introduce temporal modeling modules (ResNet3D) to IFMs, constructing a video representation model. We then employ a video self-supervised learning approach, playback rate perception, to train temporal modules while freezing the IFM components. Experiments on UCF101 demonstrate that AdViSe achieves performance comparable to state-of-the-art methods while reducing training time by $3.4\times$ and GPU memory usage by $8.2\times$. This study offers fresh insights into low-cost video self-supervised learning based on pre-trained IFMs. Code is available at https://github.com/JingwWu/advise-video-ssl.
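The playback rate perception task mentioned in the abstract is commonly framed as classification: a clip is sampled from a video at one of several frame strides, and the model predicts which stride was used. The sketch below illustrates only that sampling step, under stated assumptions. The function name `sample_playback_clip`, the candidate rates `(1, 2, 4, 8)`, and the clip length are hypothetical choices for illustration and are not taken from the paper or its repository.

```python
import random


def sample_playback_clip(num_frames, clip_len, rates=(1, 2, 4, 8), rng=None):
    """Pick frame indices for one pretext sample and return (indices, label).

    The label is the position of the chosen stride in `rates`; a classifier
    head on top of the temporal modules would be trained to predict it.
    All parameter values here are illustrative assumptions.
    """
    rng = rng or random.Random()

    # Keep only the rates for which a clip of clip_len frames fits in the video.
    feasible = [
        i for i, rate in enumerate(rates)
        if (clip_len - 1) * rate + 1 <= num_frames
    ]
    if not feasible:
        raise ValueError("video is too short for every candidate playback rate")

    label = rng.choice(feasible)
    stride = rates[label]

    # Number of source frames the clip covers at this stride.
    span = (clip_len - 1) * stride + 1
    start = rng.randrange(num_frames - span + 1)

    indices = [start + k * stride for k in range(clip_len)]
    return indices, label


if __name__ == "__main__":
    indices, label = sample_playback_clip(num_frames=64, clip_len=8)
    print("frame indices:", indices)
    print("playback-rate label:", label)
```

In a setup like the one the abstract describes, the frames selected this way would pass through the frozen IFM, and only the added temporal modules and the rate classifier would receive gradient updates. That wiring is not shown here.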
