CV · LG · Dec 16, 2021

Masked Feature Prediction for Self-Supervised Visual Pre-Training

arXiv:2112.09133v2 · 850 citations
Originality: Highly original
AI Analysis

This paper addresses the problem of reducing reliance on labeled data for video recognition, offering a novel method with broad applicability.

The paper tackles self-supervised pre-training for video models by introducing Masked Feature Prediction (MaskFeat), which masks a portion of the input sequence and predicts features of the masked regions, achieving state-of-the-art results such as 86.7% on Kinetics-400 and 75.0% on SSv2.

We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions. We study five different types of features and find Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in line with earlier work using HOG for visual recognition. Our approach can learn abundant visual knowledge and drive large-scale Transformer-based models. Without using extra model weights or supervision, MaskFeat pre-trained on unlabeled videos achieves unprecedented results of 86.7% with MViT-L on Kinetics-400, 88.3% on Kinetics-600, 80.4% on Kinetics-700, 39.8 mAP on AVA, and 75.0% on SSv2. MaskFeat further generalizes to image input, which can be interpreted as a video with a single frame, and obtains competitive results on ImageNet.
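The objective described in the abstract (mask part of the input, then regress the HOG features of the masked regions) can be sketched roughly as follows. This is an illustrative NumPy sketch on a single grayscale frame, not the authors' implementation. The function names, the 8x8 cell size, the 9 orientation bins, the 40% masking ratio, and the mean-squared-error loss are all assumptions made for the example, and per-cell L2 normalization is used here as a simple stand-in for the local contrast normalization the abstract mentions. No model is included: the prediction is a placeholder array.

```python
import numpy as np


def hog_per_cell(image, cell=8, bins=9, eps=1e-6):
    """Gradient-orientation histogram per cell, L2-normalized per cell.

    Hypothetical helper: cell size, bin count, and normalization scheme
    are assumptions for illustration, not values taken from the source.
    """
    gy, gx = np.gradient(image.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation folded into [0, pi).
    orientation = np.mod(np.arctan2(gy, gx), np.pi)
    bin_index = np.minimum((orientation / np.pi * bins).astype(int), bins - 1)

    rows, cols = image.shape[0] // cell, image.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            window = (slice(r * cell, (r + 1) * cell),
                      slice(c * cell, (c + 1) * cell))
            # Magnitude-weighted vote of every pixel in the cell.
            hist[r, c] = np.bincount(bin_index[window].ravel(),
                                     weights=magnitude[window].ravel(),
                                     minlength=bins)
    # Stand-in for local contrast normalization: unit L2 norm per cell.
    norm = np.linalg.norm(hist, axis=-1, keepdims=True)
    return hist / (norm + eps)


def random_cell_mask(rows, cols, ratio, rng):
    """Boolean grid with round(ratio * rows * cols) cells marked as masked."""
    total = rows * cols
    masked = rng.choice(total, size=int(round(ratio * total)), replace=False)
    mask = np.zeros(total, dtype=bool)
    mask[masked] = True
    return mask.reshape(rows, cols)


def masked_feature_loss(predicted, target, mask):
    """Mean squared error computed over the masked cells only."""
    diff = (predicted - target)[mask]
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)
image = rng.random((32, 32))             # stand-in for one video frame
target = hog_per_cell(image)             # regression targets, shape (4, 4, 9)
mask = random_cell_mask(4, 4, ratio=0.4, rng=rng)
prediction = np.zeros_like(target)       # placeholder for a model's output
loss = masked_feature_loss(prediction, target, mask)
```

In an actual pre-training setup the masked cells would be hidden from the network before it produces `prediction`; restricting the loss to `mask` is what makes the task a prediction of unseen content rather than a reconstruction of visible input.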

Code Implementations: 6 repos