CV, LG | Sep 15, 2022

On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition

arXiv:2209.07474v3 | 3 citations | h-index: 48
Originality: Incremental advance
AI Analysis

This work addresses the challenge of limited labeled data for video recognition and recommends video transformers as a practical starting point for semi-supervised video research.

The paper studies video classification with few labels and finds that video transformers outperform CNNs in this regime. Trained on the labeled data alone, they also outperform complex semi-supervised CNN methods that use large-scale unlabeled data, with gains reported on both Kinetics-400 and SomethingSomething-V2.

Recently vision transformers have been shown to be competitive with convolution-based methods (CNNs) broadly across multiple vision tasks. The less restrictive inductive bias of transformers endows greater representational capacity in comparison with CNNs. However, in the image classification setting this flexibility comes with a trade-off with respect to sample efficiency, where transformers require ImageNet-scale training. This notion has carried over to video where transformers have not yet been explored for video classification in the low-labeled or semi-supervised settings. Our work empirically explores the low data regime for video classification and discovers that, surprisingly, transformers perform extremely well in the low-labeled video setting compared to CNNs. We specifically evaluate video vision transformers across two contrasting video datasets (Kinetics-400 and SomethingSomething-V2) and perform thorough analysis and ablation studies to explain this observation using the predominant features of video transformer architectures. We even show that using just the labeled data, transformers significantly outperform complex semi-supervised CNN methods that leverage large-scale unlabeled data as well. Our experiments inform our recommendation that semi-supervised learning video work should consider the use of video transformers in the future.