CV · AI · CL · Nov 21, 2022

SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training

arXiv:2211.11446v3 · 18 citations · h-index: 134
Originality: Incremental advance
AI Analysis

This work targets the computational inefficiency of video-language pre-training, a concern for both researchers and practitioners, and offers an incremental improvement over existing methods.

The paper tackles the high computational cost of video-language pre-training with SMAUG, an efficient framework that combines cross-modal masking with space-time token sparsification. It achieves competitive performance on text-to-video retrieval and video question answering while cutting pre-training costs by 1.9X or more; for example, pre-training needs only about 50 NVIDIA A6000 GPU hours.

Video-language pre-training is crucial for learning powerful multi-modal representations. However, it typically requires a massive amount of computation. In this paper, we develop SMAUG, an efficient pre-training framework for video-language models. The foundational component of SMAUG is the masked autoencoder. Unlike prior works that mask only textual inputs, our masking strategy considers both the visual and textual modalities, providing better cross-modal alignment and saving more pre-training cost. On top of that, we introduce a space-time token sparsification module, which leverages context information to further select only "important" spatial regions and temporal frames for pre-training. Coupling all these designs allows our method to achieve competitive performance on text-to-video retrieval and video question answering tasks while reducing pre-training costs by 1.9X or more. For example, SMAUG needs only about 50 NVIDIA A6000 GPU hours of pre-training to attain competitive performance on these two video-language tasks across six popular benchmarks.
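The two cost-saving ideas in the abstract, masking tokens in both modalities and then keeping only the "important" visible video tokens, can be illustrated with a toy sketch. This is not the authors' implementation: the function names `random_mask` and `keep_top_k`, the mask and keep ratios, and the random importance scores are all assumptions made for illustration. In particular, the paper derives importance from context information, which is replaced here by placeholder scores.

```python
import random


def random_mask(tokens, mask_ratio, rng):
    """Hide a random fraction of tokens.

    Returns the visible tokens as (index, token) pairs and the sorted
    list of masked indices.
    """
    n_mask = int(len(tokens) * mask_ratio)
    masked = set(rng.sample(range(len(tokens)), n_mask))
    visible = [(i, t) for i, t in enumerate(tokens) if i not in masked]
    return visible, sorted(masked)


def keep_top_k(visible, scores, keep_ratio):
    """Keep the highest-scoring visible tokens, restored to index order."""
    k = max(1, int(len(visible) * keep_ratio))
    ranked = sorted(visible, key=lambda item: scores[item[0]], reverse=True)
    return sorted(ranked[:k])


rng = random.Random(0)

# Toy video: 4 frames x 4 patches, each token named by (frame, patch).
video_tokens = [(f, p) for f in range(4) for p in range(4)]
text_tokens = "a dragon guards a pile of gold".split()

# Step 1: mask both modalities (ratios are illustrative assumptions).
vis_video, masked_video = random_mask(video_tokens, 0.5, rng)
vis_text, masked_text = random_mask(text_tokens, 0.25, rng)

# Step 2: sparsify the visible video tokens. The scores are random
# placeholders standing in for a context-derived importance measure.
scores = {i: rng.random() for i, _ in vis_video}
kept = keep_top_k(vis_video, scores, 0.5)

print(len(video_tokens), "->", len(vis_video), "->", len(kept))
```

With these assumed ratios the encoder would see 4 of the original 16 video tokens. The sketch shows only how the two steps compound to shrink the token count and makes no claim about the resulting accuracy.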
