CV CL May 20, 2021

VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding

Hu Xu, Gargi Ghosh, Po-Yao Huang, Prahal Arora, Masoumeh Aminzadeh, Christoph Feichtenhofer, Florian Metze, Luke Zettlemoyer

arXiv:2105.09996v351.8737 citations Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for more flexible and efficient multi-modal models for video understanding tasks, though it appears incremental because it builds on existing pre-training frameworks.

The paper tackles the problem of task-specific limitations in video-language pre-training by introducing a task-agnostic approach with new masking schemes that mix modalities while maintaining separability, resulting in strong performance across a wider range of tasks, often outperforming task-specific methods.

We present a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both, for a variety of end tasks. Existing pre-training approaches are task-specific: they adopt either a single cross-modal encoder that requires both modalities, which limits their use for retrieval-style end tasks, or more complex multitask learning with two unimodal encoders, which limits early cross-modal fusion. We instead introduce new pretraining masking schemes that better mix across modalities (e.g. by forcing masks for text to predict the closest video embeddings) while also maintaining separability (e.g. unimodal predictions are sometimes required, without using all the input). Experimental results show strong performance across a wider range of tasks than any previous method, often outperforming task-specific pre-training. Code is made available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.
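The masking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the three mode labels, and the 15% token-masking rate are hypothetical choices made for illustration, and the dot product stands in for whatever similarity the real model uses. The first function picks which positions of a joint video+text sequence must be predicted, sometimes hiding one whole modality so it has to be recovered from the other; the second selects the candidate video embedding closest to a prediction made at a masked position.

```python
import random

def sample_mask(n_video, n_text, rng, mode=None, p_token=0.15):
    """Return (video_mask, text_mask); True marks a position to predict.

    mode (hypothetical labels):
      "video"  - hide every video position (recover from text only)
      "text"   - hide every text position (recover from video only)
      "random" - hide positions independently with probability p_token
    p_token=0.15 is an assumed rate, not a value taken from the paper.
    """
    if mode is None:
        mode = rng.choice(["video", "text", "random"])
    if mode == "video":
        return [True] * n_video, [False] * n_text
    if mode == "text":
        return [False] * n_video, [True] * n_text
    video_mask = [rng.random() < p_token for _ in range(n_video)]
    text_mask = [rng.random() < p_token for _ in range(n_text)]
    return video_mask, text_mask

def closest_video_embedding(pred, video_embs):
    """Index of the candidate video embedding with the highest dot product."""
    scores = [sum(p * v for p, v in zip(pred, emb)) for emb in video_embs]
    return max(range(len(scores)), key=scores.__getitem__)

if __name__ == "__main__":
    rng = random.Random(0)
    video_mask, text_mask = sample_mask(4, 3, rng, mode="text")
    print(video_mask, text_mask)
    print(closest_video_embedding([1.0, 0.0], [[0.0, 1.0], [2.0, 0.0]]))
```

Masking a whole modality is what forces cross-modal prediction in this sketch, while the "random" mode leaves some context from both modalities visible.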
