CV | LG | Jul 15, 2022

Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models

arXiv:2207.07646v1 | 26 citations | h-index: 25
Originality: Incremental advance
AI Analysis

The paper addresses the problem of recognizing novel video classes without labeled data, a concern for researchers and practitioners in video analysis. It represents an incremental advance: it adds multimodal fusion to an existing paradigm.

The paper tackles open-vocabulary video classification by extending pre-trained vision-language models to incorporate motion and audio modalities, achieving state-of-the-art results on benchmarks like UCF and HMDB with significant performance gains over existing methods.

Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs is becoming a promising paradigm for open-vocabulary visual recognition. In this work, we extend this paradigm by leveraging motion and audio that naturally exist in video. We present MOV, a simple yet effective method for Multimodal Open-Vocabulary video classification. In MOV, we directly use the vision encoder from pre-trained VLMs with minimal modifications to encode video, optical flow and audio spectrogram. We design a cross-modal fusion mechanism to aggregate complementary multimodal information. Experiments on Kinetics-700 and VGGSound show that introducing flow or audio modality brings large performance gains over the pre-trained VLM and existing methods. Specifically, MOV greatly improves the accuracy on base classes, while generalizing better on novel classes. MOV achieves state-of-the-art results on UCF and HMDB zero-shot video classification benchmarks, significantly outperforming both traditional zero-shot methods and recent methods based on VLMs. Code and models will be released.
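
The general open-vocabulary idea in the abstract can be illustrated with a toy sketch: each modality is encoded into the same embedding space as the class-name text, and a clip is assigned to whichever class name scores highest. This is not the paper's implementation. The embeddings below are made-up three-dimensional vectors rather than encoder outputs, all function names are hypothetical, and plain averaging stands in for MOV's learned cross-modal fusion mechanism, whose details the abstract does not give.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_by_averaging(modality_embeddings):
    """ASSUMPTION: simple late fusion by averaging unit-normalized embeddings.
    MOV uses a learned cross-modal fusion mechanism instead."""
    normalized = [l2_normalize(e) for e in modality_embeddings]
    dim = len(normalized[0])
    return [sum(e[i] for e in normalized) / len(normalized) for i in range(dim)]


def cosine_scores(query, class_embeddings):
    """Cosine similarity between the query and every class-name embedding."""
    q = l2_normalize(query)
    return {
        name: sum(a * b for a, b in zip(q, l2_normalize(emb)))
        for name, emb in class_embeddings.items()
    }


def classify(modality_embeddings, class_embeddings):
    """Return the best-scoring class name and the full score table."""
    fused = fuse_by_averaging(modality_embeddings)
    scores = cosine_scores(fused, class_embeddings)
    return max(scores, key=scores.get), scores


# Toy text embeddings for class names (hypothetical values, not encoder outputs).
# Supporting a new class only requires adding its text embedding here.
class_embeddings = {
    "playing guitar": [1.0, 0.0, 0.0],
    "swimming": [0.0, 1.0, 0.0],
    "skateboarding": [0.0, 0.0, 1.0],
}

# Toy per-modality embeddings for one clip: video frames, optical flow, audio.
video_emb = [0.2, 0.1, 0.9]
flow_emb = [0.1, 0.3, 0.8]
audio_emb = [0.3, 0.2, 0.5]

label, scores = classify([video_emb, flow_emb, audio_emb], class_embeddings)
print(label)
```

Because the label set is just a dictionary of text embeddings, classes unseen during training can be scored the same way as base classes, which is what makes the setup open-vocabulary.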

