CV · Mar 12, 2023

Accommodating Audio Modality in CLIP for Multimodal Processing

arXiv:2303.06591v1 · 18 citations · h-index: 17
Originality: Incremental advance
AI Analysis

This work addresses the problem of integrating audio into multimodal AI systems for applications such as video retrieval and captioning, representing an incremental advancement over existing vision-language models.

The paper tackles the challenge of extending multimodal processing to include audio by adapting the CLIP model to handle vision, language, and audio modalities, achieving state-of-the-art performance on benchmark datasets like MSR-VTT, VATEX, and Audiocaps.

Multimodal processing has attracted much attention lately, especially with the success of pre-training. However, the exploration has mainly focused on vision-language pre-training, as introducing more modalities can greatly complicate model design and optimization. In this paper, we extend the state-of-the-art Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing. Specifically, we apply inter-modal and intra-modal contrastive learning to explore the correlation between audio and other modalities in addition to the inner characteristics of the audio modality. Moreover, we design an audio type token to dynamically learn different types of audio information for different scenarios, as both verbal and nonverbal heterogeneous information is conveyed in general audio. Our proposed CLIP4VLA model is validated on different downstream tasks, including video retrieval and video captioning, and achieves state-of-the-art performance on the benchmark datasets MSR-VTT, VATEX, and Audiocaps.
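
To make the inter-modal contrastive objective mentioned in the abstract more concrete, here is a minimal sketch of a symmetric InfoNCE-style loss of the kind used in CLIP-like training. It is not the authors' implementation: the function names, the plain-list embeddings, and the temperature value of 0.07 are assumptions chosen for illustration, and the abstract does not specify the exact loss used by CLIP4VLA.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def symmetric_contrastive_loss(audio_embs, other_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Hypothetical helper: row i of audio_embs is assumed to be the positive
    match for row i of other_embs (e.g. text or video); every other row in
    the batch is treated as a negative.
    """
    audio = [l2_normalize(v) for v in audio_embs]
    other = [l2_normalize(v) for v in other_embs]
    n = len(audio)

    # Cosine similarities between every audio/other pair, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(audio[i], other[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    # Audio -> other direction: the positive for row i sits on the diagonal.
    loss_a2o = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Other -> audio direction: same computation over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_o2a = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (loss_a2o + loss_o2a)


# Correctly paired embeddings should score a lower loss than mismatched ones.
paired = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_paired = symmetric_contrastive_loss(paired, paired)
loss_swapped = symmetric_contrastive_loss(paired, swapped)
```

Under the same assumptions, an intra-modal variant could reuse this function by passing embeddings of two different views of the same audio clips as both arguments, though how the paper constructs those views is not stated in the abstract.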

Code Implementations: 1 repo