MM CV Aug 16, 2022

M2HF: Multi-level Multi-modal Hybrid Fusion for Text-Video Retrieval

Shuo Liu, Weize Quan, Ming Zhou, Sihong Chen, Jian Kang, Zhe Zhao, Chen Chen, Dong-Ming Yan

arXiv:2208.07664v12.33 citations h-index: 15 Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of effectively integrating multi-modal cues in videos for retrieval tasks, offering an incremental improvement over existing methods.

The paper tackles the problem of text-video retrieval by proposing a multi-level multi-modal hybrid fusion network to explore comprehensive interactions between text queries and video modalities, achieving state-of-the-art results such as Rank@1 scores of 64.9% on MSR-VTT and 68.2% on MSVD.
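For readers unfamiliar with the metric, Rank@1 (more generally Rank@K) is the percentage of text queries whose ground-truth video appears among the top K retrieved videos. The sketch below is a minimal illustration, not the paper's evaluation code. It assumes a square similarity matrix in which query `i` is paired with video `i`; the function name `rank_at_k` and the toy scores are hypothetical.

```python
def rank_at_k(sim, k=1):
    """Percentage of text queries whose matching video is ranked in the top k.

    Assumption: sim[i][j] is the similarity of text query i to video j, and
    the ground-truth video for query i sits at index i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank of the true video = 1 + number of videos scored strictly higher.
        rank = 1 + sum(1 for s in row if s > row[i])
        if rank <= k:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy scores (made up for illustration): 3 queries against 3 videos.
sim = [
    [0.9, 0.1, 0.3],  # query 0: true video ranked 1st
    [0.2, 0.4, 0.8],  # query 1: true video ranked 2nd
    [0.5, 0.3, 0.7],  # query 2: true video ranked 1st
]

r1 = rank_at_k(sim, k=1)  # two of three queries hit
r2 = rank_at_k(sim, k=2)  # all three queries hit
```

With these toy scores, two of the three queries retrieve their video first, so Rank@1 is two thirds expressed as a percentage, and Rank@2 is 100.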

Videos contain multi-modal content, and exploring multi-level cross-modal interactions with natural language queries can greatly benefit the text-video retrieval (TVR) task. However, recent methods that apply the large-scale pre-trained model CLIP to TVR do not focus on multi-modal cues in videos. Furthermore, traditional methods that simply concatenate multi-modal features do not exploit fine-grained cross-modal information in videos. In this paper, we propose a multi-level multi-modal hybrid fusion (M2HF) network to explore comprehensive interactions between text queries and the content of each modality in videos. Specifically, M2HF first early-fuses visual features extracted by CLIP with audio and motion features extracted from videos, obtaining audio-visual fusion features and motion-visual fusion features, respectively. The multi-modal alignment problem is also considered in this process. Then, visual features, audio-visual fusion features, motion-visual fusion features, and texts extracted from videos establish cross-modal relationships with caption queries in a multi-level way. Finally, the retrieval outputs from all levels are late-fused to obtain the final text-video retrieval results. Our framework provides two training strategies: an ensemble manner and an end-to-end manner. Moreover, a novel multi-modal balance loss function is proposed to balance the contributions of each modality for efficient end-to-end training. M2HF achieves state-of-the-art results on various benchmarks, e.g., Rank@1 of 64.9%, 68.2%, 33.2%, 57.1%, and 57.8% on MSR-VTT, MSVD, LSMDC, DiDeMo, and ActivityNet, respectively.
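The early-fusion, multi-level matching, and late-fusion steps described in the abstract can be sketched roughly as follows. This is only an illustration under loud assumptions: the abstract does not specify the fusion modules, so the convex-combination early fusion, the cosine-similarity matching, the weighted-average late fusion, and every name here (`early_fuse`, `late_fuse`, `score`, the `alpha` and `weights` parameters) are hypothetical stand-ins, not the authors' implementation. The level built from texts extracted from videos is omitted for brevity.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def early_fuse(visual, other, alpha=0.5):
    """Hypothetical early fusion: convex combination of two feature vectors."""
    return [alpha * v + (1 - alpha) * o for v, o in zip(visual, other)]


def late_fuse(level_scores, weights):
    """Hypothetical late fusion: weighted average of per-level scores."""
    return sum(w * s for w, s in zip(weights, level_scores)) / sum(weights)


def score(text, video, weights=(1.0, 1.0, 1.0)):
    """Match a text embedding against three video levels, then late-fuse."""
    audio_visual = early_fuse(video["visual"], video["audio"])
    motion_visual = early_fuse(video["visual"], video["motion"])
    levels = [
        cosine(text, video["visual"]),  # visual level
        cosine(text, audio_visual),     # audio-visual level
        cosine(text, motion_visual),    # motion-visual level
    ]
    return late_fuse(levels, weights)


# Toy 2-D embeddings (made up): video_a matches the query, video_b does not.
text = [1.0, 0.0]
video_a = {"visual": [1.0, 0.0], "audio": [1.0, 0.0], "motion": [1.0, 0.0]}
video_b = {"visual": [0.0, 1.0], "audio": [0.0, 1.0], "motion": [0.0, 1.0]}
ranked = sorted([("a", video_a), ("b", video_b)],
                key=lambda item: score(text, item[1]), reverse=True)
```

In this sketch the per-level weights play the role of deciding how much each modality contributes to the final ranking; the paper addresses that balance with its multi-modal balance loss during end-to-end training, which is not modelled here.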
