CV · IR · Mar 26, 2025

MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion

arXiv:2503.20698v4 · 8 citations · h-index: 16 · SIGIR
Originality: Incremental advance
AI Analysis

The paper addresses the need for more balanced multimodal video retrieval for users with diverse information needs. It offers a novel method for a known bottleneck rather than a foundational advance.

The paper tackles multimodal video retrieval, where existing models overly prioritize visual signals. It develops MMMORRF, a system that extracts text and features from the visual and audio modalities and combines them with a novel modality-aware weighted reciprocal rank fusion. The reported result is an 81% improvement in nDCG@20 over leading multimodal encoders and a 37% improvement over single-modality retrieval.

Videos inherently contain multiple modalities, including visual events, text overlays, sounds, and speech, all of which are important for retrieval. However, state-of-the-art multimodal language models like VAST and LanguageBind are built on vision-language models (VLMs), and thus overly prioritize visual signals. Retrieval benchmarks further reinforce this bias by focusing on visual queries and neglecting other modalities. We create a search system MMMORRF that extracts text and features from both visual and audio modalities and integrates them with a novel modality-aware weighted reciprocal rank fusion. MMMORRF is both effective and efficient, demonstrating practicality in searching videos based on users' information needs instead of visual descriptive queries. We evaluate MMMORRF on MultiVENT 2.0 and TVR, two multimodal benchmarks designed for more targeted information needs, and find that it improves nDCG@20 by 81% over leading multimodal encoders and 37% over single-modality retrieval, demonstrating the value of integrating diverse modalities.
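
The fusion step described above can be illustrated with a minimal sketch of weighted reciprocal rank fusion. This is not the authors' implementation: the abstract does not say how the modality-aware weights are determined, so the fixed per-modality weights, the modality names (`ocr`, `speech`, `visual`), the video ids, and the function name `weighted_rrf` are all assumptions made for illustration. The constant `k = 60` is the value commonly used with reciprocal rank fusion, not one taken from the paper.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60):
    """Fuse per-modality rankings with weighted reciprocal rank fusion.

    rankings: dict mapping a modality name to a list of doc ids, best first.
    weights:  dict mapping a modality name to a non-negative weight
              (hypothetical fixed weights; the paper's scheme may differ).
    k:        smoothing constant; 60 is a common default for RRF.
    """
    scores = defaultdict(float)
    for modality, ranked_ids in rankings.items():
        w = weights.get(modality, 0.0)
        # Ranks are 1-based; a doc absent from a ranking contributes nothing.
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += w / (k + rank)
    # Sort by fused score, highest first; break ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings from three retrieval components over the same videos.
rankings = {
    "ocr": ["v2", "v1", "v3"],
    "speech": ["v1", "v3"],
    "visual": ["v3", "v2", "v1"],
}
# Assumed weights that favour text-bearing modalities over the visual one.
weights = {"ocr": 0.4, "speech": 0.4, "visual": 0.2}

fused = weighted_rrf(rankings, weights)
print(fused)
```

Because the fusion works on ranks rather than raw scores, the component retrievers do not need comparable score scales, which is one reason rank fusion suits a modular system of this kind.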

