CV · Oct 31, 2025

RzenEmbed: Towards Comprehensive Multimodal Retrieval

arXiv:2510.27350v1 · 13 citations · h-index: 9 · Has Code
Originality: Incremental advance
AI Analysis

This work extends multimodal retrieval beyond natural images to videos and visual documents, closing a gap that matters to both researchers and practitioners. The advance is incremental, since it builds on existing CLIP-based frameworks.

The paper tackles the limited support for videos and visual documents in multimodal retrieval by introducing RzenEmbed, a unified embedding framework. RzenEmbed achieves state-of-the-art performance on the MMEB benchmark, with the best overall score and the best results on the video and visual document retrieval tasks.

The rapid advancement of Multimodal Large Language Models (MLLMs) has extended CLIP-based frameworks to produce powerful, universal embeddings for retrieval tasks. However, existing methods primarily focus on natural images, offering limited support for other crucial visual modalities such as videos and visual documents. To bridge this gap, we introduce RzenEmbed, a unified framework to learn embeddings across a diverse set of modalities, including text, images, videos, and visual documents. We employ a novel two-stage training strategy to learn discriminative representations. The first stage focuses on foundational text and multimodal retrieval. In the second stage, we introduce an improved InfoNCE loss, incorporating two key enhancements. First, a hardness-weighted mechanism guides the model to prioritize challenging samples by assigning them higher weights within each batch. Second, we implement an approach to mitigate the impact of false negatives and alleviate data noise. This strategy not only enhances the model's discriminative power but also improves its instruction-following capabilities. We further boost performance with a learnable temperature parameter and model souping. RzenEmbed sets a new state-of-the-art on the MMEB benchmark. It not only achieves the best overall score but also outperforms all prior work on the challenging video and visual document retrieval tasks. Our models are available at https://huggingface.co/qihoo360/RzenEmbed.
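
The abstract does not give the exact form of the improved InfoNCE loss, so the following is only a minimal NumPy sketch of the general idea: in-batch negatives are reweighted by hardness, and suspected false negatives are dropped. The function name weighted_infonce, the exponential weighting exp(alpha * sim), the margin-based filter, and the default values of tau, alpha, and fn_margin are all assumptions made for illustration, not the authors' formulation. The temperature is a fixed float here, whereas the paper describes it as learnable.

```python
import numpy as np


def weighted_infonce(q, d, tau=0.05, alpha=2.0, fn_margin=0.1):
    """Hardness-weighted InfoNCE with a simple false-negative filter.

    q, d: (B, D) arrays of L2-normalised query / target embeddings, where
    d[i] is the positive for q[i] and every other row is an in-batch negative.
    tau, alpha and fn_margin are illustrative hyperparameters (assumptions),
    not values taken from the paper.
    """
    sim = q @ d.T                        # (B, B) cosine similarities
    pos = np.diag(sim)                   # similarity of each query to its positive
    b = sim.shape[0]
    is_neg = ~np.eye(b, dtype=bool)      # off-diagonal entries are negatives

    # False-negative filter (assumed heuristic): drop negatives that score
    # higher than the positive by more than fn_margin.
    keep = is_neg & (sim <= pos[:, None] + fn_margin)

    # Hardness weights (assumed form): grow with similarity, then rescale so
    # the kept negatives in each row have mean weight 1.
    w = np.where(keep, np.exp(alpha * sim), 0.0)
    n_keep = np.maximum(keep.sum(axis=1, keepdims=True), 1)
    w = w * n_keep / np.maximum(w.sum(axis=1, keepdims=True), 1e-12)

    # Weighted softmax denominator, shifted by the row maximum for stability.
    logits = sim / tau
    m = logits.max(axis=1, keepdims=True)
    neg_sum = (w * np.exp(logits - m)).sum(axis=1)
    pos_logit = pos / tau - m[:, 0]
    loss = -(pos_logit - np.log(np.exp(pos_logit) + neg_sum))
    return loss.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(4, 16))
    d = q + 0.1 * rng.normal(size=(4, 16))   # noisy copies act as positives
    q /= np.linalg.norm(q, axis=1, keepdims=True)
    d /= np.linalg.norm(d, axis=1, keepdims=True)
    print(weighted_infonce(q, d))
```

With alpha=0 and a large fn_margin, every negative is kept with weight 1 and the function reduces to the standard in-batch InfoNCE loss, which makes it easy to compare against a plain baseline.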

