CV · Jun 8, 2025

Guiding Cross-Modal Representations with MLLM Priors via Preference Alignment

arXiv:2506.06970v2 · 3 citations · h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses the modality alignment problem in multimodal AI for improved retrieval tasks, representing an incremental advancement over existing methods.

The paper tackles the modality gap in cross-modal retrieval by introducing MAPLE, a framework that uses MLLM priors for preference alignment, resulting in substantial gains in fine-grained retrieval performance.

Despite Contrastive Language-Image Pretraining (CLIP)'s remarkable capability to retrieve content across modalities, a substantial modality gap persists in its feature space. Intriguingly, we discover that off-the-shelf MLLMs (Multimodal Large Language Models) demonstrate powerful inherent modality alignment properties. While recent MLLM-based retrievers with unified architectures partially mitigate this gap, their reliance on coarse modality alignment mechanisms fundamentally limits their potential. In this work, we introduce MAPLE (Modality-Aligned Preference Learning for Embeddings), a novel framework that leverages the fine-grained alignment priors inherent in MLLMs to guide cross-modal representation learning. MAPLE formulates the learning process as reinforcement learning with two key components: (1) automatic preference data construction using an off-the-shelf MLLM, and (2) a new Relative Preference Alignment (RPA) loss, which adapts Direct Preference Optimization (DPO) to the embedding learning setting. Experimental results show that our preference-guided alignment achieves substantial gains in fine-grained cross-modal retrieval, underscoring its effectiveness in handling nuanced semantic distinctions.
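
The abstract does not give the RPA loss itself, so the following is only a minimal sketch of how a DPO-style objective might be carried over to embeddings. It assumes that DPO's log-probabilities are replaced by query-candidate cosine similarities, and that a frozen reference model supplies baseline similarities. The names `cosine`, `dpo_style_embedding_loss`, `ref_sim_pref`, `ref_sim_rej`, and `beta` are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dpo_style_embedding_loss(sim_pref, sim_rej,
                             ref_sim_pref=0.0, ref_sim_rej=0.0, beta=1.0):
    """Illustrative DPO-style preference loss on similarity scores.

    Assumption (not from the paper): similarities play the role that
    log-probabilities play in DPO. The loss is
        -log(sigmoid(beta * margin)),
    where margin compares how much the trained model prefers the
    preferred candidate over the rejected one, relative to a reference.
    """
    margin = (sim_pref - ref_sim_pref) - (sim_rej - ref_sim_rej)
    z = -beta * margin
    # -log(sigmoid(x)) equals softplus(-x); this form is numerically stable.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Toy usage: a query embedding, a preferred candidate, and a rejected one.
query = [1.0, 0.0]
preferred = [0.9, 0.1]
rejected = [0.1, 0.9]
loss = dpo_style_embedding_loss(
    cosine(query, preferred), cosine(query, rejected), beta=5.0
)
```

Under these assumptions the loss shrinks as the preferred candidate moves closer to the query than the rejected one, which is the qualitative behavior a preference-alignment objective for retrieval would need. In the described framework the preferred and rejected labels would come from an off-the-shelf MLLM rather than being hand-picked as in this toy example.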

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
