CL | Jun 26, 2025

DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning

Kang He, Yuzhe Ding, Haining Wang, Fei Li, Chong Teng, Donghong Ji

arXiv:2506.21096v29.63 citations | h-index: 11 | ACL

Originality: Incremental advance

AI Analysis

This work targets the quality of multimodal sentence representations for natural language processing applications, though it appears incremental because it builds on existing alignment methods.

The paper tackles cross-modal misalignment bias and intra-modal semantic divergence in multimodal sentence representation learning by proposing DALR with dual-level alignment learning, achieving superior performance over state-of-the-art baselines on semantic textual similarity and transfer tasks.

Previous multimodal sentence representation learning methods have achieved impressive performance. However, most approaches focus on aligning images and text at a coarse level, facing two critical challenges: cross-modal misalignment bias and intra-modal semantic divergence, which significantly degrade sentence representation quality. To address these challenges, we propose DALR (Dual-level Alignment Learning for Multimodal Sentence Representation). For cross-modal alignment, we propose a consistency learning module that softens negative samples and utilizes semantic similarity from an auxiliary task to achieve fine-grained cross-modal alignment. Additionally, we contend that sentence relationships go beyond binary positive-negative labels, exhibiting a more intricate ranking structure. To better capture these relationships and enhance representation quality, we integrate ranking distillation with global intra-modal alignment learning. Comprehensive experiments on semantic textual similarity (STS) and transfer (TR) tasks validate the effectiveness of our approach, consistently demonstrating its superiority over state-of-the-art baselines.
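
The abstract does not give the loss functions, so the following is only a minimal NumPy sketch of the two general ideas it names: a contrastive objective whose one-hot targets are softened by similarities from an auxiliary source, and a listwise ranking distillation term. It is not the paper's implementation. The function names, the mixing weight `alpha`, the temperatures, and the `teacher_sim` matrix are all assumptions introduced for illustration.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def softmax(x, axis=-1):
    return np.exp(log_softmax(x, axis=axis))

def l2_normalize(x):
    # Row-wise unit normalization so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def soft_contrastive_loss(text_emb, image_emb, teacher_sim, alpha=0.2, tau=0.05):
    """Text-to-image contrastive loss with softened targets (assumed form).

    Targets mix the usual one-hot labels with a softmax over `teacher_sim`,
    a hypothetical similarity matrix from an auxiliary model, so that
    semantically close "negatives" are penalized less.
    """
    logits = l2_normalize(text_emb) @ l2_normalize(image_emb).T / tau
    n = logits.shape[0]
    targets = (1 - alpha) * np.eye(n) + alpha * softmax(teacher_sim / tau, axis=1)
    # Cross-entropy between the soft targets and the predicted distribution.
    return float(-(targets * log_softmax(logits, axis=1)).sum(axis=1).mean())

def ranking_distillation_loss(student_sim, teacher_sim, tau_s=0.05, tau_t=0.05):
    """Listwise distillation (assumed form): KL(teacher || student) per row.

    Each row of a similarity matrix is turned into a distribution over the
    other sentences; the student is pushed toward the teacher's ranking.
    """
    log_p = log_softmax(teacher_sim / tau_t, axis=1)
    log_q = log_softmax(student_sim / tau_s, axis=1)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=1).mean())
```

With `alpha=0` the first function reduces to a standard one-directional InfoNCE loss, which makes the effect of softening easy to isolate. The second function returns zero when the student and teacher similarity matrices (and temperatures) coincide.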
