CV CL Sep 30, 2024

Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval

Yabing Wang, Le Wang, Qiang Zhou, Zhibin Wang, Hao Li, Gang Hua, Wei Tang

arXiv:2409.19961v116.427 citations h-index: 10 Has Code

Originality: Incremental advance

AI Analysis

This work offers an incremental advance for researchers and practitioners working on cross-lingual cross-modal retrieval, particularly in settings where human-labeled cross-modal data is unavailable.

This paper addresses the challenge of cross-lingual cross-modal retrieval (CCR) without human-labeled data by proposing LECCR, a method that uses a multimodal large language model (MLLM) to generate detailed visual descriptions and aggregate them into multi-view semantic slots. These slots enhance visual features, narrowing the semantic gap between modalities and generating local visual semantics for multi-level matching. The method also introduces softened matching with English guidance to improve inter-modal correspondences, demonstrating effectiveness across four CCR benchmarks: Multi30K, MSCOCO, VATEX, and MSR-VTT-CN.

Cross-lingual cross-modal retrieval (CCR) aims to retrieve visually relevant content based on non-English queries, without relying on human-labeled cross-modal data pairs during training. One popular approach uses machine translation (MT) to create pseudo-parallel data pairs, establishing correspondence between visual and non-English textual data. However, aligning their representations is challenging because of the significant semantic gap between vision and text, as well as the lower quality of non-English representations caused by pre-trained encoders and data noise. To overcome these challenges, we propose LECCR, a novel solution that incorporates a multimodal large language model (MLLM) to improve the alignment between visual and non-English representations. Specifically, we first employ the MLLM to generate detailed visual content descriptions and aggregate them into multi-view semantic slots that encapsulate different semantics. Then, we take these semantic slots as internal features and leverage them to interact with the visual features. By doing so, we enhance the semantic information within the visual features, narrowing the semantic gap between modalities and generating local visual semantics for subsequent multi-level matching. Additionally, to further enhance the alignment between visual and non-English features, we introduce softened matching under English guidance. This approach provides more comprehensive and reliable inter-modal correspondences between visual and non-English features. Extensive experiments on four CCR benchmarks, i.e., Multi30K, MSCOCO, VATEX, and MSR-VTT-CN, demonstrate the effectiveness of our proposed method. Code: https://github.com/LiJiaBei-7/leccr.
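To make the idea of softened matching under English guidance more concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not specify the loss, so this assumes that English-to-visual similarities are turned into a temperature-scaled softmax distribution and used as soft targets for the non-English-to-visual distribution through a KL divergence. All function names, the temperature value, and the toy feature vectors are hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def similarity_rows(text_feats, visual_feats):
    """Cosine similarity of every text feature against every visual feature."""
    texts = [normalize(x) for x in text_feats]
    visuals = [normalize(x) for x in visual_feats]
    return [[dot(t, v) for v in visuals] for t in texts]


def softened_matching_loss(visual, english, non_english, temperature=0.1):
    """Mean KL(English distribution || non-English distribution) over a batch.

    Assumption: row i of `english` and `non_english` are captions of the same
    visual item, and the English similarities act as soft labels instead of a
    one-hot target.
    """
    en_sim = similarity_rows(english, visual)
    xx_sim = similarity_rows(non_english, visual)
    loss = 0.0
    for en_row, xx_row in zip(en_sim, xx_sim):
        target = softmax(en_row, temperature)  # soft labels from English
        pred = softmax(xx_row, temperature)    # non-English prediction
        loss += sum(p * (math.log(p) - math.log(q)) for p, q in zip(target, pred))
    return loss / len(en_sim)


# Toy 2-D features, invented purely for illustration.
visual = [[1.0, 0.0], [0.0, 1.0]]
english = [[0.9, 0.1], [0.2, 0.8]]
noisy_non_english = [[0.6, 0.4], [0.5, 0.5]]

aligned_loss = softened_matching_loss(visual, english, english)
noisy_loss = softened_matching_loss(visual, english, noisy_non_english)
print(aligned_loss, noisy_loss)
```

In this sketch the loss vanishes when the non-English features induce the same similarity distribution as the English ones, and grows as the two distributions diverge, which is one plausible reading of how English guidance could supply softer, more reliable correspondences than hard one-to-one labels.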

