CLIR | Mar 12, 2025

xVLM2Vec: Adapting LVLM-based embedding models to multilinguality using Self-Knowledge Distillation

arXiv:2503.09313v2 | 3 citations | h-index: 8
Originality: Incremental advance
AI Analysis

The paper addresses the limited multilingual and multimodal embedding capabilities available to AI researchers and practitioners. The advance is incremental, since it adapts existing models rather than introducing a new architecture.

The paper tackles the lack of multilingual and multimodal embedding models by proposing a method to adapt Large Vision-Language Models trained on English data, resulting in improved performance in extracting such embeddings, and introduces a new benchmark for evaluation.

In the current literature, most embedding models are based on the encoder-only transformer architecture to extract a dense and meaningful representation of the given input, which can be a text, an image, and more. With the recent advances in language modeling thanks to the introduction of Large Language Models, the possibility of extracting embeddings from these large and extensively trained models has been explored. However, current studies focus on textual embeddings in English, which is also the main language on which these models have been trained. Furthermore, there are very few models that consider multimodal and multilingual input. In light of this, we propose an adaptation methodology for Large Vision-Language Models trained on English language data to improve their performance in extracting multilingual and multimodal embeddings. Finally, we design and introduce a benchmark to evaluate the effectiveness of multilingual and multimodal embedding models.

