CLIR · Jan 1, 2025

LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models

arXiv:2501.00874v3 · 1 citation · h-index: 16
Originality: Incremental advance
AI Analysis

This paper addresses the need for better multilingual text embeddings in AI applications. It offers a method that improves performance without explicit multilingual training data, though the advance is incremental because it adapts existing models.

The paper tackles the limited multilingual capabilities of LLM-based embedding models by introducing LUSIFER, a zero-shot approach that adapts these models to multilingual tasks without multilingual supervision. The authors report significant performance gains across a range of embedding tasks, especially for medium- and low-resource languages.

Recent advancements in large language model (LLM)-based embedding models have established new state-of-the-art benchmarks for text embedding tasks, particularly in dense vector-based retrieval. However, these models predominantly focus on English, leaving multilingual embedding capabilities largely unexplored. To address this limitation, we present LUSIFER, a novel zero-shot approach that adapts LLM-based embedding models for multilingual tasks without requiring multilingual supervision. LUSIFER's architecture combines a multilingual encoder, serving as a language-universal learner, with an LLM-based embedding model optimized for embedding-specific tasks. These components are seamlessly integrated through a minimal set of trainable parameters that act as a connector, effectively transferring the multilingual encoder's language understanding capabilities to the specialized embedding model. Additionally, to comprehensively evaluate multilingual embedding performance, we introduce a new benchmark encompassing 5 primary embedding tasks, 123 diverse datasets, and coverage across 14 languages. Extensive experimental results demonstrate that LUSIFER significantly enhances multilingual performance across various embedding tasks, particularly for medium- and low-resource languages, without requiring explicit multilingual training data.
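The encoder, connector, and embedding-model pipeline described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. Every name here (`frozen_multilingual_encoder`, `Connector`, `frozen_llm_embedder`, `embed`) and both dimensions are hypothetical. The abstract does not say what form the connector takes, so the sketch assumes a single linear projection, and it assumes the two large components stay fixed so that the connector holds the only trainable parameters. Both "models" are deterministic stand-ins written in plain Python, used only to show how the data flows.

```python
import math
import random

ENCODER_DIM = 4  # hypothetical hidden size of the multilingual encoder
LLM_DIM = 6      # hypothetical input size of the LLM-based embedding model


def frozen_multilingual_encoder(tokens):
    """Stand-in encoder: returns one ENCODER_DIM vector per token."""
    states = []
    for tok in tokens:
        rng = random.Random(tok)  # seeded by the token, so output is deterministic
        states.append([rng.uniform(-1.0, 1.0) for _ in range(ENCODER_DIM)])
    return states


class Connector:
    """Assumed linear connector: the only trainable parameters in this sketch."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.weight = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)]
                       for _ in range(out_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, states):
        # Project every encoder state into the LLM's input space.
        return [
            [sum(w * x for w, x in zip(row, state)) + b
             for row, b in zip(self.weight, self.bias)]
            for state in states
        ]


def frozen_llm_embedder(soft_tokens):
    """Stand-in embedder: tanh, mean-pool over tokens, then L2-normalize."""
    activated = [[math.tanh(v) for v in vec] for vec in soft_tokens]
    pooled = [sum(col) / len(activated) for col in zip(*activated)]
    norm = math.sqrt(sum(v * v for v in pooled)) or 1.0
    return [v / norm for v in pooled]


def embed(text, connector):
    """Text -> encoder states -> connector -> embedding model -> vector."""
    tokens = text.split()
    return frozen_llm_embedder(connector(frozen_multilingual_encoder(tokens)))


connector = Connector(ENCODER_DIM, LLM_DIM)
vector = embed("xin chào thế giới", connector)
print(len(vector))
```

In a sketch like this, training would update only `connector.weight` and `connector.bias`, which matches the abstract's description of a minimal set of trainable parameters bridging the two models.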

Code Implementations: 1 repo
