CLIR · Feb 8, 2024

Multilingual E5 Text Embeddings: A Technical Report

Microsoft
arXiv:2402.05672v1 · 457 citations · h-index: 22 · Has Code
Originality: Synthesis-oriented
AI Analysis

The models provide efficient, high-quality multilingual text embeddings for NLP applications, but the contribution is incremental: it extends the existing English E5 recipe to multilingual data.

The report presents the multilingual E5 text embedding models, trained with contrastive pre-training on 1 billion multilingual text pairs followed by fine-tuning on labeled datasets, along with an instruction-tuned model whose performance is on par with state-of-the-art English-only models of similar size.

This technical report presents the training methodology and evaluation results of the open-source multilingual E5 text embedding models, released in mid-2023. Three embedding models of different sizes (small / base / large) are provided, offering a balance between inference efficiency and embedding quality. The training procedure adheres to the English E5 model recipe, involving contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets. Additionally, we introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art, English-only models of similar sizes. Information regarding the model release can be found at https://github.com/microsoft/unilm/tree/master/e5.
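
The contrastive pre-training step mentioned in the abstract can be illustrated with a small sketch. It assumes an InfoNCE objective with in-batch negatives, which is the usual choice for this kind of text-pair training; the function name `info_nce_loss`, the temperature of 0.05, and the random toy embeddings are illustrative assumptions, not values taken from the report.

```python
import numpy as np

def info_nce_loss(query_emb, passage_emb, temperature=0.05):
    """InfoNCE with in-batch negatives (illustrative, not the report's code).

    Row i of passage_emb is treated as the positive for row i of query_emb;
    every other row in the batch serves as a negative.
    """
    # L2-normalise so the dot product below is cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = passage_emb / np.linalg.norm(passage_emb, axis=1, keepdims=True)

    # (batch, batch) similarity matrix scaled by the temperature.
    logits = q @ p.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy against the diagonal (the true pair for each query).
    return float(-np.mean(np.diag(log_probs)))

# Toy data standing in for encoder outputs (assumption: 4 pairs, 8 dimensions).
rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
aligned = queries + 0.01 * rng.normal(size=(4, 8))  # near-copies: correct pairs
shuffled = aligned[[1, 2, 3, 0]]                     # every pair mismatched

loss_aligned = info_nce_loss(queries, aligned)
loss_shuffled = info_nce_loss(queries, shuffled)
```

With correctly paired rows the loss should be much lower than with mismatched rows, which is the signal that pulls paired texts together in embedding space during pre-training.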

Code Implementations: 1 repo