CL, IR | Feb 8, 2024

Multilingual E5 Text Embeddings: A Technical Report

Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei

Microsoft

arXiv:2402.05672v1 | 36.2 | 487 citations | h-index: 22 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work provides efficient, high-quality multilingual text embeddings for NLP applications, but its contribution is incremental: it extends the existing English E5 training recipe to multilingual data.

The report presents the multilingual E5 text embedding models, trained with contrastive pre-training on 1 billion multilingual text pairs followed by fine-tuning on a combination of labeled datasets. It also introduces an instruction-tuned model whose performance is on par with state-of-the-art English-only models of similar sizes.

This technical report presents the training methodology and evaluation results of the open-source multilingual E5 text embedding models, released in mid-2023. Three embedding models of different sizes (small / base / large) are provided, offering a balance between the inference efficiency and embedding quality. The training procedure adheres to the English E5 model recipe, involving contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets. Additionally, we introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art, English-only models of similar sizes. Information regarding the model release can be found at https://github.com/microsoft/unilm/tree/master/e5 .
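The contrastive pre-training on text pairs mentioned above can be illustrated with an InfoNCE-style loss using in-batch negatives. This is a minimal sketch, not the authors' implementation: the abstract does not specify the loss details, so the function names, the cosine-similarity scoring, and the `temperature` value are all assumptions chosen for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce_loss(queries, passages, temperature=0.05):
    """Mean InfoNCE loss over a batch of embedding pairs.

    passages[i] is treated as the positive for queries[i]; every other
    passage in the batch acts as a negative. The temperature default is
    an illustrative assumption, not a value taken from the report.
    """
    q = [l2_normalize(v) for v in queries]
    p = [l2_normalize(v) for v in passages]
    total = 0.0
    for i, qi in enumerate(q):
        # Temperature-scaled cosine similarity of query i to every passage.
        logits = [sum(a * b for a, b in zip(qi, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over the whole batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the matching passage.
        total += log_denominator - logits[i]
    return total / len(q)
```

With toy embeddings, correctly aligned pairs such as `info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])` give a loss close to zero, while swapping the two passages gives a much larger loss. Training pushes a real encoder toward the first situation.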
