Multilingual E5 Text Embeddings: A Technical Report
This work provides efficient, high-quality multilingual text embeddings for NLP applications, but the contribution is incremental: it extends the existing English E5 recipe to multilingual data.
The report presents multilingual E5 text embedding models, trained using contrastive pre-training on 1 billion multilingual pairs and fine-tuning, with an instruction-tuned model matching state-of-the-art English-only models in performance.
This technical report presents the training methodology and evaluation results of the open-source multilingual E5 text embedding models, released in mid-2023. Three embedding models of different sizes (small / base / large) are provided, offering a balance between inference efficiency and embedding quality. The training procedure follows the English E5 model recipe: contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets. Additionally, we introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art, English-only models of similar size. Information regarding the model release can be found at https://github.com/microsoft/unilm/tree/master/e5 .
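To make the contrastive pre-training step concrete, the following is a minimal sketch of a contrastive objective over text pairs. It assumes an InfoNCE-style loss with in-batch negatives, which is a common choice for this kind of training; the abstract does not specify the exact loss. The function names (`cosine`, `info_nce_loss`), the temperature value, and the toy two-dimensional vectors are all illustrative assumptions, not details taken from the report.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, passages, temperature=0.05):
    """Mean InfoNCE loss over a batch of embedding pairs.

    passages[i] is treated as the positive for queries[i]; every other
    passage in the batch serves as a negative (in-batch negatives).
    The temperature is a hypothetical value chosen for illustration.
    """
    if not queries or len(queries) != len(passages):
        raise ValueError("queries and passages must be non-empty and equal in length")
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, p) / temperature for p in passages]
        # Numerically stable log-sum-exp over all passages in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the true positive.
        total += log_denom - logits[i]
    return total / len(queries)


# Toy embeddings standing in for encoder outputs (assumed, not real model output).
toy_queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_passages = [[1.0, 0.0], [0.0, 1.0]]   # each positive matches its query
shuffled_passages = [[0.0, 1.0], [1.0, 0.0]]  # positives swapped

aligned_loss = info_nce_loss(toy_queries, aligned_passages)
shuffled_loss = info_nce_loss(toy_queries, shuffled_passages)
```

In this sketch the loss is small when each query is closest to its own passage and large when the pairs are mismatched, which is the signal that pulls paired texts together in embedding space during pre-training.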