CL AI IR | Sep 16, 2024

jina-embeddings-v3: Multilingual Embeddings With Task LoRA

Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael Günther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Nan Wang, Han Xiao

arXiv:2409.10173v3 | 26.7165 citations | h-index: 10

Originality: Incremental advance

AI Analysis

The paper addresses the need for high-quality, efficient multilingual embeddings in retrieval and classification tasks, though the contribution appears incremental because it builds on existing methods such as LoRA and Matryoshka Representation Learning.

The paper tackles multilingual and long-context text embedding by introducing jina-embeddings-v3, a 570M-parameter model that achieves state-of-the-art performance on the MTEB benchmark. It outperforms proprietary embedding models from OpenAI and Cohere on English tasks and multilingual-e5-large-instruct on multilingual tasks, supports inputs of up to 8192 tokens, and allows the output dimension to be reduced to as low as 32.
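
The dimension reduction mentioned above can be illustrated with a minimal sketch. With Matryoshka-style embeddings, a shorter vector is typically obtained by keeping the leading components and re-normalizing. The function name `truncate_embedding` and the random stand-in vector are assumptions for illustration; this is not the model's own API.

```python
import math
import random


def truncate_embedding(vec, dim):
    """Keep the first `dim` components of `vec` and rescale to unit length."""
    if not 1 <= dim <= len(vec):
        raise ValueError("dim must be between 1 and the full embedding width")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # nothing to rescale for an all-zero prefix
    return [x / norm for x in head]


# A random vector stands in for real model output; 1024 is the default
# output dimension stated in the abstract.
random.seed(0)
full = [random.gauss(0.0, 1.0) for _ in range(1024)]
small = truncate_embedding(full, 32)
```

Re-normalizing after truncation keeps dot products between truncated vectors interpretable as cosine similarities.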

We introduce jina-embeddings-v3, a novel text embedding model with 570 million parameters that achieves state-of-the-art performance on multilingual data and long-context retrieval tasks, supporting context lengths of up to 8192 tokens. The model includes a set of task-specific Low-Rank Adaptation (LoRA) adapters to generate high-quality embeddings for query-document retrieval, clustering, classification, and text matching. Evaluation on the MTEB benchmark shows that jina-embeddings-v3 outperforms the latest proprietary embeddings from OpenAI and Cohere on English tasks, while achieving superior performance compared to multilingual-e5-large-instruct across all multilingual tasks. With a default output dimension of 1024, users can flexibly reduce the embedding dimensions to as low as 32 without compromising performance, enabled by Matryoshka Representation Learning.
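
As a rough illustration of the task-specific adapters described in the abstract, the sketch below applies the standard LoRA update, W x + (alpha / r) · B (A x), to a single toy linear layer with one (A, B) pair per task. The layer sizes, the task labels, and the helper names are assumptions for illustration; the abstract does not say which layers carry adapters or how they are selected at inference time.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Return W x + (alpha / r) * B (A x), where r is the adapter rank."""
    rank = len(A)
    base = matvec(W, x)               # frozen base projection
    update = matvec(B, matvec(A, x))  # low-rank, task-specific correction
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d_in, d_out, rank = 8, 8, 2  # toy sizes, not the model's real dimensions

W = rand_matrix(d_out, d_in)  # base weight shared by every task

# One adapter per task; these task names are illustrative labels only.
adapters = {
    task: (rand_matrix(rank, d_in), rand_matrix(d_out, rank))
    for task in ("retrieval", "clustering", "classification", "text-matching")
}

x = [random.gauss(0.0, 1.0) for _ in range(d_in)]
outputs = {
    task: lora_forward(W, A, B, x, alpha=1.0)
    for task, (A, B) in adapters.items()
}
```

Because only the small A and B matrices differ per task, each extra task adds rank × (d_in + d_out) parameters in this toy layer instead of a full copy of W.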
