CL AI IR May 8, 2024

Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models

Luke Merrick, Danmei Xu, Gaurav Nuti, Daniel Campos

arXiv:2405.05374v1 19.969 citations h-index: 3 Has Code

Originality: Incremental advance

AI Analysis

This work provides scalable and efficient text embedding models for retrieval tasks, offering open-source alternatives to closed-source solutions.

The authors introduce the Arctic-Embed family of text embedding models. At the time of release, each model achieved state-of-the-art retrieval accuracy for its size on the MTEB Retrieval leaderboard, with the largest model outperforming closed-source competitors such as Cohere's embed-v3 and OpenAI's text-embed-3-large.

This report describes the training dataset creation and recipe behind the arctic-embed family of text embedding models (a set of five models ranging from 22 to 334 million parameters, with weights open-sourced under an Apache-2 license). At the time of their release, each model achieved state-of-the-art retrieval accuracy for models of its size on the MTEB Retrieval leaderboard, with the largest model, arctic-embed-l, outperforming closed-source embedding models such as Cohere's embed-v3 and OpenAI's text-embed-3-large. In addition to the details of our training recipe, we provide several informative ablation studies, which we believe shed light on the causes of our models' performance.
