CL · IR · Dec 7, 2022

Text Embeddings by Weakly-Supervised Contrastive Pre-training

Microsoft
arXiv:2212.03533v2 · 1339 citations · h-index: 102
AI Analysis

E5 offers a scalable approach to tasks such as retrieval and classification, though its contribution is an incremental improvement over existing embedding methods.

The paper addresses the problem of building general-purpose text embeddings. It introduces E5, a model trained with weakly-supervised contrastive pre-training. In the zero-shot setting, E5 outperforms the BM25 baseline on the BEIR retrieval benchmark without using any labeled data; when fine-tuned, it obtains the best results on the MTEB benchmark.

This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. For zero-shot settings, E5 is the first model that outperforms the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40x more parameters.
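To make "trained in a contrastive manner" concrete, here is a minimal sketch of one common contrastive objective: InfoNCE with in-batch negatives over text pairs. The abstract does not specify the exact loss, so this formulation is an assumption. The function names, the toy 2-dimensional vectors, and the temperature value are all hypothetical and chosen only for illustration. It is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, passages, temperature=0.1):
    """Mean InfoNCE loss over a batch of (query, passage) embedding pairs.

    passages[i] is treated as the positive for queries[i]; every other
    passage in the batch serves as a negative. The temperature here is an
    arbitrary illustrative value, not one taken from the paper.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, p) / temperature for p in passages]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the positive passage.
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy embeddings (hypothetical): each query points near its own passage.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [aligned[1], aligned[0]]  # positives deliberately mismatched

loss_aligned = info_nce_loss(queries, aligned)
loss_swapped = info_nce_loss(queries, swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each query is most similar to its own paired passage and grows when the pairing is wrong. That pressure pulls paired texts together in the embedding space, so a single vector per text can then be compared by similarity for retrieval, clustering, or classification.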

Code Implementations: 1 repo