CL LG | Jan 24, 2022

Text and Code Embeddings by Contrastive Pre-Training

Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, Johannes Heidecke, Pranav Shyam

arXiv:2201.10005v1 | 22.4595 citations | h-index: 46

Originality: Incremental advance

AI Analysis

This work addresses the need for versatile and efficient embeddings for applications like semantic search and text similarity, offering a unified approach that outperforms specialized models, though it is incremental in advancing pre-training methods.
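To make the semantic search and text similarity use case concrete, here is a minimal sketch of how embeddings are typically compared: documents are ranked by cosine similarity to a query vector. The vectors below are made-up toy values, not model output, and the helper names `cosine_similarity` and `rank_documents` are hypothetical; the paper's actual retrieval pipeline may differ.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

# Toy 3-dimensional embeddings (illustrative assumption, not real embeddings).
query = [1.0, 0.0, 1.0]
docs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query, different length
    [1.0, 1.0, 0.0],  # partially aligned with the query
]
ranking = rank_documents(query, docs)
```

Because cosine similarity ignores vector length, the second document ranks first even though its norm differs from the query's.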

The paper generates high-quality text and code embeddings through contrastive pre-training on unsupervised data at scale. It reports state-of-the-art results: relative improvements of 4% and 1.8% over the previous best unsupervised and supervised text embedding models in linear-probe classification (averaged over 7 tasks), and up to 23.4% over the previous best unsupervised methods on semantic search benchmarks.

Text embeddings are useful features in many applications such as semantic search and computing text similarity. Previous work typically trains models customized for different use cases, varying in dataset choice, training objective and model architecture. In this work, we show that contrastive pre-training on unsupervised data at scale leads to high-quality vector representations of text and code. The same unsupervised text embeddings that achieve new state-of-the-art results in linear-probe classification also display impressive semantic search capabilities and sometimes even perform competitively with fine-tuned models. On linear-probe classification accuracy averaged over 7 tasks, our best unsupervised model achieves relative improvements of 4% and 1.8% over the previous best unsupervised and supervised text embedding models, respectively. The same text embeddings, when evaluated on large-scale semantic search, attain relative improvements of 23.4%, 14.7%, and 10.6% over the previous best unsupervised methods on the MSMARCO, Natural Questions and TriviaQA benchmarks, respectively. As with text embeddings, we train code embedding models on (text, code) pairs, obtaining a 20.8% relative improvement over the prior best work on code search.
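The abstract does not spell out the training objective, so the following is only a sketch of one common form of contrastive loss over paired data: a symmetric cross-entropy over cosine-similarity logits, where the other items in the batch serve as negatives. The function names, the `temperature` value, and the toy vectors are assumptions made for illustration and should not be read as the authors' exact method.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def cross_entropy_row(logits, target):
    """Negative log-softmax of logits[target], computed in a numerically stable way."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def contrastive_loss(x_embs, y_embs, temperature=0.1):
    """Symmetric in-batch contrastive loss; pair i is (x_embs[i], y_embs[i]).

    temperature=0.1 is an arbitrary illustrative value, not taken from the paper.
    """
    xs = [normalize(v) for v in x_embs]
    ys = [normalize(v) for v in y_embs]
    n = len(xs)
    # logits[i][j] is the scaled similarity between x_i and y_j.
    logits = [[dot(xs[i], ys[j]) / temperature for j in range(n)] for i in range(n)]
    # For each x_i, the matching y_i should score highest among all y_j (rows).
    loss_xy = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # For each y_j, the matching x_j should score highest among all x_i (columns).
    loss_yx = sum(
        cross_entropy_row([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (loss_xy + loss_yx)

# Toy batch of two pairs (made-up vectors).
x_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = contrastive_loss(x_batch, [[1.0, 0.0], [0.0, 1.0]])
mismatched_loss = contrastive_loss(x_batch, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is near zero when each pair's embeddings point the same way and large when the pairs are swapped, which is the signal that pulls paired (text, text) or (text, code) embeddings together during training.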

