CL · AI · Feb 2, 2024

Nomic Embed: Training a Reproducible Long Context Text Embedder

arXiv:2402.01613v2 · 307 citations · h-index: 8 · Has Code · Trans. Mach. Learn. Res.
Originality: Incremental advance
AI Analysis

This work provides an open-source, reproducible model for text embedding tasks. It addresses a limitation of existing open-source models, which do not release their full training data and code, though its performance gains are incremental.

The authors set out to build a reproducible long-context text embedding model. The result, nomic-embed-text-v1, outperforms OpenAI Ada-002 and OpenAI text-embedding-3-small on both the short-context MTEB benchmark and the long-context LoCo benchmark.

This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on the short-context MTEB benchmark and the long context LoCo benchmark. We release the training code and model weights under an Apache 2.0 license. In contrast with other open-source models, we release the full curated training data and code that allows for full replication of nomic-embed-text-v1. You can find code and data to replicate the model at https://github.com/nomic-ai/contrastors.

Code Implementations: 1 repo
