CL Feb 17

jina-embeddings-v5-text: Task-Targeted Embedding Distillation

Mohammad Kalim Akram, Saba Sturua, Nastia Havriushenko, Quentin Herreros, Michael Günther, Maximilian Werk, Han Xiao

arXiv:2602.15547v15.710 citations, h-index: 14

Originality: Incremental advance

AI Analysis

This work addresses the need for efficient, high-performance embedding models in applications such as information retrieval and clustering. It is an incremental improvement in training methods for small models.

The paper tackles the problem of training compact text embedding models for semantic similarity tasks. It introduces a novel training regimen that combines model distillation with a task-specific contrastive loss. The resulting models, jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano, match or exceed state-of-the-art performance for models of similar size, support long texts (up to 32k tokens), and produce embeddings that remain robust under truncation and binary quantization.
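The idea of combining distillation with a contrastive objective can be illustrated with a minimal sketch. The listing does not give the paper's actual loss, so everything here is an assumption for illustration: the in-batch InfoNCE form, the cosine-distance distillation term, the weight `alpha`, the temperature, and all function names. The sketch also assumes student and teacher embeddings share a dimension; otherwise a projection step would be needed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(queries, docs, temperature=0.05):
    """In-batch contrastive loss: docs[i] is the positive for queries[i],
    and every other document in the batch acts as a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, d) / temperature for d in docs]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(queries)


def distillation_loss(student, teacher):
    """Mean cosine distance between student and teacher embeddings
    (hypothetical choice; assumes equal dimensionality)."""
    pairs = list(zip(student, teacher))
    return sum(1.0 - cosine(s, t) for s, t in pairs) / len(pairs)


def combined_loss(student_q, student_d, teacher_q, teacher_d, alpha=0.5):
    """Weighted sum of the distillation and contrastive terms.
    alpha is an illustrative hyperparameter, not a value from the paper."""
    contrastive = info_nce(student_q, student_d)
    distill = 0.5 * (
        distillation_loss(student_q, teacher_q)
        + distillation_loss(student_d, teacher_d)
    )
    return alpha * distill + (1.0 - alpha) * contrastive
```

With `alpha=1.0` the sketch reduces to pure distillation and with `alpha=0.0` to pure contrastive training, which are the two baselines the abstract compares the combined regimen against.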

Text embedding models are widely used for semantic similarity tasks, including information retrieval, clustering, and classification. General-purpose models are typically trained with single- or multi-stage processes using contrastive loss functions. We introduce a novel training regimen that combines model distillation techniques with task-specific contrastive loss to produce compact, high-performance embedding models. Our findings suggest that this approach is more effective for training small models than purely contrastive or distillation-based training paradigms alone. Benchmark scores for the resulting models, jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano, exceed or match the state-of-the-art for models of similar size. jina-embeddings-v5-text models additionally support long texts (up to 32k tokens) in many languages, and generate embeddings that remain robust under truncation and binary quantization. Model weights are publicly available, hopefully inspiring further advances in embedding model development.
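The robustness claim can be made concrete with a small sketch of the two post-processing steps. It reads "truncation" as keeping only the leading dimensions of an embedding and "binary quantization" as keeping one sign bit per dimension. Both readings are assumptions, as are the function names, and the example vectors are made up rather than produced by the models.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale to unit length.
    Assumes the kept components are not all zero."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


def binarize(vec):
    """One bit per dimension: 1 for positive components, 0 otherwise."""
    return [1 if x > 0 else 0 for x in vec]


def hamming_similarity(bits_a, bits_b):
    """Fraction of positions where two equal-length bit vectors agree."""
    matches = sum(1 for a, b in zip(bits_a, bits_b) if a == b)
    return matches / len(bits_a)


# Made-up 4-dimensional embeddings for illustration only.
emb_a = [0.5, -0.2, 0.1, 0.9]
emb_b = [0.4, 0.3, 0.2, 0.8]

short_a = truncate_and_normalize(emb_a, 2)
bits_a = binarize(emb_a)
bits_b = binarize(emb_b)
similarity = hamming_similarity(bits_a, bits_b)
```

An embedding model is robust in this sense if rankings computed from the shortened or binarized vectors stay close to rankings computed from the full-precision vectors.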
