CL · Aug 28, 2024

Conan-embedding: General Text Embedding with More and Better Negative Samples

arXiv:2408.15710v2 · 25 citations · h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses the need for better embedding models in retrieval-augmented generation (RAG) systems, offering incremental improvements through enhanced negative sampling techniques.

The paper improves text embedding models by using more and higher-quality negative samples during training. The resulting model ranks first on the Chinese leaderboard of the Massive Text Embedding Benchmark.

With the growing popularity of RAG, the capabilities of embedding models are gaining increasing attention. Embedding models are primarily trained with a contrastive loss, and negative examples are a key component of that training. Previous work has proposed various hard negative mining strategies, but these strategies are typically employed as preprocessing steps. In this paper, we propose the conan-embedding model, which maximizes the utilization of more and higher-quality negative examples. First, since the model's ability to handle preprocessed negative examples evolves during training, we propose a dynamic hard negative mining method to expose the model to more challenging negative examples throughout the training process. Second, contrastive learning requires as many negative examples as possible but is limited by GPU memory constraints. Therefore, we use a Cross-GPU Balancing Loss to provide more negative examples for embedding training and to balance the batch size across multiple tasks. Moreover, we discovered that prompt-response pairs from LLMs can be used for embedding training. Our approach effectively enhances the capabilities of embedding models, currently ranking first on the Chinese leaderboard of the Massive Text Embedding Benchmark.
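To make the role of negative examples concrete, here is a minimal sketch of a contrastive (InfoNCE-style) loss for a single query, plus a simple rule for refreshing negatives that have stopped being hard. It is an illustration, not the paper's implementation. The function names, the temperature of 0.05, and the `min_score` threshold are all assumptions, and the plain Python lists stand in for embeddings produced by the current model checkpoint. The Cross-GPU Balancing Loss is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE loss for one query: negative log-softmax of the positive's score.

    The temperature value is an assumed default, not taken from the paper.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - logits[0]


def refresh_hard_negatives(query, negatives, pool, min_score=0.3):
    """Hypothetical refresh rule: keep negatives that still score at least
    min_score against the query, and refill the freed slots with the
    highest-scoring unused candidates from the pool.

    The pool is assumed to contain only non-relevant documents.
    """
    kept = [n for n in negatives if cosine(query, n) >= min_score]
    candidates = sorted(
        (c for c in pool if c not in kept),
        key=lambda c: cosine(query, c),
        reverse=True,
    )
    needed = len(negatives) - len(kept)
    return kept + candidates[:needed]


if __name__ == "__main__":
    q, pos = [1.0, 0.0], [1.0, 0.0]
    easy, hard = [-1.0, 0.0], [0.9, 0.1]
    # A harder negative yields a larger loss, so it carries more training signal.
    print(info_nce(q, pos, [easy]), info_nce(q, pos, [hard]))
```

Calling `refresh_hard_negatives` periodically during training, with similarities recomputed by the current model, is one way to avoid mining negatives only once as a preprocessing step. How often to refresh and what threshold to use are tuning choices this sketch leaves open.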

