CL | AI | Aug 22, 2024

The Russian-focused embedders' exploration: ruMTEB benchmark and Russian embedding model design

arXiv:2408.12503v2 | 22 citations | h-index: 12 | Has Code
AI Analysis

This work addresses the problem of evaluating and developing embedding models for the Russian language, providing tools for researchers and practitioners, though it is incremental as it adapts existing frameworks.

The paper tackles the lack of dedicated benchmarks and models for Russian text embeddings by introducing ruMTEB, a Russian version of the MTEB benchmark with seven task categories, and a new model ru-en-RoSBERTa, which achieves results on par with state-of-the-art models in Russian.

Embedding models play a crucial role in Natural Language Processing (NLP) by creating text embeddings used in various tasks such as information retrieval and assessing semantic text similarity. This paper focuses on research related to embedding models in the Russian language. It introduces a new Russian-focused embedding model called ru-en-RoSBERTa and the ruMTEB benchmark, the Russian version extending the Massive Text Embedding Benchmark (MTEB). Our benchmark includes seven categories of tasks, such as semantic textual similarity, text classification, reranking, and retrieval. The research also assesses a representative set of Russian and multilingual models on the proposed benchmark. The findings indicate that the new model achieves results that are on par with state-of-the-art models in Russian. We release the model ru-en-RoSBERTa, and the ruMTEB framework comes with open-source code, integration into the original framework and a public leaderboard.
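As a rough illustration of how text embeddings are used for the retrieval and semantic similarity tasks mentioned above, the sketch below ranks documents by cosine similarity to a query vector. It is not taken from the paper or from the ruMTEB code. The function names (`cosine_similarity`, `rank_documents`) and the three-dimensional toy vectors are assumptions made for this example. In practice the vectors would come from an embedding model such as ru-en-RoSBERTa and would have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar to the query."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; real ones would be produced by an embedding model.
toy_query = [1.0, 0.0, 1.0]
toy_docs = {
    "doc_a": [1.0, 0.0, 1.0],  # same direction as the query
    "doc_b": [0.0, 1.0, 0.0],  # orthogonal to the query
    "doc_c": [1.0, 1.0, 0.0],  # partially overlapping
}

ranking = rank_documents(toy_query, toy_docs)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

The same similarity score, computed between two sentence embeddings, is the usual basis for a semantic textual similarity prediction. Retrieval applies it between one query and many candidate documents.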

