CL IR LG Feb 17, 2025

FaMTEB: Massive Text Embedding Benchmark in Persian Language

Erfan Zinvandi, Morteza Alikhani, Mehran Sarmadi, Zahra Pourbahman, Sepehr Arvin, Reza Kazemi, Arash Amini

arXiv:2502.11571v215.512 citations h-index: 28 Has Code EMNLP

Originality Synthesis-oriented

AI Analysis

This work addresses the need for standardized evaluation tools for Persian language models, particularly in applications such as chatbots and Retrieval-Augmented Generation systems. It is incremental, however, because it builds on the existing MTEB framework.

The authors tackled the lack of a comprehensive benchmark for Persian text embeddings by introducing FaMTEB, which includes 63 datasets across seven tasks, such as classification and retrieval, and evaluated several embedding models. They contributed new datasets, including chatbot evaluation and summary retrieval tasks, and provided an open-source benchmark with a public leaderboard.

In this paper, we introduce a comprehensive benchmark for Persian (Farsi) text embeddings, built upon the Massive Text Embedding Benchmark (MTEB). Our benchmark includes 63 datasets spanning seven different tasks: classification, clustering, pair classification, reranking, retrieval, summary retrieval, and semantic textual similarity. The datasets combine existing, translated, and newly generated data, offering a diverse evaluation framework for Persian language models. Given the increasing use of text embedding models in chatbots, evaluation datasets are becoming essential components of chatbot challenges and Retrieval-Augmented Generation systems. As a contribution, we include chatbot evaluation datasets in the MTEB benchmark for the first time. We also introduce the new task of summary retrieval, which is not among the tasks in standard MTEB. Another contribution is a substantial number of new Persian NLP datasets suitable for training and evaluation, some of which have no previous counterparts in Persian. We evaluate the performance of several Persian and multilingual embedding models across a range of tasks. This work introduces an open-source benchmark with datasets, code, and a public leaderboard.
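To make the summary retrieval task more concrete, the sketch below shows one way such a task could be scored: each text embedding is compared against every summary embedding, and a hit is counted when the most similar summary is the one paired with that text. This is an illustration only. The cosine similarity measure, the top-1 accuracy metric, the function names, and the toy vectors are all assumptions, not the scoring procedure or data used in FaMTEB.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top1_accuracy(text_vecs, summary_vecs):
    """Fraction of texts whose most similar summary sits at the same index.

    Assumes text_vecs[i] and summary_vecs[i] form a matching pair.
    """
    hits = 0
    for i, text_vec in enumerate(text_vecs):
        scores = [cosine(text_vec, s) for s in summary_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == i)
    return hits / len(text_vecs)


# Hypothetical toy embeddings standing in for real model outputs.
text_vecs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
summary_vecs = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.1], [0.7, 0.1, 0.6]]

print(top1_accuracy(text_vecs, summary_vecs))
```

In practice the vectors would come from an embedding model applied to Persian texts and their summaries, and a benchmark would typically report ranking metrics over a much larger candidate pool.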
