IR | AI | Sep 24, 2024

IRSC: A Zero-shot Evaluation Benchmark for Information Retrieval through Semantic Comprehension in Retrieval-Augmented Generation Scenarios

Hai Lin, Shaoxiong Zhan, Junyou Su, Haitao Zheng, Hui Wang

arXiv:2409.15763v25.55 citations | h-index: 3 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of benchmarking embedding models for researchers and developers in RAG scenarios, but it is incremental as it builds on existing evaluation frameworks by adding new tasks and metrics.

The paper tackles the lack of comprehensive evaluation methods for embedding models in multilingual Retrieval-Augmented Generation (RAG) tasks by introducing the IRSC benchmark, which includes five retrieval tasks and new metrics like SSCI and RCCI, and evaluates models such as Snowflake-Arctic and BGE to provide insights into cross-lingual limitations.

In Retrieval-Augmented Generation (RAG) tasks using Large Language Models (LLMs), the quality of retrieved information is critical to the final output. This paper introduces the IRSC benchmark for evaluating the performance of embedding models in multilingual RAG tasks. The benchmark encompasses five retrieval tasks: query retrieval, title retrieval, part-of-paragraph retrieval, keyword retrieval, and summary retrieval. Our research addresses the current lack of comprehensive testing and effective comparison methods for embedding models in RAG scenarios. We introduced new metrics: the Similarity of Semantic Comprehension Index (SSCI) and the Retrieval Capability Contest Index (RCCI), and evaluated models such as Snowflake-Arctic, BGE, GTE, and M3E. Our contributions include: 1) the IRSC benchmark, 2) the SSCI and RCCI metrics, and 3) insights into the cross-lingual limitations of embedding models. The IRSC benchmark aims to enhance the understanding and development of accurate retrieval systems in RAG tasks. All code and datasets are available at: https://github.com/Jasaxion/IRSC_Benchmark
