CL · Jun 2, 2025

IndicRAGSuite: Large-Scale Datasets and a Benchmark for Indian Language RAG Systems

Microsoft
arXiv:2506.01615v2 · h-index: 31 · Has Code
Originality: Synthesis-oriented
AI Analysis

This paper addresses the limited RAG capabilities available to Indian language users by providing essential datasets and benchmarks, though the contribution is incremental in that it adapts existing methods to new data.

The paper tackles the lack of resources for Retrieval-Augmented Generation (RAG) systems in Indian languages. It contributes IndicMSMarco, a benchmark of 1000 queries covering 13 languages, and a large-scale training dataset derived from the Wikipedias of 19 Indian languages, supporting both evaluation and development.

Retrieval-Augmented Generation (RAG) systems enable language models to access relevant information and generate accurate, well-grounded, and contextually informed responses. However, for Indian languages, the development of high-quality RAG systems is hindered by the lack of two critical resources: (1) evaluation benchmarks for retrieval and generation tasks, and (2) large-scale training datasets for multilingual retrieval. Most existing benchmarks and datasets are centered around English or high-resource languages, making it difficult to extend RAG capabilities to the diverse linguistic landscape of India. To address the lack of evaluation benchmarks, we create IndicMSMarco, a multilingual benchmark for evaluating retrieval quality and response generation in 13 Indian languages, built via manual translation of 1000 diverse queries from the MS MARCO dev set. To address the need for training data, we build a large-scale dataset of (question, answer, relevant passage) tuples derived from the Wikipedias of 19 Indian languages using state-of-the-art LLMs. Additionally, we include translated versions of the original MS MARCO dataset to further enrich the training data and ensure alignment with real-world information-seeking tasks. Resources are available here: https://huggingface.co/collections/ai4bharat/indicragsuite-683e7273cb2337208c8c0fcb
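
To make the (question, answer, relevant passage) tuples and the idea of "retrieval quality" concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's code: the `RagExample` field names, the toy query and passage IDs, and the choice of MRR@10 (the metric conventionally reported for MS MARCO passage ranking) are all assumptions, since the abstract does not specify a schema or a metric.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RagExample:
    """One (question, answer, relevant passage) tuple.

    Field names are hypothetical; the released datasets may use a different schema.
    """
    language: str
    question: str
    answer: str
    passage_id: str
    passage: str


def mrr_at_k(ranked: Dict[str, List[str]], relevant: Dict[str, str], k: int = 10) -> float:
    """Mean reciprocal rank of the gold passage within each query's top-k results.

    ranked maps a query ID to the passage IDs a retriever returned, best first.
    relevant maps a query ID to its single gold passage ID.
    """
    total = 0.0
    for qid, gold in relevant.items():
        top_k = ranked.get(qid, [])[:k]
        if gold in top_k:
            # Reciprocal of the 1-based rank of the gold passage.
            total += 1.0 / (top_k.index(gold) + 1)
    return total / len(relevant) if relevant else 0.0


# Toy data standing in for two benchmark queries (placeholder text, not real data).
examples = [
    RagExample("hi", "q1 text", "a1 text", "p1", "passage one"),
    RagExample("ta", "q2 text", "a2 text", "p2", "passage two"),
]
relevant = {f"q{i}": ex.passage_id for i, ex in enumerate(examples, start=1)}

# Hypothetical retriever output: gold passage ranked 1st for q1 and 2nd for q2.
ranked = {"q1": ["p1", "p9"], "q2": ["p7", "p2"]}
score = mrr_at_k(ranked, relevant)
```

With these toy rankings the reciprocal ranks are 1 and 1/2, so `score` is 0.75. Computing the score separately per language would give one retrieval number for each of the benchmark's 13 languages.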
