BhashaBench V1: A Comprehensive Benchmark for the Quadrant of Indic Domains
BhashaBench V1 addresses the problem of Anglocentric, domain-agnostic benchmarks faced by researchers and developers working on Indic language models. The contribution is incremental, however, as it introduces a new dataset rather than a novel method.
The authors tackle the lack of domain- and culture-specific evaluation benchmarks for large language models in India-centric contexts by introducing BhashaBench V1, a bilingual benchmark with 74,166 question-answer pairs across four Indic domains. Their evaluation reveals significant performance gaps, such as GPT-4o achieving 76.49% accuracy in Legal but only 59.74% in Ayurveda.
The rapid advancement of large language models (LLMs) has intensified the need for domain- and culture-specific evaluation. Existing benchmarks are largely Anglocentric and domain-agnostic, limiting their applicability to India-centric contexts. To address this gap, we introduce BhashaBench V1, the first domain-specific, multi-task, bilingual benchmark focusing on critical Indic knowledge systems. BhashaBench V1 contains 74,166 meticulously curated question-answer pairs, with 52,494 in English and 21,672 in Hindi, sourced from authentic government and domain-specific exams. It spans four major domains: Agriculture, Legal, Finance, and Ayurveda, comprising 90+ subdomains and covering 500+ topics, enabling fine-grained evaluation. Evaluation of 29+ LLMs reveals significant domain- and language-specific performance gaps, with especially large disparities in low-resource domains. For instance, GPT-4o achieves 76.49% overall accuracy in Legal but only 59.74% in Ayurveda. Models consistently perform better on English content than on Hindi content across all domains. Subdomain-level analysis shows that models perform relatively well in areas such as Cyber Law and International Finance, while Panchakarma, Seed Science, and Human Rights remain notably weak. BhashaBench V1 provides a comprehensive dataset for evaluating large language models across India's diverse knowledge domains. It enables assessment of models' ability to integrate domain-specific knowledge with bilingual understanding. All code, benchmarks, and resources are publicly available to support open research.
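To make the fine-grained evaluation described above concrete, the following is a minimal sketch of how accuracy could be broken down by domain and by language. It is not the authors' evaluation code. The record fields (`domain`, `language`, `answer`, `prediction`), the helper `accuracy_by_group`, the exact-match scoring rule, and the toy records are all assumptions made for illustration; none of the values are benchmark data.

```python
from collections import defaultdict


def accuracy_by_group(records, keys):
    """Return {group_tuple: accuracy} using exact-match scoring.

    `records` is a list of dicts; `keys` names the fields to group by.
    The field names are hypothetical, not the benchmark's actual schema.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = tuple(rec[k] for k in keys)
        total[group] += 1
        # Assumption: a prediction counts as correct only if it equals the gold answer.
        if rec["prediction"] == rec["answer"]:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Toy records invented for illustration only.
records = [
    {"domain": "Legal", "language": "English", "answer": "A", "prediction": "A"},
    {"domain": "Legal", "language": "Hindi", "answer": "B", "prediction": "C"},
    {"domain": "Ayurveda", "language": "English", "answer": "D", "prediction": "D"},
    {"domain": "Ayurveda", "language": "Hindi", "answer": "A", "prediction": "B"},
    {"domain": "Ayurveda", "language": "Hindi", "answer": "C", "prediction": "C"},
]

# Per-domain accuracy, then the finer domain-by-language breakdown.
by_domain = accuracy_by_group(records, ["domain"])
by_domain_lang = accuracy_by_group(records, ["domain", "language"])

for group, acc in sorted(by_domain_lang.items()):
    print(group, f"{acc:.2%}")
```

Grouping by an arbitrary list of keys means the same helper could, under the same assumptions, be reused for subdomain-level or topic-level breakdowns by passing additional field names.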