CL · AI · Oct 29, 2025

BhashaBench V1: A Comprehensive Benchmark for the Quadrant of Indic Domains

arXiv:2510.25409v2 · 3 citations · h-index: 6
Originality: Synthesis-oriented
AI Analysis

This paper addresses the problem of Anglocentric, domain-agnostic benchmarks for researchers and developers working on Indic language models. The contribution is incremental: it introduces a new dataset rather than a novel method.

The authors address the lack of domain- and culture-specific evaluation benchmarks for large language models in India-centric contexts by introducing BhashaBench V1, a bilingual benchmark with 74,166 question-answer pairs across four Indic domains. Their evaluation reveals significant performance gaps: GPT-4o, for example, achieves 76.49% accuracy in Legal but only 59.74% in Ayurveda.

The rapid advancement of large language models (LLMs) has intensified the need for domain- and culture-specific evaluation. Existing benchmarks are largely Anglocentric and domain-agnostic, limiting their applicability to India-centric contexts. To address this gap, we introduce BhashaBench V1, the first domain-specific, multi-task, bilingual benchmark focusing on critical Indic knowledge systems. BhashaBench V1 contains 74,166 meticulously curated question-answer pairs, with 52,494 in English and 21,672 in Hindi, sourced from authentic government and domain-specific exams. It spans four major domains (Agriculture, Legal, Finance, and Ayurveda), comprising 90+ subdomains and covering 500+ topics, which enables fine-grained evaluation. Evaluation of 29+ LLMs reveals significant domain- and language-specific performance gaps, with especially large disparities in low-resource domains. For instance, GPT-4o achieves 76.49% overall accuracy in Legal but only 59.74% in Ayurveda. Models consistently perform better on English content than on Hindi content across all domains. Subdomain-level analysis shows that models perform relatively well in areas such as Cyber Law and International Finance, while Panchakarma, Seed Science, and Human Rights remain notably weak. BhashaBench V1 provides a comprehensive dataset for evaluating large language models across India's diverse knowledge domains. It enables assessment of models' ability to integrate domain-specific knowledge with bilingual understanding. All code, benchmarks, and resources are publicly available to support open research.
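The domain- and language-level breakdown described above could be computed along the following lines. This is only a minimal sketch, not the authors' evaluation code: the record fields (`domain`, `language`, `answer`, `prediction`), the exact-match scoring rule, and the toy records are all assumptions made for illustration. Only the question counts and the two GPT-4o accuracies are taken from the abstract.

```python
from collections import defaultdict


def accuracy_by_group(records, keys=("domain", "language")):
    """Return {group_tuple: accuracy in percent} using exact-match scoring.

    The field names are hypothetical; the benchmark's real schema may differ.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = tuple(record[k] for k in keys)
        total[group] += 1
        if record["prediction"] == record["answer"]:
            correct[group] += 1
    return {group: 100.0 * correct[group] / total[group] for group in total}


# Toy records invented for illustration only (not BhashaBench data).
toy_records = [
    {"domain": "Legal", "language": "English", "answer": "A", "prediction": "A"},
    {"domain": "Legal", "language": "English", "answer": "B", "prediction": "B"},
    {"domain": "Legal", "language": "Hindi", "answer": "C", "prediction": "A"},
    {"domain": "Ayurveda", "language": "Hindi", "answer": "D", "prediction": "D"},
    {"domain": "Ayurveda", "language": "Hindi", "answer": "A", "prediction": "B"},
]
scores = accuracy_by_group(toy_records)

# Figures quoted in the abstract: the language split sums to the stated total.
english_count, hindi_count = 52_494, 21_672
total_count = english_count + hindi_count

# Gap between the two GPT-4o accuracies quoted in the abstract,
# in percentage points.
legal_vs_ayurveda_gap = round(76.49 - 59.74, 2)
```

Passing `keys=("domain",)` would give the domain-only view used in the Legal versus Ayurveda comparison, while adding a hypothetical `subdomain` field to the records would support the subdomain-level analysis in the same way.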

