CL | LG | Aug 27, 2025

Benchmarking Hindi LLMs: A New Suite of Datasets and a Comparative Analysis

arXiv:2508.19831v2 | 2 citations | h-index: 3 | Has Code
Originality: Synthesis-oriented
AI Analysis

The paper addresses the lack of high-quality benchmarks for Hindi LLMs, a gap that affects researchers and developers working with low-resource languages. The contribution is incremental, since it adapts existing benchmark concepts rather than introducing new ones.

The authors tackle the challenge of evaluating Hindi instruction-tuned LLMs by creating a suite of five Hindi evaluation datasets, built with a combination of human annotation and a translate-and-verify process. They then use the suite to benchmark open-source LLMs that support Hindi and compare their capabilities.

Evaluating instruction-tuned Large Language Models (LLMs) in Hindi is challenging due to a lack of high-quality benchmarks, as direct translation of English datasets fails to capture crucial linguistic and cultural nuances. To address this, we introduce a suite of five Hindi LLM evaluation datasets: IFEval-Hi, MT-Bench-Hi, GSM8K-Hi, ChatRAG-Hi, and BFCL-Hi. These were created using a methodology that combines from-scratch human annotation with a translate-and-verify process. We leverage this suite to conduct an extensive benchmarking of open-source LLMs supporting Hindi, providing a detailed comparative analysis of their current capabilities. Our curation process also serves as a replicable methodology for developing benchmarks in other low-resource languages.
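
As a rough illustration of what a translate-and-verify loop could look like, the sketch below translates English source items and routes each draft either to an accepted set or to a queue for human review. This is not the authors' pipeline: the names `Item`, `translate_and_verify`, `toy_translate`, and `numbers_preserved` are hypothetical, the translator is a lookup table standing in for a real translation step, and the automatic number check is only one plausible sanity filter (relevant to math word problems such as those in GSM8K-Hi) that would sit in front of, not replace, human verification.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Item:
    source_en: str          # original English text
    draft_hi: str = ""      # candidate Hindi translation
    verified: bool = False  # whether the draft passed the check


def numbers_preserved(source: str, draft: str) -> bool:
    """Hypothetical check: both texts contain the same multiset of integers.

    int() accepts Devanagari digits as well, so '३' and '3' compare equal.
    """
    def extract(text: str) -> List[int]:
        return sorted(int(n) for n in re.findall(r"\d+", text))
    return extract(source) == extract(draft)


def translate_and_verify(
    sources: List[str],
    translate: Callable[[str], str],
    verify: Callable[[str, str], bool],
) -> Tuple[List[Item], List[Item]]:
    """Translate each source; split results into accepted and needs-review."""
    accepted: List[Item] = []
    needs_review: List[Item] = []
    for text in sources:
        item = Item(source_en=text, draft_hi=translate(text))
        item.verified = verify(item.source_en, item.draft_hi)
        (accepted if item.verified else needs_review).append(item)
    return accepted, needs_review


# Toy stand-in for a translation step (assumption: a fixed lookup table).
# The second entry deliberately corrupts a number (60 became 6).
TOY_TRANSLATIONS = {
    "Ravi has 3 apples and buys 4 more.":
        "रवि के पास 3 सेब हैं और वह 4 और खरीदता है।",
    "A train travels 60 km in 2 hours.":
        "एक ट्रेन 2 घंटे में 6 किमी चलती है।",
}


def toy_translate(text: str) -> str:
    return TOY_TRANSLATIONS.get(text, "")


accepted, needs_review = translate_and_verify(
    list(TOY_TRANSLATIONS), toy_translate, numbers_preserved
)
```

In this toy run the first item passes the number check and the second is flagged for review because its distance was altered in translation. In practice the flagged items, and ideally a sample of the accepted ones, would go to human annotators, since a numeric check says nothing about fluency or cultural fit.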

