CL | AI | Oct 8, 2025

Pragyaan: Designing and Curating High-Quality Cultural Post-Training Datasets for Indian Languages

arXiv:2510.07000v1 | 3 citations | h-index: 3 | Has Code | Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)
Originality: Incremental advance
AI Analysis

This paper addresses the lack of culturally relevant, diverse training data for Indian language LLMs, though its contribution is an incremental improvement in dataset curation methodology.

The authors tackled the problem of insufficient high-quality post-training data for Indian language LLMs by creating two culturally-grounded datasets (Pragyaan-IT with 22.5K examples and Pragyaan-Align with 100K examples) across 10 languages using a human-in-the-loop pipeline that combines translations with synthetic expansion.

The effectiveness of Large Language Models (LLMs) depends heavily on the availability of high-quality post-training data, particularly instruction-tuning and preference-based examples. Existing open-source datasets, however, often lack multilingual coverage and cultural grounding, and suffer from task diversity gaps that are especially pronounced for Indian languages. We introduce a human-in-the-loop pipeline that combines translations with synthetic expansion to produce reliable and diverse Indic post-training data. Using this pipeline, we curate two datasets: Pragyaan-IT (22.5K) and Pragyaan-Align (100K) across 10 Indian languages covering 13 broad categories and 56 sub-categories, leveraging 57 diverse datasets. Our dataset protocol incorporates several often-overlooked dimensions and emphasizes task diversity, multi-turn dialogue, instruction fidelity, safety alignment, and preservation of cultural nuance, providing a foundation for more inclusive and effective multilingual LLMs.

