TharuChat: Bootstrapping Large Language Models for a Low-Resource Language via Synthetic Data and Human Validation

arXiv:2603.17220


AI Analysis

The work addresses the digital divide facing indigenous language speakers in the Global South by enabling AI access for Tharu. The advance is incremental because it builds on existing methods for low-resource languages.

The paper tackles the exclusion of the low-resource Tharu language from AI by developing Tharu-LLaMA, a specialized LLM trained on synthetic data from a bootstrapping pipeline. In its ablation study, perplexity fell from 6.42 to 2.88 as dataset volume increased from 25% to 100%.

The rapid proliferation of Large Language Models (LLMs) has created a profound digital divide, effectively excluding indigenous languages of the Global South from the AI revolution. The Tharu language, an Indo-Aryan vernacular spoken by approximately 1.7 million people across the Terai belt of Nepal and India, exemplifies this crisis. Despite a rich oral tradition, Tharu suffers from severe data scarcity and linguistic fragmentation, causing state-of-the-art multilingual models to routinely "hallucinate" or default to dominant high-resource neighbors like Hindi and Nepali due to contamination in pre-training corpora. This paper presents Tharu-LLaMA (3B), a specialized instruction-following model designed to address this exclusion. We introduce TharuChat, a novel dataset constructed via an LLM-to-Human bootstrapping pipeline. We used prompt-engineered Gemini models, fed with Rana Tharu grammar and folklore, to synthesize training data. Unlike curated gold-standard corpora, TharuChat reflects the noisy, heterogeneous linguistic reality of the region: it is predominantly anchored in Rana Tharu (~70%) while integrating elements of the Dangaura and Kochila dialects. We provide a transparent analysis of the dataset's limitations, including dialectal code-mixing and residual Awadhi/Hindi influence. Through a rigorous empirical ablation study, we demonstrate that, despite these imperfections, small-scale synthetic data is highly effective: increasing the dataset volume from 25% to 100% results in a linear reduction in perplexity from 6.42 to 2.88. The resulting model serves as a proof of concept for the preservation of under-resourced Himalayan languages via generative AI, achievable on consumer-grade hardware.
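To make the reported perplexity figures easier to interpret, the following is a minimal sketch of the standard definition of perplexity as the exponential of the mean per-token negative log-likelihood. It is not the authors' evaluation code: the abstract does not describe the tokenizer, the evaluation set, or the log base, so the natural logarithm is assumed here, and the function name `perplexity` and the sample log-probabilities are hypothetical. Only the two endpoint values, 6.42 and 2.88, come from the abstract.

```python
import math


def perplexity(token_log_probs):
    """Perplexity as exp(mean negative log-likelihood per token).

    Assumes natural-log probabilities; the abstract does not state the base.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical per-token log-probabilities, for illustration only.
example_log_probs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
ppl = perplexity(example_log_probs)

# Endpoint perplexities reported in the abstract (25% and 100% of the dataset).
ppl_25, ppl_100 = 6.42, 2.88

# Relative reduction in perplexity between the two endpoints.
relative_drop = (ppl_25 - ppl_100) / ppl_25

# Equivalent improvement in mean negative log-likelihood (nats per token),
# under the natural-log assumption above.
nll_gain = math.log(ppl_25) - math.log(ppl_100)

print(f"example perplexity: {ppl:.3f}")
print(f"relative drop: {relative_drop:.1%}, NLL gain: {nll_gain:.3f} nats/token")
```

Under these assumptions, the drop from 6.42 to 2.88 is a relative reduction of roughly 55% in perplexity, or about 0.80 nats per token in mean negative log-likelihood.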
