CL Dec 4, 2025

AdiBhashaa: A Community-Curated Benchmark for Machine Translation into Indian Tribal Languages

arXiv:2512.04765v1 h-index: 1
Originality: Synthesis-oriented
AI Analysis

This work addresses the digital exclusion of tribal communities in India. It is a foundational step, though incremental in scope because it covers only four specific languages.

The paper tackles the lack of machine translation resources for Indian tribal languages by creating AdiBhashaa, a community-curated benchmark with parallel corpora and baseline systems for four languages. The result is the first open datasets and initial translation models for these languages.

Large language models and multilingual machine translation (MT) systems increasingly drive access to information, yet many languages of tribal communities remain effectively invisible in these technologies. This invisibility exacerbates existing structural inequities in education, governance, and digital participation. We present AdiBhashaa, a community-driven initiative that constructs the first open parallel corpora and baseline MT systems for four major Indian tribal languages: Bhili, Mundari, Gondi, and Santali. This work combines participatory data creation with native speakers, human-in-the-loop validation, and systematic evaluation of both encoder-decoder MT models and large language models. In addition to reporting technical findings, we articulate how AdiBhashaa illustrates a possible model for more equitable AI research: it centers local expertise, builds capacity among early-career researchers from marginalized communities, and foregrounds human validation in the development of language technologies.

