CL, AI · Oct 26, 2025

Integrating Linguistics and AI: Morphological Analysis and Corpus development of Endangered Toto Language of West Bengal

arXiv:2510.22629v1 · h-index: 1
Originality: Synthesis-oriented
AI Analysis

This work addresses language endangerment in the Toto-speaking community of West Bengal, India, by proposing an interdisciplinary, community-based model. Its contribution is incremental: it combines existing linguistic and AI methods rather than introducing new ones.

The paper tackles the preservation of the endangered Toto language by developing a trilingual language learning application and a morpheme-tagged corpus. The corpus is used to train a Small Language Model and a Transformer-based translation engine for digital archiving and revitalization.

Preserving linguistic diversity is necessary, as every language offers a distinct perspective on the world. There have been numerous global initiatives to preserve endangered languages through documentation. This paper is part of a project that aims to develop a trilingual (Toto-Bangla-English) language learning application to digitally archive and promote the endangered Toto language of West Bengal, India. The application, designed for both native Toto speakers and non-native learners, aims to revitalize the language by ensuring accessibility and usability through Unicode script integration and a structured language corpus. The research includes detailed linguistic documentation collected via fieldwork, followed by the creation of a morpheme-tagged, trilingual corpus used to train a Small Language Model (SLM) and a Transformer-based translation engine. The analysis covers inflectional morphology, such as person-number-gender agreement, tense-aspect-mood distinctions, and case marking, alongside derivational strategies that reflect word-class changes. Script standardization and digital literacy tools were also developed to enhance script usage. The study offers a sustainable model for preserving endangered languages by integrating traditional linguistic methodology with AI. This bridge between linguistic research and technological innovation highlights the value of interdisciplinary collaboration for community-based language revitalization.
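The abstract does not specify how the morpheme-tagged, trilingual corpus is stored. As a minimal sketch only, one entry might be represented as below. The class names, field names, and the `translation_pairs` helper are hypothetical; the strings are placeholders rather than real Toto, Bangla, or English data; and the `PST` tag is a generic glossing label, not the paper's tagset.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Morpheme:
    form: str   # surface segment of the word (placeholder here)
    gloss: str  # lexical gloss or grammatical tag, e.g. "PST" for past tense


@dataclass
class CorpusEntry:
    toto: str     # sentence or word in Toto (Unicode text)
    bangla: str   # Bangla translation
    english: str  # English translation
    morphemes: List[Morpheme] = field(default_factory=list)

    def gloss_line(self) -> str:
        """Join the morpheme tags into one hyphen-separated gloss string."""
        return "-".join(m.gloss for m in self.morphemes)


def translation_pairs(
    entries: List[CorpusEntry], source: str = "toto", target: str = "english"
) -> List[Tuple[str, str]]:
    """Extract (source, target) text pairs, e.g. as translation training data."""
    return [(getattr(e, source), getattr(e, target)) for e in entries]


# Placeholder entry: the angle-bracket strings stand in for real corpus text.
entry = CorpusEntry(
    toto="<toto-word>",
    bangla="<bangla-translation>",
    english="<english-translation>",
    morphemes=[Morpheme("<stem>", "go"), Morpheme("<suffix>", "PST")],
)

print(entry.gloss_line())
print(translation_pairs([entry]))
```

Keeping the three translations and the morpheme tags in one record would let the same corpus feed both a learning application and a translation model, which is the dual use the abstract describes.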

