Challenges in Developing LRs for Non-Scheduled Languages: A Case of Magahi
The work addresses the scarcity of resources for Magahi speakers and researchers, but its contribution is incremental: it applies existing methods to a new language.
The paper tackles the lack of language resources for Magahi, a non-scheduled Indo-Aryan language, by developing an annotated corpus using data from blogs, stories, and conversations, with POS tagging based on the BIS tagset.
Magahi is an Indo-Aryan language spoken mainly in the eastern parts of India. Despite its significant number of speakers, virtually no language resources (LRs) or language technologies (LTs) have been developed for the language, mainly because of its status as a non-scheduled language. The present paper describes an attempt to develop an annotated corpus of Magahi. The data is taken mainly from a couple of Magahi blogs, a collection of Magahi stories, and recordings of Magahi conversations, and it is annotated at the part-of-speech (POS) level using the BIS tagset.