Automatic Identification of Closely-related Indian Languages: Resources and Experiments
This work addresses language identification for closely-related Indian languages. Its contribution is incremental: it builds on existing methods, applies them to new data, and provides novel resources.
The paper tackles the automatic identification of five closely-related Indian languages and develops a system that achieves a state-of-the-art accuracy of 96.48%. It also presents the first data-based study of lexical similarity among these languages.
In this paper, we discuss an attempt to develop an automatic language identification system for five closely-related Indo-Aryan languages of India: Awadhi, Bhojpuri, Braj, Hindi and Magahi. We have compiled comparable corpora of varying lengths for these languages from various resources, and we discuss the method used to create these corpora in detail. Using these corpora, we developed a language identification system that currently gives a state-of-the-art accuracy of 96.48%. We also used these corpora to study the similarity between the five languages at the lexical level, which is the first data-based study of the extent of closeness of these languages.
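
As a rough illustration of what a lexical-level similarity study over such corpora might compute, the sketch below measures pairwise vocabulary overlap with the Jaccard coefficient. The abstract does not state which similarity measure was used, so the choice of Jaccard, the whitespace tokenisation, the function names, and the placeholder corpus (`lang_a`, `w1`, and so on) are all assumptions made for illustration, not the method of the paper.

```python
from itertools import combinations


def vocabulary(sentences):
    """Return the set of lowercased, whitespace-separated tokens."""
    return {tok.lower() for s in sentences for tok in s.split()}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_similarity(corpora):
    """Map each unordered language pair to the Jaccard overlap of its vocabularies."""
    vocabs = {lang: vocabulary(sents) for lang, sents in corpora.items()}
    return {
        (x, y): jaccard(vocabs[x], vocabs[y])
        for x, y in combinations(sorted(vocabs), 2)
    }


# Placeholder data only: hypothetical language labels and tokens,
# not drawn from the corpora described in the paper.
toy = {
    "lang_a": ["w1 w2 w3", "w2 w4"],
    "lang_b": ["w2 w3 w5"],
    "lang_c": ["w6 w7"],
}

scores = pairwise_similarity(toy)
for pair, score in scores.items():
    print(pair, round(score, 3))
```

With real data, each dictionary value would hold the sentences of one language's corpus. Because the corpora are of varying lengths, raw vocabulary overlap would be sensitive to corpus size, so sampling equal-sized subsets before comparison may be a sensible precaution.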