CL | Dec 8, 2023

First Attempt at Building Parallel Corpora for Machine Translation of Northeast India's Very Low-Resource Languages

Atnafu Lambebo Tonja, Melkamu Mersha, Ananya Kalita, Olga Kolesnikova, Jugal Kalita

arXiv:2312.04764v19.830 citations | h-index: 14 | ICON

Originality: Synthesis-oriented

AI Analysis

This work addresses machine translation for very low-resource languages of Northeast India. Its contribution is incremental: it extends the authors' prior corpus-building efforts for other languages to a new set of languages.

The paper tackles the lack of parallel corpora for thirteen very low-resource languages of Northeast India by creating the first bilingual datasets for them and reporting initial neural machine translation benchmarks. The abstract does not give specific numerical results.

This paper presents the creation of initial bilingual corpora for thirteen very low-resource languages of India, all from Northeast India. It also presents the results of initial translation efforts in these languages. It creates the first-ever parallel corpora for these languages and provides initial benchmark neural machine translation results for these languages. We intend to extend these corpora to include a large number of low-resource Indian languages and integrate the effort with our prior work with African and American-Indian languages to create corpora covering a large number of languages from across the world.
