CL LG Feb 2, 2022

L3Cube-MahaCorpus and MahaBERT: Marathi Monolingual Corpus, Marathi BERT Language Models, and Resources

arXiv:2202.01159v231.4585 citations Has Code

Originality: Synthesis-oriented

AI Analysis

This work provides essential open resources for Marathi NLP, filling a gap for a widely spoken Indian language, though the contribution is incremental: it applies existing methods to new data.

The authors address the lack of monolingual resources for Marathi by creating L3Cube-MahaCorpus, which adds 24.8M sentences and 289M tokens, and by training BERT-based models such as MahaBERT on the full 752M-token corpus. They show the effectiveness of these resources on sentiment analysis, text classification, and NER tasks.

We present L3Cube-MahaCorpus, a Marathi monolingual data set scraped from different internet sources. We expand the existing Marathi monolingual corpus with 24.8M sentences and 289M tokens. We further present MahaBERT, MahaAlBERT, and MahaRoBerta, all BERT-based masked language models, and MahaFT, the fastText word embeddings, all of them trained on the full Marathi corpus with 752M tokens. We show the effectiveness of these resources on downstream Marathi sentiment analysis, text classification, and named entity recognition (NER) tasks. We also release MahaGPT, a generative Marathi GPT model trained on the Marathi corpus. Marathi is a popular language in India but still lacks these resources. This work is a step forward in building open resources for the Marathi language. The data and models are available at https://github.com/l3cube-pune/MarathiNLP .
