CL, Jul 11, 2023

Vacaspati: A Diverse Corpus of Bangla Literature

arXiv:2307.05083v1, 125 citations, h-index: 21
Originality: Synthesis-oriented
AI Analysis

This paper addresses the shortage of NLP resources for Bangla, a widely spoken language, by providing a foundational corpus and efficient models, though the contribution is incremental in that it builds on existing methods.

The authors tackled the lack of a diverse, high-quality corpus for Bangla NLP by creating Vacaspati, a literary corpus of more than 11 million sentences and 115 million words. They also developed models such as Vac-BERT, which performs better than or comparably to other models on downstream tasks while using fewer parameters and resources.

Bangla (or Bengali) is the fifth most spoken language globally; yet, the state-of-the-art NLP in Bangla is lagging for even simple tasks such as lemmatization, POS tagging, etc. This is partly due to the lack of a varied, high-quality corpus. To alleviate this need, we build Vacaspati, a diverse corpus of Bangla literature. The literary works are collected from various websites; only those works that are publicly available without copyright violations or restrictions are collected. We believe that published literature captures the features of a language much better than newspapers, blogs or social media posts, which tend to follow only a certain literary pattern and, therefore, miss out on language variety. Our corpus Vacaspati is varied in multiple aspects, including type of composition, topic, author, time, space, etc. It contains more than 11 million sentences and 115 million words. We also built a word embedding model, Vac-FT, using FastText from Vacaspati and trained an Electra model, Vac-BERT, using the corpus. Vac-BERT has far fewer parameters and requires only a fraction of the resources compared to other state-of-the-art transformer models and yet performs either better or similarly on various downstream tasks. On multiple downstream tasks, Vac-FT outperforms other FastText-based models. We also demonstrate the efficacy of Vacaspati as a corpus by showing that similar models built from other corpora are not as effective. The models are available at https://bangla.iitk.ac.in/.
