CL, LG · Oct 11, 2024

L3Cube-MahaSum: A Comprehensive Dataset and BART Models for Abstractive Text Summarization in Marathi

Pranita Deshmukh, Nikita Kulkarni, Sanhita Kulkarni, Kareena Manghani, Raviraj Joshi

arXiv:2410.09184

Originality: Synthesis-oriented

AI Analysis

This work addresses the scarcity of NLP resources for Indic languages such as Marathi by providing a dataset and a model for researchers. The contribution is incremental, since it applies existing methods to new data.

The authors tackled the lack of resources for abstractive text summarization in Marathi by creating the MahaSUM dataset with 25k samples and training an IndicBART model, demonstrating its effectiveness in producing high-quality summaries.

We present the MahaSUM dataset, a large-scale collection of diverse news articles in Marathi, designed to facilitate the training and evaluation of models for abstractive summarization tasks in Indic languages. The dataset, containing 25k samples, was created by scraping articles from a wide range of online news sources and manually verifying the abstractive summaries. Additionally, we train an IndicBART model, a variant of the BART model tailored for Indic languages, using the MahaSUM dataset. We evaluate the performance of our trained models on the task of abstractive summarization and demonstrate their effectiveness in producing high-quality summaries in Marathi. Our work contributes to the advancement of natural language processing research in Indic languages and provides a valuable resource for future research in this area using state-of-the-art models. The dataset and models are shared publicly at https://github.com/l3cube-pune/MarathiNLP
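The abstract says the trained models are evaluated on abstractive summarization but does not name the metric here. As a rough illustration only, the sketch below scores a generated summary against a reference with a unigram-overlap F1, in the spirit of ROUGE-1. The function name `unigram_f1`, the whitespace tokenization, and the two toy Marathi strings are assumptions made for this example and are not taken from the paper or from MahaSUM.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate summary.

    Hypothetical helper: tokens are split on whitespace, which is an
    assumption and may differ from the tokenization used by the authors.
    """
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Toy strings invented for illustration; they are not MahaSUM samples.
reference = "पुण्यात आज जोरदार पाऊस झाला"
candidate = "पुण्यात आज पाऊस झाला"
score = unigram_f1(reference, candidate)
```

Here the candidate reuses four of the five reference tokens, so precision is 4/4 and recall is 4/5. A word-overlap score like this rewards copying and can undervalue valid paraphrases, which is worth keeping in mind for abstractive summaries.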
