CL AI · Nov 24, 2025

MultiBanAbs: A Comprehensive Multi-Domain Bangla Abstractive Text Summarization Dataset

Md. Tanzim Ferdous, Naeem Ahsan Chowdhury, Prithwiraj Bhattacharjee

arXiv:2511.19317v1

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for adaptable summarization systems that reduce information overload for readers of Bangla content. The contribution is incremental, since it applies existing methods to a new dataset.

The study tackles the lack of diverse Bangla text summarization datasets by creating MultiBanAbs, a dataset of over 54,000 Bangla articles and summaries drawn from multiple sources such as blogs and newspapers. It establishes baselines with models such as LSTM and BanglaT5-small to demonstrate the dataset's potential as a benchmark.

This study develops a new Bangla abstractive summarization dataset to support generating concise summaries of Bangla articles from diverse sources. Most existing studies in this field have concentrated on news articles, where journalists usually follow a fixed writing style. While such approaches are effective in limited contexts, they often fail to adapt to the varied nature of real-world Bangla texts. In today's digital era, a massive amount of Bangla content is continuously produced across blogs, newspapers, and social media. This creates a pressing need for summarization systems that can reduce information overload and help readers understand content more quickly. To address this challenge, we developed a dataset of over 54,000 Bangla articles and summaries collected from multiple sources, including blogs such as Cinegolpo and newspapers such as Samakal and The Business Standard. Unlike single-domain resources, our dataset spans multiple domains and writing styles, which gives it greater adaptability and practical relevance. To establish strong baselines, we trained and evaluated several deep learning and transfer learning models on this dataset, including LSTM, BanglaT5-small, and mT5-small. The results highlight its potential as a benchmark for future research in Bangla natural language processing. This dataset provides a solid foundation for building robust summarization systems and helps expand NLP resources for low-resource languages.
