CL, LG | Feb 16, 2025

ANCHOLIK-NER: A Benchmark Dataset for Bangla Regional Named Entity Recognition

Bidyarthi Paul, Faika Fairuj Preotee, Shuvashis Sarker, Shamim Rahim Refat, Shifat Islam, Tashreef Muhammad, Mohammad Ashraful Hoque, Shahriar Manzoor

arXiv:2502.11198v36.74 citations | h-index: 4 | PLoS ONE

Originality: Synthesis-oriented

AI Analysis

This work addresses a critical gap for NLP researchers and practitioners working with low-resource Bangla dialects, though it is incremental as it applies existing methods to new data.

The authors tackled the lack of Named Entity Recognition (NER) resources for Bangla regional dialects by introducing ANCHOLIK-NER, a benchmark dataset of 17,405 sentences across five regions, and found that BERT Base Multilingual Cased performed best with an F1-score of 82.611% in Mymensingh.

Named Entity Recognition (NER) in regional dialects is a critical yet underexplored area in Natural Language Processing (NLP), especially for low-resource languages like Bangla. While NER systems for Standard Bangla have made progress, no existing resources or models specifically address regional dialects such as Barishal, Chittagong, Mymensingh, Noakhali, and Sylhet, which exhibit unique linguistic features that existing models fail to handle effectively. To fill this gap, we introduce ANCHOLIK-NER, the first benchmark dataset for NER in Bangla regional dialects, comprising 17,405 sentences distributed across five regions. The dataset was sourced from publicly available resources and supplemented with manual translations, ensuring alignment of named entities across dialects. We evaluate three transformer-based models - Bangla BERT, Bangla BERT Base, and BERT Base Multilingual Cased - on this dataset. Our findings demonstrate that BERT Base Multilingual Cased performs best in recognizing named entities across regions, with notable performance in Mymensingh, where it reaches an F1-score of 82.611%. Despite strong overall performance, challenges remain in regions such as Chittagong, where the models show lower precision and recall. Since no previous NER systems for Bangla regional dialects exist, our work represents a foundational step in addressing this gap. Future work will focus on improving model performance in underperforming regions and expanding the dataset to include more dialects, supporting the development of dialect-aware NER systems.
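The abstract reports precision, recall, and F1-score but does not say how they were computed. As a rough illustration only, the sketch below computes entity-level (exact span match) scores from BIO-tagged sequences. The BIO scheme, the PER and LOC labels, and the function names are assumptions made for this example and are not taken from the paper, which may use token-level scoring or an existing evaluation library.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive); hypothetical representation
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        # Close the open span unless the current tag continues it.
        if label is not None and tag != f"I-{label}":
            spans.add((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # An I- tag with no matching open span is ignored (strict matching).
    return spans


def precision_recall_f1(
    gold: List[List[str]], pred: List[List[str]]
) -> Tuple[float, float, float]:
    """Micro-averaged entity-level scores over a list of sentences."""
    tp = n_pred = n_gold = 0
    for gold_tags, pred_tags in zip(gold, pred):
        gold_spans = extract_spans(gold_tags)
        pred_spans = extract_spans(pred_tags)
        tp += len(gold_spans & pred_spans)  # exact type and boundary match
        n_pred += len(pred_spans)
        n_gold += len(gold_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example with made-up tags: the model finds the person but misses the location.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
precision, recall, f1 = precision_recall_f1(gold, pred)
```

In this toy case the single predicted entity is correct, so precision is 1.0, while only one of the two gold entities is recovered, so recall is 0.5. Low recall of this kind is one way a region could score worse than others even when the entities the model does predict are mostly right.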
