CL AI, Nov 11, 2025

BARD10: A New Benchmark Reveals Significance of Bangla Stop-Words in Authorship Attribution

Abdullah Muhammad Moosa, Nusrat Sultana, Mahdi Muhammad Moosa, Md. Miraiz Hossain

arXiv:2511.08085v1, h-index: 17

Originality: Synthesis-oriented

AI Analysis

This work serves researchers working on Bangla authorship attribution by providing a new benchmark and insights into stop-word importance, though it is incremental in that it builds on existing methods and datasets.

The research tackled Bangla authorship attribution by introducing a new benchmark corpus, BARD10, and analyzing the effect of stop-word removal. It found that classical TF-IDF + SVM outperformed the deep learning models, with macro-F1 scores up to 0.997 on BAAD16 and 0.921 on BARD10, and that Bangla stop-words are crucial stylistic indicators with genre-dependent effects.

This research presents a comprehensive investigation into Bangla authorship attribution, introducing a new balanced benchmark corpus, BARD10 (Bangla Authorship Recognition Dataset of 10 authors), and systematically analyzing the impact of stop-word removal across classical and deep learning models to uncover the stylistic significance of Bangla stop-words. BARD10 is a curated corpus of Bangla blog and opinion prose from ten contemporary authors. Four representative classifiers are assessed methodically: SVM (Support Vector Machine), Bangla BERT (Bidirectional Encoder Representations from Transformers), XGBoost, and an MLP (Multilayer Perceptron), with uniform preprocessing on both BARD10 and the benchmark corpus BAAD16 (Bangla Authorship Attribution Dataset of 16 authors). On both datasets, the classical TF-IDF + SVM baseline outperformed the other models, attaining a macro-F1 score of 0.997 on BAAD16 and 0.921 on BARD10, while Bangla BERT lagged by as much as five points. The study reveals that BARD10 authors are highly sensitive to stop-word pruning, while BAAD16 authors remain comparatively robust, highlighting a genre-dependent reliance on stop-word signatures. Error analysis revealed that high-frequency components carry authorial signatures that transformer models diminish. Three insights are identified: Bangla stop-words serve as essential stylistic indicators; finely calibrated ML models prove effective within short-text limitations; and BARD10 connects formal literature with contemporary web dialogue, offering a reproducible benchmark for future long-context or domain-adapted transformers.
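
The two pieces of the evaluation that the abstract leans on, stop-word pruning and the macro-F1 score, can be illustrated with a minimal standard-library sketch. This is not the authors' code: the stop-word list is a tiny illustrative set chosen for this example, the whitespace tokenization is an assumption, and the function names `prune_stop_words` and `macro_f1` are hypothetical.

```python
from collections import Counter

# Illustrative stop-word set only (assumption); the paper's actual list is not given here.
STOP_WORDS = {"এবং", "ও", "কিন্তু", "যে", "এই"}


def prune_stop_words(text, stop_words=STOP_WORDS):
    """Drop whitespace-delimited tokens that appear in the stop-word set."""
    return " ".join(tok for tok in text.split() if tok not in stop_words)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(prune_stop_words("আমি এবং তুমি"))
    print(macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
```

Because macro-F1 averages per-author scores without weighting by class size, every author counts equally, which suits a balanced corpus such as BARD10. A stop-word ablation in this spirit would train and score the same classifier twice, once on the raw text and once on the output of `prune_stop_words`, and compare the two macro-F1 values.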
