Authorship Identification in Bengali Literature: a Comparative Analysis
This work addresses authorship attribution for Bengali texts, an understudied domain, but the contribution is incremental because it applies existing methods to new data.
The paper tackles authorship identification in Bengali literature by developing statistical and machine learning models based on stylistic features, with SVM achieving the best performance under 10-fold cross-validation.
Stylometry is the study of the distinctive linguistic styles and writing behaviors of individuals. It underlies core text categorization tasks such as authorship identification and plagiarism detection. Although a reasonable number of studies have been conducted for English, no major work has been done so far for Bengali. In this work, we present a demonstration of authorship identification for documents written in Bengali. We adopt a set of fine-grained stylistic features for analyzing the text and use them to develop two different models: a statistical similarity model consisting of three measures and their combination, and a machine learning model using Decision Tree, Neural Network, and SVM classifiers. Experimental results show that SVM outperforms the other state-of-the-art methods under 10-fold cross-validation. We also examine the relative importance of each stylistic feature and show that some of them remain consistently significant in every model used in this experiment.
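To make the idea of a statistical similarity model over stylistic features more concrete, the following is a minimal sketch rather than the method used in this work. The abstract does not list the features or name the three similarity measures, so the four features below (average word length, average sentence length, punctuation rate, type-token ratio), the use of cosine similarity, and all function names are illustrative assumptions. The only language-specific detail is that the danda (।) is treated as the Bengali sentence terminator.

```python
import math
import re

# Assumption: sentences end at the Bengali danda or at ? / !
SENTENCE_END = re.compile(r"[।?!]")
# Assumption: a small illustrative punctuation inventory
PUNCTUATION = "।,;:?!-\"'"


def stylistic_features(text):
    """Hypothetical feature vector: [average word length,
    average sentence length in words, punctuation marks per word,
    type-token ratio]. Word length is counted in Unicode code points."""
    sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
    words = [w.strip(PUNCTUATION) for w in text.split()]
    words = [w for w in words if w]
    if not words or not sentences:
        return [0.0, 0.0, 0.0, 0.0]
    punct_count = sum(1 for ch in text if ch in PUNCTUATION)
    return [
        sum(len(w) for w in words) / len(words),
        len(words) / len(sentences),
        punct_count / len(words),
        len(set(words)) / len(words),
    ]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def author_profile(documents):
    """Average the feature vectors of an author's known documents."""
    vectors = [stylistic_features(d) for d in documents]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def attribute(text, profiles):
    """Return the author whose profile is most similar to the text's features."""
    target = stylistic_features(text)
    return max(profiles, key=lambda name: cosine_similarity(target, profiles[name]))
```

In such a sketch, `profiles` would be a dictionary mapping each candidate author to the output of `author_profile` on that author's training documents. Because the features are on different scales, a practical system would likely normalize them before comparison; that step is omitted here for brevity.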