CL

Apr 11, 2025

Integrated ensemble of BERT- and features-based models for authorship attribution in Japanese literary works

Taisei Kanda, Mingzhe Jin, Wataru Zaitsu

arXiv:2504.08527v1

h-index: 5

Originality: Incremental advance

AI Analysis

Aimed at researchers in computational linguistics, this work offers an incremental improvement in authorship attribution by combining existing methods for small-sample tasks.

The study tackled authorship attribution in Japanese literary works with small samples by integrating BERT-based and traditional feature-based models, resulting in an integrated ensemble that improved the F1 score by approximately 14 points compared to the best single model on a corpus not included in pre-training data.

Traditionally, authorship attribution (AA) tasks relied on statistical data analysis and classification based on stylistic features extracted from texts. In recent years, pre-trained language models (PLMs) have attracted significant attention in text classification tasks. However, although they demonstrate excellent performance on large-scale short-text datasets, their effectiveness remains under-explored for small samples, particularly in AA tasks. Additionally, a key challenge is how to effectively leverage PLMs in conjunction with traditional feature-based methods to advance AA research. In this study, we aimed to significantly improve performance on a small-sample AA task using an integrated ensemble of traditional feature-based and modern PLM-based methods. For the experiment, we used two corpora of literary works to classify 10 authors each. The results indicate that BERT is effective, even for small-sample AA tasks. Both BERT-based and classifier ensembles outperformed their respective stand-alone models, and the integrated ensemble approach further improved the scores significantly. For the corpus that was not included in the pre-training data, the integrated ensemble improved the F1 score by approximately 14 points, compared to the best-performing single model. Our methodology provides a viable solution for the efficient use of the ever-expanding array of data processing tools in the foreseeable future.
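
The abstract does not say how the BERT-based and feature-based outputs are combined. As a rough illustration only, the sketch below assumes soft voting: each model produces a class-probability matrix, the matrices are averaged, and the most probable author is chosen. The function name `soft_vote` and the toy probabilities are hypothetical and are not taken from the paper.

```python
import numpy as np

def soft_vote(prob_matrices, weights=None):
    """Average class-probability matrices from several models.

    prob_matrices: list of arrays, each of shape (n_samples, n_classes).
    weights: optional per-model weights (assumption: equal weights if omitted).
    Returns (predicted_labels, averaged_probabilities).
    """
    stacked = np.stack(prob_matrices)  # shape: (n_models, n_samples, n_classes)
    avg = np.average(stacked, axis=0, weights=weights)
    return avg.argmax(axis=1), avg

# Hypothetical probabilities for 3 texts and 3 candidate authors.
bert_probs = np.array([[0.6, 0.3, 0.1],
                       [0.2, 0.5, 0.3],
                       [0.4, 0.35, 0.25]])
feature_probs = np.array([[0.5, 0.4, 0.1],
                          [0.1, 0.3, 0.6],
                          [0.2, 0.6, 0.2]])

labels, avg = soft_vote([bert_probs, feature_probs])
print(labels)
```

Passing `weights` (one value per model) would let a stronger model count for more; whether the authors weight their models is not stated in the abstract.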
