CL · CR · Mar 24

Foundational Study on Authorship Attribution of Japanese Web Reviews for Actor Analysis

Hiroshi Matsubara, Shingo Matsugaya, Taichi Aoki, Masaki Hashimoto

arXiv:2604.16376 · 25.6 · h-index: 7

Predicted impact top 94% in CL · last 90 days · Originality: Synthesis-oriented

AI Analysis

For threat intelligence analysts, this work provides a foundational comparison of attribution methods on Japanese text, but results are incremental and domain-specific.

This study evaluates authorship attribution methods for Japanese web reviews to support actor analysis in threat intelligence. BERT fine-tuning performed best with small author sets, but TF-IDF with logistic regression was more stable and accurate when scaling to hundreds of authors.

This study investigates the applicability of authorship attribution based on stylistic features to support actor analysis in threat intelligence. As a foundational step toward future application to dark web forums, we conducted experiments using Japanese review data from clear web sources. We constructed datasets from Rakuten Ichiba reviews and compared four methods: TF-IDF with logistic regression (TF-IDF+LR), BERT embeddings with logistic regression (BERT-Emb+LR), BERT fine-tuning (BERT-FT), and metric learning with $k$-nearest neighbors (Metric+kNN). Results showed that BERT-FT achieved the best performance; however, training became unstable as the number of authors scaled to several hundred, where TF-IDF+LR proved superior in terms of accuracy, stability, and computational cost. Furthermore, Top-$k$ evaluation demonstrated the utility of candidate screening, and error analysis revealed that boilerplate text, topic dependency, and short text length were primary factors causing misclassification.
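The Top-$k$ evaluation mentioned in the abstract can be illustrated with a minimal sketch. It assumes each review comes with a per-author score from some classifier, and counts a review as a hit when its true author is among the $k$ highest-scored candidates. The function name `top_k_accuracy`, the author labels, and the scores are hypothetical and are not taken from the paper.

```python
from typing import Dict, Sequence


def top_k_accuracy(
    scores: Sequence[Dict[str, float]],
    true_authors: Sequence[str],
    k: int,
) -> float:
    """Fraction of texts whose true author is among the k highest-scored candidates."""
    if len(scores) != len(true_authors):
        raise ValueError("scores and true_authors must have the same length")
    if not scores:
        raise ValueError("need at least one scored text")
    if k < 1:
        raise ValueError("k must be at least 1")

    hits = 0
    for author_scores, truth in zip(scores, true_authors):
        # Rank candidates by descending score; break ties by name for determinism.
        ranked = sorted(author_scores, key=lambda a: (-author_scores[a], a))
        if truth in ranked[:k]:
            hits += 1
    return hits / len(scores)


# Hypothetical scores for three reviews over four candidate authors (illustrative only).
example_scores = [
    {"A": 0.70, "B": 0.20, "C": 0.05, "D": 0.05},
    {"A": 0.40, "B": 0.35, "C": 0.20, "D": 0.05},
    {"A": 0.10, "B": 0.30, "C": 0.25, "D": 0.35},
]
example_truth = ["A", "B", "C"]

for k in (1, 2, 3):
    print(k, top_k_accuracy(example_scores, example_truth, k))
```

In this toy data the hit rate rises as $k$ grows, which is the sense in which a ranked shortlist can be useful for candidate screening even when the single top prediction is wrong.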
