AI · Jan 29, 2023

BERT-based Authorship Attribution on the Romanian Dataset called ROST

arXiv:2301.12500v1 · 1 citation · h-index: 1
Originality: Synthesis-oriented
AI Analysis

This work addresses authorship attribution for Romanian language texts, which is an incremental application of existing methods to a new dataset.

The authors tackled authorship attribution for Romanian texts using a BERT-based model on an unbalanced dataset, achieving results that sometimes exceed 87% macro-accuracy.

Although the problem of Authorship Attribution has been studied for decades, it remains an active focus of research. Among the more recent instruments are pre-trained language models, the most prevalent being BERT. Here we use such a model to detect the authorship of texts written in the Romanian language. The dataset is highly unbalanced: there are significant differences in the number of texts per author, the sources from which the texts were collected, the time period in which the authors lived and wrote, the medium for which the texts were intended (paper or online), and the type of writing (stories, short stories, fairy tales, novels, literary articles, and sketches). The results are better than expected, sometimes exceeding 87% macro-accuracy.
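The abstract reports macro-accuracy on an unbalanced dataset but does not define the metric. The sketch below assumes it means the unweighted mean of per-author accuracy (each author counts equally, regardless of how many texts they have), which may differ from the authors' exact definition. The function names and the toy labels are made up for illustration and are not taken from the paper or from ROST.

```python
from collections import defaultdict


def macro_accuracy(y_true, y_pred):
    """Unweighted mean of per-author accuracy (assumed definition)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    # Average the per-author scores so every author has equal weight.
    return sum(correct[a] / total[a] for a in total) / len(total)


def overall_accuracy(y_true, y_pred):
    """Fraction of all texts attributed correctly (dominated by prolific authors)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data: author "A" has 8 texts, author "B" has only 2.
y_true = ["A"] * 8 + ["B"] * 2
# A degenerate classifier that always predicts the majority author.
y_pred = ["A"] * 10

print(overall_accuracy(y_true, y_pred))  # 8 of 10 texts correct
print(macro_accuracy(y_true, y_pred))    # mean of 8/8 for "A" and 0/2 for "B"
```

On this toy input the overall accuracy looks respectable while the macro-averaged score exposes that one author is never recognised, which is why a macro-averaged figure is the more informative one to quote when the number of texts per author varies widely.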

