CL · Sep 30, 2022

PART: Pre-trained Authorship Representation Transformer

Javier Huertas-Tato, Alejandro Martin, David Camacho

arXiv:2209.15373v23.418 citations · h-index: 14 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses authorship attribution for researchers and practitioners by improving performance on out-of-domain authors. The advance is incremental, as it builds on existing contrastive learning and stylometric representation techniques.

The paper tackles the problem of learning authorship embeddings from text to identify authors across domains. It reports 72.39% zero-shot accuracy on a bounded set of 250 authors, which the abstract describes as 54% and 56% higher than RoBERTa embeddings.

Authors writing documents imprint identifying information within their texts: vocabulary, register, punctuation, misspellings, or even emoji usage. Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors. Using stylometric representations is more suitable, but this by itself is an open research challenge. In this paper, we propose PART, a contrastively trained model fit to learn authorship embeddings instead of semantics. We train our model on ~1.5M texts belonging to 1162 literature authors, 17287 blog posters and 135 corporate email accounts; a heterogeneous set with identifiable writing styles. We evaluate the model on current challenges, achieving competitive performance. We also evaluate our model on test splits of the datasets, achieving 72.39% zero-shot accuracy when bounded to 250 authors, 54% and 56% higher than RoBERTa embeddings. We qualitatively assess the representations with different data visualizations on the available datasets, observing features such as gender, age, or occupation of the author.
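To make the idea of contrastively trained authorship embeddings more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not state the exact loss, so a generic InfoNCE-style objective is assumed, and the function names (`cosine`, `info_nce`, `attribute`), the temperature value, and the toy 2-D vectors are all hypothetical. In practice the embeddings would come from a trained encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE-style loss (assumed objective, not confirmed by the paper).

    anchors[i] and positives[i] are embeddings of two texts by the same
    author; every positives[j] with j != i acts as a negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

def attribute(query, candidates):
    """Zero-shot attribution: pick the candidate author whose reference
    embedding is most similar to the query embedding."""
    return max(candidates, key=lambda name: cosine(query, candidates[name]))

# Toy data: two authors with clearly separated "styles".
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # same-author pairs line up
swapped = [[0.1, 0.9], [0.9, 0.1]]   # same-author pairs are mismatched

loss_aligned = info_nce(anchors, aligned)
loss_swapped = info_nce(anchors, swapped)
predicted = attribute([0.8, 0.2], {"author_a": [1.0, 0.0], "author_b": [0.0, 1.0]})
```

The loss is lower when texts by the same author sit close together in embedding space, which is the property that lets a nearest-neighbour lookup such as `attribute` assign a text to one of a bounded set of candidate authors without training a classifier on them.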
