CL | Sep 30, 2022

PART: Pre-trained Authorship Representation Transformer

arXiv:2209.15373v2 | 18 citations | h-index: 14
Originality: Incremental advance
AI Analysis

This work addresses authorship attribution for researchers and practitioners by improving out-of-domain performance. The advance is incremental: it builds on contrastive learning and stylometric representations.

The paper tackles the problem of learning authorship embeddings from text in order to identify authors across domains. It reports 72.39% zero-shot accuracy when the candidate set is bounded to 250 authors, which the abstract describes as 54% and 56% higher than RoBERTa embeddings.
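The bounded zero-shot setting can be illustrated with a minimal sketch. It assumes embeddings have already been computed by some encoder, and that a query text is assigned to the candidate author whose mean reference embedding is most cosine-similar. The function names and the centroid-based decision rule are assumptions made for illustration; the summary above does not state the exact evaluation protocol.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def zero_shot_attribution(query_emb, reference_emb, reference_authors):
    """Assign each query to the author with the most similar mean reference embedding.

    Assumption: nearest-centroid matching; the paper may use a different rule.
    """
    reference_authors = np.asarray(reference_authors)
    authors = np.unique(reference_authors)
    # One centroid per candidate author (e.g. 250 candidates in the bounded setting).
    centroids = np.stack(
        [reference_emb[reference_authors == a].mean(axis=0) for a in authors]
    )
    sims = l2_normalize(query_emb) @ l2_normalize(centroids).T
    return authors[sims.argmax(axis=1)]

def accuracy(predicted, true):
    """Fraction of queries attributed to the correct author."""
    return float(np.mean(np.asarray(predicted) == np.asarray(true)))
```

Because no classifier is trained on the candidate authors, the approach is zero-shot with respect to them: only embeddings of a few known texts per author are needed.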

Authors writing documents imprint identifying information within their texts: vocabulary, register, punctuation, misspellings, or even emoji usage. Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors. Using stylometric representations is more suitable, but this by itself is an open research challenge. In this paper, we propose PART, a contrastively trained model fit to learn authorship embeddings instead of semantics. We train our model on ~1.5M texts belonging to 1,162 literature authors, 17,287 blog posters, and 135 corporate email accounts: a heterogeneous set with identifiable writing styles. We evaluate the model on current challenges, achieving competitive performance. We also evaluate our model on test splits of the datasets, achieving a zero-shot accuracy of 72.39% when bounded to 250 authors, 54% and 56% higher than RoBERTa embeddings. We qualitatively assess the representations with different data visualizations on the available datasets, observing features such as gender, age, or occupation of the author.
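To make "contrastively trained" concrete, here is a minimal NumPy sketch of an InfoNCE-style objective, one common contrastive loss. It assumes row i of `anchors` and row i of `positives` embed two different texts by the same author, and that the other rows in the batch act as negatives. The loss form, the `temperature` value, and the batch construction are illustrative assumptions; the abstract does not specify which contrastive objective PART uses.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss over a batch of (anchor, positive) embedding pairs.

    Assumption: row i of `anchors` and row i of `positives` share an author;
    every other row in `positives` is treated as a negative for anchor i.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature                      # pairwise cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy where the correct "class" for anchor i is positive i.
    return float(-np.mean(np.diag(log_probs)))
```

Minimizing this quantity pulls same-author embeddings together and pushes different-author embeddings apart, which is the sense in which the learned space encodes authorship rather than topic.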

Code Implementations: 1 repo
