The Importance of Suppressing Domain Style in Authorship Analysis
This work addresses a critical issue in authorship analysis for forensic and literary applications by revealing and mitigating domain bias, though the contribution is incremental because it builds on existing methods.
The paper tackles authorship analysis by showing that common character trigram features are highly susceptible to domain information, with accuracy drops of up to 55.4 percentage points under domain swapping, and proposes a domain-adversarial learning remedy that reduces the accuracy loss to 3.6%.
The prerequisite of many approaches to authorship analysis is a representation of writing style. But despite decades of research, it remains unclear to what extent commonly used and widely accepted representations like character trigram frequencies actually represent an author's writing style, in contrast to more domain-specific style components or even topic. We address this shortcoming for the first time in a novel experimental setup with fixed authors but swapped domains between training and testing. With this setup, we reveal that approaches using character trigram features are highly prone to favoring domain information when applied without attention to domains, suffering drops of up to 55.4 percentage points in classification accuracy under domain swapping. We further propose a new remedy based on domain-adversarial learning and compare it to remedies from the literature based on heuristic rules. Both can work well, reducing accuracy losses under domain swapping to 3.6% and 3.9%, respectively.
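To make the setup concrete, the following is a minimal sketch of character trigram frequencies and of one possible reading of the "fixed authors, swapped domains" split. It is not the authors' implementation: the function names, the `(author, domain, text)` tuple layout, and the `train_domain` mapping are assumptions introduced here for illustration only.

```python
from collections import Counter


def char_trigram_freqs(text):
    """Relative frequencies of overlapping character trigrams in `text`."""
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(counts.values())
    # Texts shorter than three characters yield no trigrams at all.
    return {gram: n / total for gram, n in counts.items()} if total else {}


def domain_swap_split(docs, train_domain):
    """Split documents so every author is tested on an unseen domain.

    `docs` is a list of (author, domain, text) tuples and `train_domain`
    maps each author to the domain used for training. Both are
    hypothetical structures assumed for this sketch.
    """
    train = [d for d in docs if d[1] == train_domain[d[0]]]
    test = [d for d in docs if d[1] != train_domain[d[0]]]
    return train, test


def accuracy_drop_pp(acc_same_domain, acc_swapped):
    """Accuracy loss in percentage points, given accuracies in [0, 1]."""
    return 100.0 * (acc_same_domain - acc_swapped)
```

Under these assumptions, a classifier would be fitted on the trigram vectors of `train` and scored on `test`; comparing that score with the accuracy obtained when training and test domains match gives the kind of loss the abstract reports in percentage points.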