Higher Criticism for Discriminating Word-Frequency Tables and Testing Authorship
The work provides a simple, effective method for authorship identification, with applications in fields such as forensics and literary analysis. Its contribution is incremental in that it adapts an existing statistical test.
The authors tackled authorship attribution by adapting the Higher Criticism goodness-of-fit test to measure closeness between word-frequency tables, achieving state-of-the-art accuracy in various challenges without handcrafting or tuning.
We adapt the Higher Criticism (HC) goodness-of-fit test to measure the closeness between word-frequency tables. We apply this measure to authorship attribution challenges, where the goal is to identify the author of a document using other documents whose authorship is known. The method is simple yet performs well without handcrafting or tuning, reporting state-of-the-art accuracy in various current challenges. As an inherent side effect, the HC calculation identifies a subset of discriminating words. In practice, the identified words have low variance across documents belonging to a corpus of homogeneous authorship. We conclude that, when comparing a new document with a corpus by a single author, HC is mostly affected by words characteristic of the author and is relatively unaffected by topic structure.
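
To make the idea concrete, here is a minimal Python sketch of how an HC-based closeness measure between two word-frequency tables might be computed. The abstract does not spell out the procedure, so every detail is an assumption made for illustration: the per-word exact binomial test, the null proportion n1 / (n1 + n2), the particular HC normalization, the cutoff fraction `gamma`, and the names `binom_pvalue` and `hc_distance` are all hypothetical. Words whose p-values fall at or below the p-value where HC attains its maximum are returned as the discriminating subset.

```python
import math
from collections import Counter


def binom_pvalue(k, n, p):
    """Two-sided exact binomial p-value: total probability of all
    outcomes that are no more likely than the observed count k."""
    def logpmf(j):
        return (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
                + j * math.log(p) + (n - j) * math.log1p(-p))

    logs = [logpmf(j) for j in range(n + 1)]
    observed = logs[k]
    total = sum(math.exp(v) for v in logs if v <= observed + 1e-9)
    return min(1.0, total)


def hc_distance(counts1, counts2, gamma=0.5):
    """Return (HC score, discriminating words) for two word-count tables.

    Assumed scheme (not taken from the source text): one binomial
    p-value per word, then
        HC = max over i <= gamma*N of
             sqrt(N) * (i/N - p_(i)) / sqrt(i/N * (1 - i/N)),
    where p_(1) <= ... <= p_(N) are the sorted p-values.
    """
    n1, n2 = sum(counts1.values()), sum(counts2.values())
    if n1 == 0 or n2 == 0:
        raise ValueError("both tables must contain at least one word")
    p_null = n1 / (n1 + n2)  # assumed null: shares proportional to table sizes

    pvals = {}
    for word in set(counts1) | set(counts2):
        x, y = counts1.get(word, 0), counts2.get(word, 0)
        if x + y > 0:
            pvals[word] = binom_pvalue(x, x + y, p_null)

    ordered = sorted(pvals.values())
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two distinct words")

    best_score, best_index = -math.inf, 1
    i_max = min(n - 1, max(1, int(gamma * n)))  # keep i/N strictly below 1
    for i in range(1, i_max + 1):
        frac = i / n
        score = math.sqrt(n) * (frac - ordered[i - 1]) / math.sqrt(frac * (1 - frac))
        if score > best_score:
            best_score, best_index = score, i

    threshold = ordered[best_index - 1]
    discriminating = sorted(w for w, pv in pvals.items() if pv <= threshold)
    return best_score, discriminating


# Toy tables (invented counts, for illustration only).
doc_a = Counter({"the": 60, "of": 40, "and": 50, "upon": 30})
doc_b = Counter({"the": 60, "of": 40, "and": 50, "while": 30})
score, words = hc_distance(doc_a, doc_b)
```

In an attribution setting, one would presumably compute this score between the new document and the pooled table of each candidate author and pick the author with the smallest score. That decision rule is likewise an assumption of this sketch.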