CL, Aug 15, 2012

More than Word Frequencies: Authorship Attribution via Natural Frequency Zoned Word Distribution Analysis

arXiv:1208.3001v1, 8 citations
Originality: Incremental advance
AI Analysis

This paper addresses the problem of verifying the authorship of digital texts for applications such as plagiarism detection. It appears incremental, since it builds on existing text style analysis methods.

The paper tackles authorship attribution of digital texts by introducing natural frequency zoned word distribution analysis (NFZ-WDA), which analyzes word distribution patterns beyond simple frequencies, and demonstrates its efficiency through experimental studies.

With the increasing popularity and availability of digital text data, the authorship of digital texts cannot be taken for granted, owing to the ease of copying and parsing. This paper presents a new text style analysis called natural frequency zoned word distribution analysis (NFZ-WDA), followed by a basic authorship attribution scheme and an open authorship attribution scheme for digital texts based on that analysis. NFZ-WDA rests on the observation that all authors leave distinct intrinsic word usage traces in the texts they write, and that these intrinsic styles can be identified and used to analyze authorship. The intrinsic word usage styles can be estimated by analyzing word distribution within a text, which goes beyond ordinary word frequency analysis and can be expressed as three questions: which groups of words are used in the text; how frequently each group of words occurs; and how the occurrences of each group of words are distributed in the text. The basic and open authorship attribution schemes then provide solutions for both the closed and open authorship attribution problems. Through analysis and extensive experimental studies, this paper demonstrates the efficiency of the proposed method for authorship attribution.
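The three questions in the abstract (which word groups, how often, how distributed) can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the abstract does not define the zones, the distribution measure, or the decision rule. Everything below is an assumption made for illustration. That includes the hypothetical `zone_table` mapping words to zone indices, the use of gaps between successive occurrences as the distribution measure, Euclidean distance between profiles, and the `threshold` that turns a closed-set decision into an open-set one.

```python
import math
import re
from statistics import mean, pstdev


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def zone_profile(text, zone_table, n_zones):
    """Return [share, mean gap, gap spread] for each zone, concatenated.

    zone_table is a hypothetical word -> zone index mapping (an assumption,
    not the paper's definition); unknown words fall into the last zone.
    """
    tokens = tokenize(text)
    total = len(tokens)
    positions = [[] for _ in range(n_zones)]
    for i, word in enumerate(tokens):
        positions[zone_table.get(word, n_zones - 1)].append(i)

    profile = []
    for pos in positions:
        # How frequently the zone occurs, as a share of all tokens.
        share = len(pos) / total if total else 0.0
        # How occurrences are spread: gaps between successive positions,
        # divided by text length so texts of different sizes are comparable.
        gaps = [b - a for a, b in zip(pos, pos[1:])]
        mean_gap = mean(gaps) / total if gaps else 0.0
        gap_spread = pstdev(gaps) / total if gaps else 0.0
        profile.extend([share, mean_gap, gap_spread])
    return profile


def attribute(text, author_profiles, zone_table, n_zones, threshold=None):
    """Pick the nearest author profile by Euclidean distance.

    With threshold=None this is a closed-set decision. With a threshold,
    a text too far from every known author yields None (open-set).
    """
    query = zone_profile(text, zone_table, n_zones)
    best_author, best_dist = None, math.inf
    for author, profile in author_profiles.items():
        dist = math.dist(query, profile)
        if dist < best_dist:
            best_author, best_dist = author, dist
    if threshold is not None and best_dist > threshold:
        return None
    return best_author


# Toy data, invented for illustration only.
zones = {"the": 0, "of": 0, "cat": 1}
profiles = {
    "A": zone_profile("the cat of the dog", zones, 3),
    "B": zone_profile("cat cat cat the cat", zones, 3),
}
print(attribute("the cat of the dog", profiles, zones, 3))
print(attribute("dog dog dog dog", profiles, zones, 3, threshold=0.1))
```

In a realistic setting the zone table would presumably be derived from word frequencies in a large reference corpus, and the threshold would be tuned on held-out texts. Both choices are left open here because the abstract does not specify them.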

