Representation of texts as complex networks: a mesoscopic approach
This work addresses the need for better semantic analysis in text mining. Its approach could benefit applications in natural language processing and document understanding, though it appears incremental, since it builds upon existing network-based methods.
The authors tackle the problem of representing the topical structure of texts at a mesoscopic level, which existing word-adjacency methods often overlook. They devise a network model that analyzes documents in a multi-scale fashion and reveals semantic traits, as demonstrated through a qualitative analysis of 'Alice's Adventures in Wonderland' and a machine learning classification task.
Statistical techniques that analyze texts, referred to as text analytics, have moved from simple word-count statistics towards a new paradigm. Text mining now hinges on a more sophisticated set of methods, including representations based on complex networks. While well-established word-adjacency (co-occurrence) methods successfully capture syntactic features of written texts, they are unable to represent important aspects of textual data, such as its topical structure, i.e. the sequence of subjects developing at a mesoscopic level along the text. Such aspects are often overlooked by current methodologies. In order to capture the mesoscopic characteristics of semantic content in written texts, we devised a network model that analyzes documents in a multi-scale fashion. In the proposed model, each node represents a limited number of adjacent paragraphs, and two nodes are connected whenever they share a minimum amount of semantic content. To illustrate the capabilities of our model, we present, as a case example, a qualitative analysis of "Alice's Adventures in Wonderland". We show that the mesoscopic structure of a document, modeled as a network, reveals many semantic traits of texts. Such an approach paves the way to a myriad of semantic-based applications. In addition, our approach is illustrated in a machine learning context, in which texts are classified as either real texts or randomized instances.
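
The construction described above (nodes as groups of adjacent paragraphs, linked when they share enough semantic content) can be illustrated with a minimal sketch. The abstract does not specify how semantic content is measured, so everything below is an assumption made for illustration only: the overlapping sliding window, the bag-of-words cosine similarity, the default values of `window_size` and `threshold`, and the function names themselves are hypothetical and are not the authors' implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def window_vectors(paragraphs, window_size):
    """Build one bag-of-words vector per window of `window_size` adjacent paragraphs.

    Assumption: windows slide one paragraph at a time, so consecutive windows overlap.
    """
    vectors = []
    for start in range(len(paragraphs) - window_size + 1):
        words = []
        for paragraph in paragraphs[start:start + window_size]:
            words.extend(tokenize(paragraph))
        vectors.append(Counter(words))
    return vectors


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_mesoscopic_network(paragraphs, window_size=2, threshold=0.3):
    """Return (number_of_nodes, edges) for the paragraph-window network.

    Each node is a window of adjacent paragraphs. Two nodes are linked when
    their similarity reaches `threshold`, which stands in here for the
    "minimum semantic content" criterion (an assumed similarity measure).
    """
    vectors = window_vectors(paragraphs, window_size)
    edges = set()
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                edges.add((i, j))
    return len(vectors), edges
```

Varying `window_size` changes the scale at which the text is examined, which is one plausible reading of the multi-scale analysis mentioned in the abstract. With overlapping windows, neighbouring nodes share paragraphs and are therefore similar by construction, so a real analysis would need to account for that when choosing the threshold.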