CLMay 4, 2022
Using virtual edges to extract keywords from texts modeled as complex networksJorge A. V. Tohalino, Thiago C. Silva, Diego R. Amancio
Detecting keywords in texts is important for many text mining applications. Graph-based methods have been commonly used to automatically find the key concepts in texts, however, relevant information provided by embeddings has not been widely used to enrich the graph structure. Here we modeled texts co-occurrence networks, where nodes are words and edges are established either by contextual or semantical similarity. We compared two embedding approaches -- Word2vec and BERT -- to check whether edges created via word embeddings can improve the quality of the keyword extraction method. We found that, in fact, the use of virtual edges can improve the discriminability of co-occurrence networks. The best performance was obtained when we considered low percentages of addition of virtual (embedding) edges. A comparative analysis of structural and dynamical network metrics revealed the degree, PageRank, and accessibility are the metrics displaying the best performance in the model enriched with virtual edges.
DLApr 10
Auditing automated research assessment: an interpretable machine learning approach to validate funding criteriaRafael P. Gouveia, Thiago C. Silva, Diego R. Amancio
This paper empirically examines the practical validity of the official evaluation criteria underpinning the Research Productivity (PQ) Grant framework, as governed by the Brazilian National Council for Scientific and Technological Development (CNPq). By operationalizing regulatory dimensions (including bibliographic output, human resource training, and scientific recognition) as measurable variables extracted from CVs and OpenAlex bibliometric data, we treat policy-defined indicators as testable hypotheses rather than a priori assumptions. Using a block-based adaptation of the Boruta feature selection algorithm across several machine learning classifiers, we evaluate the statistical contribution of each dimension in distinguishing grant levels, with a focus on identifying top-tier (Level 1A) researchers. Our models achieve high predictive performance, with mean AUC scores reaching 0.96, indicating that PQ levels carry a robust and structured statistical signal. However, explanatory power is heavily concentrated within a limited subset of features, specifically bibliographic production, graduate-level supervision and institutional management roles. Conversely, several criteria explicitly emphasized in the regulations demonstrated no detectable statistical contribution to classification outcomes. These findings reveal a potential misalignment between the formal regulatory framework and the effective signals driving evaluation outcomes, suggesting that the practical evaluative signal is substantially more compact than officially stated and providing evidence-based insights for the refinement and transparency of research assessment policies.
CLOct 5, 2022
Using Full-Text Content to Characterize and Identify Best Seller BooksGiovana D. da Silva, Filipi N. Silva, Henrique F. de Arruda et al.
Artistic pieces can be studied from several perspectives, one example being their reception among readers over time. In the present work, we approach this interesting topic from the standpoint of literary works, particularly assessing the task of predicting whether a book will become a best seller. Dissimilarly from previous approaches, we focused on the full content of books and considered visualization and classification tasks. We employed visualization for the preliminary exploration of the data structure and properties, involving SemAxis and linear discriminant analyses. Then, to obtain quantitative and more objective results, we employed various classifiers. Such approaches were used along with a dataset containing (i) books published from 1895 to 1924 and consecrated as best sellers by the Publishers Weekly Bestseller Lists and (ii) literary works published in the same period but not being mentioned in that list. Our comparison of methods revealed that the best-achieved result - combining a bag-of-words representation with a logistic regression classifier - led to an average accuracy of 0.75 both for the leave-one-out and 10-fold cross-validations. Such an outcome suggests that it is unfeasible to predict the success of books with high accuracy using only the full content of the texts. Nevertheless, our findings provide insights into the factors leading to the relative success of a literary work.
SIJun 30, 2022
Recovering network topology and dynamics via sequence characterizationLucas Guerreiro, Filipi N. Silva, Diego R. Amancio
Sequences arise in many real-world scenarios; thus, identifying the mechanisms behind symbol generation is essential to understanding many complex systems. This paper analyzes sequences generated by agents walking on a networked topology. Given that in many real scenarios, the underlying processes generating the sequence is hidden, we investigate whether the reconstruction of the network via the co-occurrence method is useful to recover both the network topology and agent dynamics generating sequences. We found that the characterization of reconstructed networks provides valuable information regarding the process and topology used to create the sequences. In a machine learning approach considering 16 combinations of network topology and agent dynamics as classes, we obtained an accuracy of 87% with sequences generated with less than 40% of nodes visited. Larger sequences turned out to generate improved machine learning models. Our findings suggest that the proposed methodology could be extended to classify sequences and understand the mechanisms behind sequence generation.
DLMay 18
From Node2Vec to GPT-based GraphRAG: scientific impact prediction across graph and language modelsAdilson Vital, Filipi N. Silva, Diego R. Amancio
Identifying which newly published scientific papers are likely to become highly cited is important for prioritizing research attention, supporting editorial decisions, and guiding the allocation of scientific resources, particularly under cold-start conditions where little direct evidence is available at publication time. In this work, we formulate impact prediction as a cohort-normalized top-P% classification task and compare graph-based and LLM-based approaches under a unified framework. We construct citation and textual-similarity graphs under temporal constraints and generate Node2Vec representations, either alone or combined with OpenAI text embeddings. The best supervised configuration combines directed citation graphs with textual embeddings, reaching approximately 0.84-0.85 AUC. We also evaluate a GPT-based GraphRAG setup, using GPT 5.5 and 5.4 Nano, in which graph neighborhoods are used as contextual evidence for prediction. Although the LLM-based approach achieves high performance, retrieved context does not consistently improve results; target-only prompts often perform as well as or better than GraphRAG prompts achieving the 0.87 mark. These findings indicate that structural and textual signals are complementary for supervised prediction, while retrieval augmentation must be carefully evaluated against simpler LLM baselines.
GNApr 11, 2024
Machine learning and economic forecasting: the role of international trade networksThiago C. Silva, Paulo V. B. Wilhelm, Diego R. Amancio
This study examines the effects of de-globalization trends on international trade networks and their role in improving forecasts for economic growth. Using section-level trade data from nearly 200 countries from 2010 to 2022, we identify significant shifts in the network topology driven by rising trade policy uncertainty. Our analysis highlights key global players through centrality rankings, with the United States, China, and Germany maintaining consistent dominance. Using a horse race of supervised regressors, we find that network topology descriptors evaluated from section-specific trade networks substantially enhance the quality of a country's GDP growth forecast. We also find that non-linear models, such as Random Forest, XGBoost, and LightGBM, outperform traditional linear models used in the economics literature. Using SHAP values to interpret these non-linear model's predictions, we find that about half of most important features originate from the network descriptors, underscoring their vital role in refining forecasts. Moreover, this study emphasizes the significance of recent economic performance, population growth, and the primary sector's influence in shaping economic growth predictions, offering novel insights into the intricacies of economic growth forecasting.
LGFeb 6, 2025
Graph machine learning for flight delay prediction due to holding manouverJorge L. Franco, Manoel V. Machado Neto, Filipe A. N. Verri et al.
Flight delays due to holding maneuvers are a critical and costly phenomenon in aviation, driven by the need to manage air traffic congestion and ensure safety. Holding maneuvers occur when aircraft are instructed to circle in designated airspace, often due to factors such as airport congestion, adverse weather, or air traffic control restrictions. This study models the prediction of flight delays due to holding maneuvers as a graph problem, leveraging advanced Graph Machine Learning (Graph ML) techniques to capture complex interdependencies in air traffic networks. Holding maneuvers, while crucial for safety, cause increased fuel usage, emissions, and passenger dissatisfaction, making accurate prediction essential for operational efficiency. Traditional machine learning models, typically using tabular data, often overlook spatial-temporal relations within air traffic data. To address this, we model the problem of predicting holding as edge feature prediction in a directed (multi)graph where we apply both CatBoost, enriched with graph features capturing network centrality and connectivity, and Graph Attention Networks (GATs), which excel in relational data contexts. Our results indicate that CatBoost outperforms GAT in this imbalanced dataset, effectively predicting holding events and offering interpretability through graph-based feature importance. Additionally, we discuss the model's potential operational impact through a web-based tool that allows users to simulate real-time delay predictions. This research underscores the viability of graph-based approaches for predictive analysis in aviation, with implications for enhancing fuel efficiency, reducing delays, and improving passenger experience.
DLMay 27, 2025
Leveraging GANs for citation intent classification and its impact on citation network analysisDavi A. Bezerra, Filipi N. Silva, Diego R. Amancio
Citations play a fundamental role in the scientific ecosystem, serving as a foundation for tracking the flow of knowledge, acknowledging prior work, and assessing scholarly influence. In scientometrics, they are also central to the construction of quantitative indicators. Not all citations, however, serve the same function: some provide background, others introduce methods, or compare results. Therefore, understanding citation intent allows for a more nuanced interpretation of scientific impact. In this paper, we adopted a GAN-based method to classify citation intents. Our results revealed that the proposed method achieves competitive classification performance, closely matching state-of-the-art results with substantially fewer parameters. This demonstrates the effectiveness and efficiency of leveraging GAN architectures combined with contextual embeddings in intent classification task. We also investigated whether filtering citation intents affects the centrality of papers in citation networks. Analyzing the network constructed from the unArXiv dataset, we found that paper rankings can be significantly influenced by citation intent. All four centrality metrics examined- degree, PageRank, closeness, and betweenness - were sensitive to the filtering of citation types. The betweenness centrality displayed the greatest sensitivity, showing substantial changes in ranking when specific citation intents were removed.
CLDec 3, 2024
Probing the statistical properties of enriched co-occurrence networksDiego R. Amancio, Jeaneth Machicao, Laura V. C. Quispe
Recent studies have explored the addition of virtual edges to word co-occurrence networks using word embeddings to enhance graph representations, particularly for short texts. While these enriched networks have demonstrated some success, the impact of incorporating semantic edges into traditional co-occurrence networks remains uncertain. This study investigates two key statistical properties of text-based network models. First, we assess whether network metrics can effectively distinguish between meaningless and meaningful texts. Second, we analyze whether these metrics are more sensitive to syntactic or semantic aspects of the text. Our results show that incorporating virtual edges can have positive and negative effects, depending on the specific network metric. For instance, the informativeness of the average shortest path and closeness centrality improves in short texts, while the clustering coefficient's informativeness decreases as more virtual edges are added. Additionally, we found that including stopwords affects the statistical properties of enriched networks. Our results can serve as a guideline for determining which network metrics are most appropriate for specific applications, depending on the typical text size and the nature of the problem.
CLJan 17, 2022
Text characterization based on recurrence networksBárbara C. e Souza, Filipi N. Silva, Henrique F. de Arruda et al.
Several complex systems are characterized by presenting intricate characteristics taking place at several scales of time and space. These multiscale characterizations are used in various applications, including better understanding diseases, characterizing transportation systems, and comparison between cities, among others. In particular, texts are also characterized by a hierarchical structure that can be approached by using multi-scale concepts and methods. The multiscale properties of texts constitute a subject worth further investigation. In addition, more effective approaches to text characterization and analysis can be obtained by emphasizing words with potentially more informational content. The present work aims at developing these possibilities while focusing on mesoscopic representations of networks. More specifically, we adopt an extension to the mesoscopic approach to represent text narratives, in which only the recurrent relationships among tagged parts of speech (subject, verb and direct object) are considered to establish connections among sequential pieces of text (e.g., paragraphs). The characterization of the texts was then achieved by considering scale-dependent complementary methods: accessibility, symmetry and recurrence signatures. In order to evaluate the potential of these concepts and methods, we approached the problem of distinguishing between literary genres (fiction and non-fiction). A set of 300 books organized into the two genres was considered and were compared by using the aforementioned approaches. All the methods were capable of differentiating to some extent between the two genres. The accessibility and symmetry reflected the narrative asymmetries, while the recurrence signature provided a more direct indication about the non-sequential semantic connections taking place along the narrative.
CLJul 18, 2021
A pattern recognition approach for distinguishing between prose and poetryHenrique F. de Arruda, Sandro M. Reia, Filipi N. Silva et al.
Poetry and prose are written artistic expressions that help us to appreciate the reality we live. Each of these styles has its own set of subjective properties, such as rhyme and rhythm, which are easily caught by a human reader's eye and ear. With the recent advances in artificial intelligence, the gap between humans and machines may have decreased, and today we observe algorithms mastering tasks that were once exclusively performed by humans. In this paper, we propose an automated method to distinguish between poetry and prose based solely on aural and rhythmic properties. In other to compare prose and poetry rhythms, we represent the rhymes and phones as temporal sequences and thus we propose a procedure for extracting rhythmic features from these sequences. The classification of the considered texts using the set of features extracted resulted in a best accuracy of 0.78, obtained with a neural network. Interestingly, by using an approach based on complex networks to visualize the similarities between the different texts considered, we found that the patterns of poetry vary much more than prose. Consequently, a much richer and complex set of rhythmic possibilities tends to be found in that modality.
DLJun 20, 2021
On predicting research grants productivityJorge A. V. Tohalino, Diego R. Amancio
Understanding the reasons associated with successful proposals is of paramount importance to improve evaluation processes. In this context, we analyzed whether bibliometric features are able to predict the success of research grants. We extracted features aiming at characterizing the academic history of Brazilian researchers, including research topics, affiliations, number of publications and visibility. The extracted features were then used to predict grants productivity via machine learning in three major research areas, namely Medicine, Dentistry and Veterinary Medicine. We found that research subject and publication history play a role in predicting productivity. In addition, institution-based features turned out to be relevant when combined with other features. While the best results outperformed text-based attributes, the evaluated features were not highly discriminative. Our findings indicate that predicting grants success, at least with the considered set of bibliometric features, is not a trivial task.
CLOct 13, 2020
Language Networks: a Practical ApproachJorge A. V. Tohalino, Diego R. Amancio
This manuscript provides a short and practical introduction to the topic of language networks. This text aims at assisting researchers with no practical experience in text and/or network analysis. We provide a practical tutorial on how to model and characterize texts using network-based features. In this tutorial, we also include examples of pre-processing and network representations. A brief description of the main tasks allying network science and text analysis is also provided. A further development of this text shall include a practical description of network classification via machine learning methods.
CLMar 13, 2020
Using word embeddings to improve the discriminability of co-occurrence text networksLaura V. C. Quispe, Jorge A. V. Tohalino, Diego R. Amancio
Word co-occurrence networks have been employed to analyze texts both in the practical and theoretical scenarios. Despite the relative success in several applications, traditional co-occurrence networks fail in establishing links between similar words whenever they appear distant in the text. Here we investigate whether the use of word embeddings as a tool to create virtual links in co-occurrence networks may improve the quality of classification systems. Our results revealed that the discriminability in the stylometry task is improved when using Glove, Word2Vec and FastText. In addition, we found that optimized results are obtained when stopwords are not disregarded and a simple global thresholding strategy is used to establish virtual links. Because the proposed approach is able to improve the representation of texts as complex networks, we believe that it could be extended to study other natural language processing tasks. Likewise, theoretical languages studies could benefit from the adopted enriched representation of word co-occurrence networks.
CLMay 18, 2019
Semantic flow in language networksEdilson A. Corrêa, Vanessa Q. Marinho, Diego R. Amancio
In this study we propose a framework to characterize documents based on their semantic flow. The proposed framework encompasses a network-based model that connected sentences based on their semantic similarity. Semantic fields are detected using standard community detection methods. as the story unfolds, transitions between semantic fields are represent in Markov networks, which in turned are characterized via network motifs (subgraphs). Here we show that the proposed framework can be used to classify books according to their style and publication dates. Remarkably, even without a systematic optimization of parameters, philosophy and investigative books were discriminated with an accuracy rate of 92.5%. Because this model captures semantic features of texts, it could be used as an additional feature in traditional network-based models of texts that capture only syntactical/stylistic information, as it is the case of word adjacency (co-occurrence) networks.
CLJun 22, 2018
Paragraph-based complex networks: application to document classification and authenticity verificationHenrique F. de Arruda, Vanessa Q. Marinho, Luciano da F. Costa et al.
With the increasing number of texts made available on the Internet, many applications have relied on text mining tools to tackle a diversity of problems. A relevant model to represent texts is the so-called word adjacency (co-occurrence) representation, which is known to capture mainly syntactical features of texts.In this study, we introduce a novel network representation that considers the semantic similarity between paragraphs. Two main properties of paragraph networks are considered: (i) their ability to incorporate characteristics that can discriminate real from artificial, shuffled manuscripts and (ii) their ability to capture syntactical and semantic textual features. Our results revealed that real texts are organized into communities, which turned out to be an important feature for discriminating them from artificial texts. Interestingly, we have also found that, differently from traditional co-occurrence networks, the adopted representation is able to capture semantic features. Additionally, the proposed framework was employed to analyze the Voynich manuscript, which was found to be compatible with texts written in natural languages. Taken together, our findings suggest that the proposed methodology can be combined with traditional network models to improve text classification tasks.
CLMar 22, 2018
Word sense induction using word embeddings and community detection in complex networksEdilson A. Corrêa, Diego R. Amancio
Word Sense Induction (WSI) is the ability to automatically induce word senses from corpora. The WSI task was first proposed to overcome the limitations of manually annotated corpus that are required in word sense disambiguation systems. Even though several works have been proposed to induce word senses, existing systems are still very limited in the sense that they make use of structured, domain-specific knowledge sources. In this paper, we devise a method that leverages recent findings in word embeddings research to generate context embeddings, which are embeddings containing information about the semantical context of a word. In order to induce senses, we modeled the set of ambiguous words as a complex network. In the generated network, two instances (nodes) are connected if the respective context embeddings are similar. Upon using well-established community detection methods to cluster the obtained context embeddings, we found that the proposed method yields excellent performance for the WSI task. Our method outperformed competing algorithms and baselines, in a completely unsupervised manner and without the need of any additional structured knowledge source.
CLNov 7, 2017
Extractive Multi-document Summarization Using Multilayer NetworksJorge V. Tohalino, Diego R. Amancio
Huge volumes of textual information has been produced every single day. In order to organize and understand such large datasets, in recent years, summarization techniques have become popular. These techniques aims at finding relevant, concise and non-redundant content from such a big data. While network methods have been adopted to model texts in some scenarios, a systematic evaluation of multilayer network models in the multi-document summarization task has been limited to a few studies. Here, we evaluate the performance of a multilayer-based method to select the most relevant sentences in the context of an extractive multi document summarization (MDS) task. In the adopted model, nodes represent sentences and edges are created based on the number of shared words between sentences. Differently from previous studies in multi-document summarization, we make a distinction between edges linking sentences from different documents (inter-layer) and those connecting sentences from the same document (intra-layer). As a proof of principle, our results reveal that such a discrimination between intra- and inter-layer in a multilayered representation is able to improve the quality of the generated summaries. This piece of information could be used to improve current statistical methods and related textual models.
CLAug 24, 2017
An Image Analysis Approach to the Calligraphy of BooksHenrique F. de Arruda, Vanessa Q. Marinho, Thales S. Lima et al.
Text network analysis has received increasing attention as a consequence of its wide range of applications. In this work, we extend a previous work founded on the study of topological features of mesoscopic networks. Here, the geometrical properties of visualized networks are quantified in terms of several image analysis techniques and used as subsidies for authorship attribution. It was found that the visual features account for performance similar to that achieved by using topological measurements. In addition, the combination of these two types of features improved the performance.
CLAug 5, 2017
Extractive Multi Document Summarization using Dynamical Measurements of Complex NetworksJorge V. Tohalino, Diego R. Amancio
Due to the large amount of textual information available on Internet, it is of paramount relevance to use techniques that find relevant and concise content. A typical task devoted to the identification of informative sentences in documents is the so called extractive document summarization task. In this paper, we use complex network concepts to devise an extractive Multi Document Summarization (MDS) method, which extracts the most central sentences from several textual sources. In the proposed model, texts are represented as networks, where nodes represent sentences and the edges are established based on the number of shared words. Differently from previous works, the identification of relevant terms is guided by the characterization of nodes via dynamical measurements of complex networks, including symmetry, accessibility and absorption time. The evaluation of the proposed system revealed that excellent results were obtained with particular dynamical measurements, including those based on the exploration of networks via random walks.
CLMay 29, 2017
On the "Calligraphy" of BooksVanessa Q. Marinho, Henrique F. de Arruda, Thales S. Lima et al.
Authorship attribution is a natural language processing task that has been widely studied, often by considering small order statistics. In this paper, we explore a complex network approach to assign the authorship of texts based on their mesoscopic representation, in an attempt to capture the flow of the narrative. Indeed, as reported in this work, such an approach allowed the identification of the dominant narrative structure of the studied authors. This has been achieved due to the ability of the mesoscopic approach to take into account relationships between different, not necessarily adjacent, parts of the text, which is able to capture the story flow. The potential of the proposed approach has been illustrated through principal component analysis, a comparison with the chance baseline method, and network visualization. Such visualizations reveal individual characteristics of the authors, which can be understood as a kind of calligraphy.
CLMay 11, 2017
On the role of words in the network structure of texts: application to authorship attributionCamilo Akimushkin, Diego R. Amancio, Osvaldo N. Oliveira
Well-established automatic analyses of texts mainly consider frequencies of linguistic units, e.g. letters, words and bigrams, while methods based on co-occurrence networks consider the structure of texts regardless of the nodes label (i.e. the words semantics). In this paper, we reconcile these distinct viewpoints by introducing a generalized similarity measure to compare texts which accounts for both the network structure of texts and the role of individual words in the networks. We use the similarity measure for authorship attribution of three collections of books, each composed of 8 authors and 10 books per author. High accuracy rates were obtained with typical values from 90% to 98.75%, much higher than with the traditional the TF-IDF approach for the same collections. These accuracies are also higher than taking only the topology of networks into account. We conclude that the different properties of specific words on the macroscopic scale structure of a whole text are as relevant as their frequency of appearance; conversely, considering the identity of nodes brings further knowledge about a piece of text represented as a network.
CLMay 1, 2017
Labelled network subgraphs reveal stylistic subtleties in written textsVanessa Q. Marinho, Graeme Hirst, Diego R. Amancio
The vast amount of data and increase of computational capacity have allowed the analysis of texts from several perspectives, including the representation of texts as complex networks. Nodes of the network represent the words, and edges represent some relationship, usually word co-occurrence. Even though networked representations have been applied to study some tasks, such approaches are not usually combined with traditional models relying upon statistical paradigms. Because networked models are able to grasp textual patterns, we devised a hybrid classifier, called labelled subgraphs, that combines the frequency of common words with small structures found in the topology of the network, known as motifs. Our approach is illustrated in two contexts, authorship attribution and translationese identification. In the former, a set of novels written by different authors is analyzed. To identify translationese, texts from the Canadian Hansard and the European parliament were classified as to original and translated instances. Our results suggest that labelled subgraphs are able to represent texts and it should be further explored in other tasks, such as the analysis of text complexity, language proficiency, and machine translation.
CLApr 26, 2017
Enriching Complex Networks with Word Embeddings for Detecting Mild Cognitive Impairment from Speech TranscriptsLeandro B. dos Santos, Edilson A. Corrêa, Osvaldo N. Oliveira et al.
Mild Cognitive Impairment (MCI) is a mental disorder difficult to diagnose. Linguistic features, mainly from parsers, have been used to detect MCI, but this is not suitable for large-scale assessments. MCI disfluencies produce non-grammatical speech that requires manual or high precision automatic correction of transcripts. In this paper, we modeled transcripts into complex networks and enriched them with word embedding (CNE) to better represent short texts produced in neuropsychological assessments. The network measurements were applied with well-known classifiers to automatically identify MCI in transcripts, in a binary classification task. A comparison was made with the performance of traditional approaches using Bag of Words (BoW) and linguistic features for three datasets: DementiaBank in English, and Cinderella and Arizona-Battery in Portuguese. Overall, CNE provided higher accuracy than using only complex networks, while Support Vector Machine was superior to other classifiers. CNE provided the highest accuracies for DementiaBank and Cinderella, but BoW was more efficient for the Arizona-Battery dataset probably owing to its short narratives. The approach using linguistic features yielded higher accuracy if the transcriptions of the Cinderella dataset were manually revised. Taken together, the results indicate that complex networks enriched with embedding is promising for detecting MCI in large-scale assessments
LGDec 26, 2016
Clustering Algorithms: A Comparative ApproachMayra Z. Rodriguez, Cesar H. Comin, Dalcimar Casanova et al.
Many real-world systems can be studied in terms of pattern recognition tasks, so that proper use (and understanding) of machine learning methods in practical applications becomes essential. While a myriad of classification methods have been proposed, there is no consensus on which methods are more suitable for a given dataset. As a consequence, it is important to comprehensively compare methods in many possible scenarios. In this context, we performed a systematic comparison of 7 well-known clustering methods available in the R language. In order to account for the many possible variations of data, we considered artificial datasets with several tunable properties (number of classes, separation between classes, etc). In addition, we also evaluated the sensitivity of the clustering methods with regard to their parameters configuration. The results revealed that, when considering the default configurations of the adopted methods, the spectral approach usually outperformed the other clustering algorithms. We also found that the default configuration of the adopted implementations was not accurate. In these cases, a simple approach based on random selection of parameters values proved to be a good alternative to improve the performance. All in all, the reported approach provides subsidies guiding the choice of clustering algorithms.
CLOct 20, 2016
Authorship Attribution Based on Life-Like Network AutomataJeaneth Machicao, Edilson A. Corrêa, Gisele H. B. Miranda et al.
The authorship attribution is a problem of considerable practical and technical interest. Several methods have been designed to infer the authorship of disputed documents in multiple contexts. While traditional statistical methods based solely on word counts and related measurements have provided a simple, yet effective solution in particular cases; they are prone to manipulation. Recently, texts have been successfully modeled as networks, where words are represented by nodes linked according to textual similarity measurements. Such models are useful to identify informative topological patterns for the authorship recognition task. However, there is no consensus on which measurements should be used. Thus, we proposed a novel method to characterize text networks, by considering both topological and dynamical aspects of networks. Using concepts and methods from cellular automata theory, we devised a strategy to grasp informative spatio-temporal patterns from this model. Our experiments revealed an outperformance over traditional analysis relying only on topological measurements. Remarkably, we have found a dependence of pre-processing steps (such as the lemmatization) on the obtained results, a feature that has mostly been disregarded in related works. The optimized results obtained here pave the way for a better characterization of textual networks.
CLJul 29, 2016
Text authorship identified using the dynamics of word co-occurrence networksCamilo Akimushkin, Diego R. Amancio, Osvaldo N. Oliveira
The identification of authorship in disputed documents still requires human expertise, which is now unfeasible for many tasks owing to the large volumes of text and authors in practical applications. In this study, we introduce a methodology based on the dynamics of word co-occurrence networks representing written texts to classify a corpus of 80 texts by 8 authors. The texts were divided into sections with equal number of linguistic tokens, from which time series were created for 12 topological metrics. The series were proven to be stationary (p-value>0.05), which permits to use distribution moments as learning attributes. With an optimized supervised learning procedure using a Radial Basis Function Network, 68 out of 80 texts were correctly classified, i.e. a remarkable 85% author matching success rate. Therefore, fluctuations in purely dynamic network metrics were found to characterize authorship, thus opening the way for the description of texts in terms of small evolving networks. Moreover, the approach introduced allows for comparison of texts with diverse characteristics in a simple, fast fashion.
CLJun 30, 2016
Representation of texts as complex networks: a mesoscopic approachHenrique F. de Arruda, Filipi N. Silva, Vanessa Q. Marinho et al.
Statistical techniques that analyze texts, referred to as text analytics, have departed from the use of simple word count statistics towards a new paradigm. Text mining now hinges on a more sophisticated set of methods, including the representations in terms of complex networks. While well-established word-adjacency (co-occurrence) methods successfully grasp syntactical features of written texts, they are unable to represent important aspects of textual data, such as its topical structure, i.e. the sequence of subjects developing at a mesoscopic level along the text. Such aspects are often overlooked by current methodologies. In order to grasp the mesoscopic characteristics of semantical content in written texts, we devised a network model which is able to analyze documents in a multi-scale fashion. In the proposed model, a limited amount of adjacent paragraphs are represented as nodes, which are connected whenever they share a minimum semantical content. To illustrate the capabilities of our model, we present, as a case example, a qualitative analysis of "Alice's Adventures in Wonderland". We show that the mesoscopic structure of a document, modeled as a network, reveals many semantic traits of texts. Such an approach paves the way to a myriad of semantic-based applications. In addition, our approach is illustrated in a machine learning context, in which texts are classified among real texts and randomized instances.
CLJun 25, 2016
Word sense disambiguation: a complex network approachEdilson A. Correa, Alneu de Andrade Lopes, Diego R. Amancio
In recent years, concepts and methods of complex networks have been employed to tackle the word sense disambiguation (WSD) task by representing words as nodes, which are connected if they are semantically similar. Despite the increasingly number of studies carried out with such models, most of them use networks just to represent the data, while the pattern recognition performed on the attribute space is performed using traditional learning techniques. In other words, the structural relationship between words have not been explicitly used in the pattern recognition process. In addition, only a few investigations have probed the suitability of representations based on bipartite networks and graphs (bigraphs) for the problem, as many approaches consider all possible links between words. In this context, we assess the relevance of a bipartite network model representing both feature words (i.e. the words characterizing the context) and target (ambiguous) words to solve ambiguities in written texts. Here, we focus on the semantical relationships between these two type of words, disregarding the relationships between feature words. In special, the proposed method not only serves to represent texts as graphs, but also constructs a structure on which the discrimination of senses is accomplished. Our results revealed that the proposed learning algorithm in such bipartite networks provides excellent results mostly when topical features are employed to characterize the context. Surprisingly, our method even outperformed the support vector machine algorithm in particular cases, with the advantage of being robust even if a small training dataset is available. Taken together, the results obtained here show that the proposed representation/classification method might be useful to improve the semantical characterization of written texts.
SOC-PHJun 17, 2016
Complex systems: features, similarity and connectivityCesar H. Comin, Thomas K. DM. Peron, Filipi N. Silva et al.
The increasing interest in complex networks research has been a consequence of several intrinsic features of this area, such as the generality of the approach to represent and model virtually any discrete system, and the incorporation of concepts and methods deriving from many areas, from statistical physics to sociology, which are often used in an independent way. Yet, for this same reason, it would be desirable to integrate these various aspects into a more coherent and organic framework, which would imply in several benefits normally allowed by the systematization in science, including the identification of new types of problems and the cross-fertilization between fields. More specifically, the identification of the main areas to which the concepts frequently used in complex networks can be applied paves the way to adopting and applying a larger set of concepts and methods deriving from those respective areas. Among the several areas that have been used in complex networks research, pattern recognition, optimization, linear algebra, and time series analysis seem to play a more basic and recurrent role. In the present manuscript, we propose a systematic way to integrate the concepts from these diverse areas regarding complex networks research. In order to do so, we start by grouping the multidisciplinary concepts into three main groups, namely features, similarity, and network connectivity. Then we show that several of the analysis and modeling approaches to complex networks can be thought as a composition of maps between these three groups, with emphasis on nine main types of mappings, which are presented and illustrated. Such a systematization of principles and approaches also provides an opportunity to review some of the most closely related works in the literature, which is also developed in this article.
CLDec 4, 2015
Topic segmentation via community detection in complex networksHenrique F. de Arruda, Luciano da F. Costa, Diego R. Amancio
Many real systems have been modelled in terms of network concepts, and written texts are a particular example of information networks. In recent years, the use of network methods to analyze language has allowed the discovery of several interesting findings, including the proposition of novel models to explain the emergence of fundamental universal patterns. While syntactical networks, one of the most prevalent networked models of written texts, display both scale-free and small-world properties, such representation fails in capturing other textual features, such as the organization in topics or subjects. In this context, we propose a novel network representation whose main purpose is to capture the semantical relationships of words in a simple way. To do so, we link all words co-occurring in the same semantic context, which is defined in a threefold way. We show that the proposed representations favours the emergence of communities of semantically related words, and this feature may be used to identify relevant topics. The proposed methodology to detect topics was applied to segment selected Wikipedia articles. We have found that, in general, our methods outperform traditional bag-of-words representations, which suggests that a high-level textual representation may be useful to study semantical features of texts.
CLSep 17, 2015
Network analysis of named entity co-occurrences in written textsDiego R. Amancio
The use of methods borrowed from statistics and physics to analyze written texts has allowed the discovery of unprecedent patterns of human behavior and cognition by establishing links between models features and language structure. While current models have been useful to unveil patterns via analysis of syntactical and semantical networks, only a few works have probed the relevance of investigating the structure arising from the relationship between relevant entities such as characters, locations and organizations. In this study, we represent entities appearing in the same context as a co-occurrence network, where links are established according to a null model based on random, shuffled texts. Computational simulations performed in novels revealed that the proposed model displays interesting topological features, such as the small world feature, characterized by high values of clustering coefficient. The effectiveness of our model was verified in a practical pattern recognition task in real networks. When compared with traditional word adjacency networks, our model displayed optimized results in identifying unknown references in texts. Because the proposed representation plays a complementary role in characterizing unstructured documents via topological analysis of named entities, we believe that it could be useful to improve the characterization of written texts (and related systems), specially if combined with traditional approaches based on statistical and deeper paradigms.
CLJul 28, 2015
Classifying informative and imaginative prose using complex networksHenrique F. de Arruda, Luciano da F. Costa, Diego R. Amancio
Statistical methods have been widely employed in recent years to grasp many language properties. The application of such techniques have allowed an improvement of several linguistic applications, which encompasses machine translation, automatic summarization and document classification. In the latter, many approaches have emphasized the semantical content of texts, as it is the case of bag-of-word language models. This approach has certainly yielded reasonable performance. However, some potential features such as the structural organization of texts have been used only on a few studies. In this context, we probe how features derived from textual structure analysis can be effectively employed in a classification task. More specifically, we performed a supervised classification aiming at discriminating informative from imaginative documents. Using a networked model that describes the local topological/dynamical properties of function words, we achieved an accuracy rate of up to 95%, which is much higher than similar networked approaches. A systematic analysis of feature relevance revealed that symmetry and accessibility measurements are among the most prominent network measurements. Our results suggest that these measurements could be used in related language applications, as they play a complementary role in characterizing texts.
CLJun 30, 2015
A complex network approach to stylometryDiego R. Amancio
Statistical methods have been widely employed to study the fundamental properties of language. In recent years, methods from complex and dynamical systems proved useful to create several language models. Despite the large amount of studies devoted to represent texts with physical models, only a limited number of studies have shown how the properties of the underlying physical systems can be employed to improve the performance of natural language processing tasks. In this paper, I address this problem by devising complex networks methods that are able to improve the performance of current statistical methods. Using a fuzzy classification strategy, I show that the topological properties extracted from texts complement the traditional textual description. In several cases, the performance obtained with hybrid approaches outperformed the results obtained when only traditional or networked methods were used. Because the proposed model is generic, the framework devised here could be straightforwardly used to study similar textual applications where the topology plays a pivotal role in the description of the interacting agents.
CLJun 18, 2015
Comparing the writing style of real and artificial papersDiego R. Amancio
Recent years have witnessed the increase of competition in science. While promoting the quality of research in many cases, an intense competition among scientists can also trigger unethical scientific behaviors. To increase the total number of published papers, some authors even resort to software tools that are able to produce grammatical, but meaningless scientific manuscripts. Because automatically generated papers can be misunderstood as real papers, it becomes of paramount importance to develop means to identify these scientific frauds. In this paper, I devise a methodology to distinguish real manuscripts from those generated with SCIGen, an automatic paper generator. Upon modeling texts as complex networks (CN), it was possible to discriminate real from fake papers with at least 89\% of accuracy. A systematic analysis of features relevance revealed that the accessibility and betweenness were useful in particular cases, even though the relevance depended upon the dataset. The successful application of the methods described here show, as a proof of principle, that network features can be used to identify scientific gibberish papers. In addition, the CN-based approach can be combined in a straightforward fashion with traditional statistical language processing methods to improve the performance in identifying artificially generated papers.
CLApr 9, 2015
Concentric network symmetry grasps authors' styles in word adjacency networksDiego R. Amancio, Filipi N. Silva, Luciano da F. Costa
Several characteristics of written texts have been inferred from statistical analysis derived from networked models. Even though many network measurements have been adapted to study textual properties at several levels of complexity, some textual aspects have been disregarded. In this paper, we study the symmetry of word adjacency networks, a well-known representation of text as a graph. A statistical analysis of the symmetry distribution performed in several novels showed that most of the words do not display symmetric patterns of connectivity. More specifically, the merged symmetry displayed a distribution similar to the ubiquitous power-law distribution. Our experiments also revealed that the studied metrics do not correlate with other traditional network measurements, such as the degree or betweenness centrality. The effectiveness of the symmetry measurements was verified in the authorship attribution task. Interestingly, we found that specific authors prefer particular types of symmetric motifs. As a consequence, the authorship of books could be accurately identified in 82.5% of the cases, in a dataset comprising books written by 8 authors. Because the proposed measurements for text analysis are complementary to the traditional approach, they can be used to improve the characterization of text networks, which might be useful for related applications, such as those relying on the identification of topical words and information retrieval.
CLFeb 4, 2015
Authorship recognition via fluctuation analysis of network topology and word intermittencyDiego R. Amancio
Statistical methods have been widely employed in many practical natural language processing applications. More specifically, complex networks concepts and methods from dynamical systems theory have been successfully applied to recognize stylistic patterns in written texts. Despite the large amount of studies devoted to represent texts with physical models, only a few studies have assessed the relevance of attributes derived from the analysis of stylistic fluctuations. Because fluctuations represent a pivotal factor for characterizing a myriad of real systems, this study focused on the analysis of the properties of stylistic fluctuations in texts via topological analysis of complex networks and intermittency measurements. The results showed that different authors display distinct fluctuation patterns. In particular, it was found that it is possible to identify the authorship of books using the intermittency of specific words. Taken together, the results described here suggest that the patterns found in stylistic fluctuations could be used to analyze other related complex systems. Furthermore, the discovery of novel patterns related to textual stylistic fluctuations indicates that these patterns could be useful to improve the state of the art of many stylistic-based natural language processing tasks.
CLDec 29, 2014
Probing the topological properties of complex networks modeling short written textsDiego R. Amancio
In recent years, graph theory has been widely employed to probe several language properties. More specifically, the so-called word adjacency model has been proven useful for tackling several practical problems, especially those relying on textual stylistic analysis. The most common approach to treat texts as networks has simply considered either large pieces of texts or entire books. This approach has certainly worked well -- many informative discoveries have been made this way -- but it raises an uncomfortable question: could there be important topological patterns in small pieces of texts? To address this problem, the topological properties of subtexts sampled from entire books was probed. Statistical analyzes performed on a dataset comprising 50 novels revealed that most of the traditional topological measurements are stable for short subtexts. When the performance of the authorship recognition task was analyzed, it was found that a proper sampling yields a discriminability similar to the one found with full texts. Surprisingly, the support vector machine classification based on the characterization of short texts outperformed the one performed with entire books. These findings suggest that a local topological analysis of large documents might improve its global characterization. Most importantly, it was verified, as a proof of principle, that short texts can be analyzed with the methods and concepts of complex networks. As a consequence, the techniques described here can be extended in a straightforward fashion to analyze texts as time-varying complex networks.
CLDec 3, 2014
A perspective on the advancement of natural language processing tasks via topological analysis of complex networksDiego R. Amancio
Comment on "Approaching human language with complex networks" by Cong and Liu (Physics of Life Reviews, Volume 11, Issue 4, December 2014, Pages 598-618).
CLJun 17, 2013
Discriminating word senses with tourist walks in complex networksThiago C. Silva, Diego R. Amancio
Patterns of topological arrangement are widely used for both animal and human brains in the learning process. Nevertheless, automatic learning techniques frequently overlook these patterns. In this paper, we apply a learning technique based on the structural organization of the data in the attribute space to the problem of discriminating the senses of 10 polysemous words. Using two types of characterization of meanings, namely semantical and topological approaches, we have observed significative accuracy rates in identifying the suitable meanings in both techniques. Most importantly, we have found that the characterization based on the deterministic tourist walk improves the disambiguation process when one compares with the discrimination achieved with traditional complex networks measurements such as assortativity and clustering coefficient. To our knowledge, this is the first time that such deterministic walk has been applied to such a kind of problem. Therefore, our finding suggests that the tourist walk characterization may be useful in other related applications.
CLMar 2, 2013
Structure-semantics interplay in complex networks and its effects on the predictability of similarity in textsDiego R. Amancio, Osvaldo N. Oliveira, Luciano da F. Costa
There are different ways to define similarity for grouping similar texts into clusters, as the concept of similarity may depend on the purpose of the task. For instance, in topic extraction similar texts mean those within the same semantic field, whereas in author recognition stylistic features should be considered. In this study, we introduce ways to classify texts employing concepts of complex networks, which may be able to capture syntactic, semantic and even pragmatic features. The interplay between the various metrics of the complex networks is analyzed with three applications, namely identification of machine translation (MT) systems, evaluation of quality of machine translated texts and authorship recognition. We shall show that topological features of the networks representing texts can enhance the ability to identify MT systems in particular cases. For evaluating the quality of MT texts, on the other hand, high correlation was obtained with methods capable of capturing the semantics. This was expected because the golden standards used are themselves based on word co-occurrence. Notwithstanding, the Katz similarity, which involves semantic and structure in the comparison of texts, achieved the highest correlation with the NIST measurement, indicating that in some cases the combination of both approaches can improve the ability to quantify quality in MT. In authorship recognition, again the topological features were relevant in some contexts, though for the books and authors analyzed good results were obtained with semantic features as well. Because hybrid approaches encompassing semantic and topological features have not been extensively used, we believe that the methodology proposed here may be useful to enhance text classification considerably, as it combines well-established strategies.
SOC-PHMar 2, 2013
Probing the statistical properties of unknown texts: application to the Voynich ManuscriptDiego R. Amancio, Eduardo G. Altmann, Diego Rybski et al.
While the use of statistical physics methods to analyze large corpora has been useful to unveil many patterns in texts, no comprehensive investigation has been performed investigating the properties of statistical measurements across different languages and texts. In this study we propose a framework that aims at determining if a text is compatible with a natural language and which languages are closest to it, without any knowledge of the meaning of the words. The approach is based on three types of statistical measurements, i.e. obtained from first-order statistics of word properties in a text, from the topology of complex networks representing text, and from intermittency concepts where text is treated as a time series. Comparative experiments were performed with the New Testament in 15 different languages and with distinct books in English and Portuguese in order to quantify the dependency of the different measurements on the language and on the story being told in the book. The metrics found to be informative in distinguishing real texts from their shuffled versions include assortativity, degree and selectivity of words. As an illustration, we analyze an undeciphered medieval manuscript known as the Voynich Manuscript. We show that it is mostly compatible with natural languages and incompatible with random texts. We also obtain candidates for key-words of the Voynich Manuscript which could be helpful in the effort of deciphering it. Because we were able to identify statistical measurements that are more dependent on the syntax than on the semantics, the framework may also serve for text analysis in language-dependent applications.
SOC-PHFeb 19, 2013
On the use of topological features and hierarchical characterization for disambiguating names in collaborative networksDiego R. Amancio, Osvaldo N. Oliveira, Luciano da F. Costa
Many features of complex systems can now be unveiled by applying statistical physics methods to treat them as social networks. The power of the analysis may be limited, however, by the presence of ambiguity in names, e.g., caused by homonymy in collaborative networks. In this paper we show that the ability to distinguish between homonymous authors is enhanced when longer-distance connections are considered, rather than looking at only the immediate neighbors of a node in the collaborative network. Optimized results were obtained upon using the 3rd hierarchy in connections. Furthermore, reasonable distinction among authors could also be achieved upon using pattern recognition strategies for the data generated from the topology of the collaborative network. These results were obtained with a network from papers in the arXiv repository, into which homonymy was deliberately introduced to test the methods with a controlled, reliable dataset. In all cases, several methods of supervised and unsupervised machine learning were used, leading to the same overall results. The suitability of using deeper hierarchies and network topology was confirmed with a real database of movie actors, with the additional finding that the distinguishing ability can be further enhanced by combining topology features and long-range connections in the collaborative network.
SOC-PHFeb 19, 2013
Complex networks analysis of language complexityDiego R. Amancio, Sandra M. Aluisio, Osvaldo N. Oliveira et al.
Methods from statistical physics, such as those involving complex networks, have been increasingly used in quantitative analysis of linguistic phenomena. In this paper, we represented pieces of text with different levels of simplification in co-occurrence networks and found that topological regularity correlated negatively with textual complexity. Furthermore, in less complex texts the distance between concepts, represented as nodes, tended to decrease. The complex networks metrics were treated with multivariate pattern recognition techniques, which allowed us to distinguish between original texts and their simplified versions. For each original text, two simplified versions were generated manually with increasing number of simplification operations. As expected, distinction was easier for the strongly simplified versions, where the most relevant metrics were node strength, shortest paths and diversity. Also, the discrimination of complex texts was improved with higher hierarchical network metrics, thus pointing to the usefulness of considering wider contexts around the concepts. Though the accuracy rate in the distinction was not as high as in methods using deep linguistic knowledge, the complex network approach is still useful for a rapid screening of texts whenever assessing complexity is essential to guarantee accessibility to readers with limited reading ability
SOC-PHFeb 18, 2013
Word sense disambiguation via high order of learning in complex networksThiago C. Silva, Diego R. Amancio
Complex networks have been employed to model many real systems and as a modeling tool in a myriad of applications. In this paper, we use the framework of complex networks to the problem of supervised classification in the word disambiguation task, which consists in deriving a function from the supervised (or labeled) training data of ambiguous words. Traditional supervised data classification takes into account only topological or physical features of the input data. On the other hand, the human (animal) brain performs both low- and high-level orders of learning and it has facility to identify patterns according to the semantic meaning of the input data. In this paper, we apply a hybrid technique which encompasses both types of learning in the field of word sense disambiguation and show that the high-level order of learning can really improve the accuracy rate of the model. This evidence serves to demonstrate that the internal structures formed by the words do present patterns that, generally, cannot be correctly unveiled by only traditional techniques. Finally, we exhibit the behavior of the model for different weights of the low- and high-level classifiers by plotting decision boundaries. This study helps one to better understand the effectiveness of the model.
SOC-PHFeb 18, 2013
Unveiling the relationship between complex networks metrics and word sensesDiego R. Amancio, Osvaldo N. Oliveira, Luciano da F. Costa
The automatic disambiguation of word senses (i.e., the identification of which of the meanings is used in a given context for a word that has multiple meanings) is essential for such applications as machine translation and information retrieval, and represents a key step for developing the so-called Semantic Web. Humans disambiguate words in a straightforward fashion, but this does not apply to computers. In this paper we address the problem of Word Sense Disambiguation (WSD) by treating texts as complex networks, and show that word senses can be distinguished upon characterizing the local structure around ambiguous words. Our goal was not to obtain the best possible disambiguation system, but we nevertheless found that in half of the cases our approach outperforms traditional shallow methods. We show that the hierarchical connectivity and clustering of words are usually the most relevant features for WSD. The results reported here shine light on the relationship between semantic and structural parameters of complex networks. They also indicate that when combined with traditional techniques the complex network approach may be useful to enhance the discrimination of senses in large texts