Probing the statistical properties of enriched co-occurrence networks
The work provides guidelines for selecting network metrics in applications such as short text analysis, but its contribution is incremental because it builds on existing enrichment methods.
This study investigated the statistical properties of text-based network models enriched with virtual edges from word embeddings, finding that the impact varies by metric: for example, average shortest path and closeness centrality improve informativeness in short texts, while clustering coefficient decreases with more virtual edges.
Recent studies have explored the addition of virtual edges to word co-occurrence networks using word embeddings to enhance graph representations, particularly for short texts. While these enriched networks have demonstrated some success, the impact of incorporating semantic edges into traditional co-occurrence networks remains uncertain. This study investigates two key statistical properties of text-based network models. First, we assess whether network metrics can effectively distinguish between meaningless and meaningful texts. Second, we analyze whether these metrics are more sensitive to syntactic or semantic aspects of the text. Our results show that incorporating virtual edges can have either positive or negative effects, depending on the specific network metric. For instance, the informativeness of the average shortest path and closeness centrality improves in short texts, while the clustering coefficient's informativeness decreases as more virtual edges are added. Additionally, we find that including stopwords affects the statistical properties of enriched networks. Our results can serve as a guideline for determining which network metrics are most appropriate for specific applications, depending on the typical text size and the nature of the problem.
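
To make the enrichment step concrete, the following is a minimal sketch of how virtual edges might be added to a co-occurrence network and how the three metrics named above could be computed. It is not the authors' implementation. The function names, the two-dimensional toy embeddings, the adjacency-based co-occurrence rule, and the similarity threshold of 0.95 are all assumptions chosen for illustration only.

```python
from collections import deque
from itertools import combinations
import math

def cooccurrence_graph(tokens):
    """Link each pair of adjacent tokens with an undirected edge."""
    graph = {t: set() for t in tokens}
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def add_virtual_edges(graph, embeddings, threshold):
    """Return a copy of the graph with edges added between similar words."""
    enriched = {node: set(nbrs) for node, nbrs in graph.items()}
    for a, b in combinations(sorted(graph), 2):
        if a in embeddings and b in embeddings:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                enriched[a].add(b)
                enriched[b].add(a)
    return enriched

def bfs_distances(graph, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def average_shortest_path(graph):
    """Mean distance over all ordered pairs of distinct, reachable nodes."""
    total, pairs = 0, 0
    for node in graph:
        dist = bfs_distances(graph, node)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

def closeness(graph, node):
    """Reachable node count divided by the sum of distances to them."""
    dist = bfs_distances(graph, node)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

def average_clustering(graph):
    """Mean local clustering coefficient; nodes with degree < 2 count as 0."""
    coeffs = []
    for node, nbrs in graph.items():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
        coeffs.append(2 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs) if coeffs else 0.0

# Hypothetical short text and toy embeddings (illustrative values only).
tokens = ["cat", "chased", "mouse", "dog", "watched", "bird"]
embeddings = {
    "cat": (1.0, 0.1), "dog": (1.0, 0.2), "mouse": (0.9, 0.3),
    "bird": (0.8, 0.4), "chased": (0.1, 1.0), "watched": (0.2, 1.0),
}

plain = cooccurrence_graph(tokens)
enriched = add_virtual_edges(plain, embeddings, threshold=0.95)

for name, g in (("plain", plain), ("enriched", enriched)):
    print(name, average_shortest_path(g), closeness(g, "cat"), average_clustering(g))
```

In this toy case the plain network is a simple chain, so any virtual edge between non-adjacent words shortens paths. Whether such changes make a metric more or less informative is the empirical question the study addresses, and this sketch does not reproduce that analysis.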