Generalized Entropies and the Similarity of Texts
This work offers incremental insights for researchers in computational linguistics and text analysis by refining entropy-based methods for measuring text similarity.
The authors ask which word frequencies dominate generalized entropies and divergences in texts. They show that these measures are dominated by words in a specific frequency range, and they estimate the database sizes needed for reliable divergence estimation, using large datasets of books and scientific papers.
We show how generalized Gibbs-Shannon entropies can provide new insights into the statistical properties of texts. The universal distribution of word frequencies (Zipf's law) implies that the generalized entropies, computed at the word level, are dominated by words in a specific range of frequencies. Here we show that this is the case not only for the generalized entropies but also for the generalized (Jensen-Shannon) divergences, which are used to compute the similarity between different texts. This finding allows us to identify the contribution of specific words (and word frequencies) to the different generalized entropies and also to estimate the size of the databases needed to obtain reliable estimates of the divergences. We test our results on large databases of books (from the Google n-gram database) and scientific papers (indexed by Web of Science).
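As a minimal sketch of the quantities discussed above, the Python below computes an order-alpha generalized entropy and the corresponding generalized Jensen-Shannon divergence from word frequencies. The functional form H_alpha(p) = (sum_i p_i^alpha - 1)/(1 - alpha), which reduces to the Gibbs-Shannon entropy as alpha approaches 1, is an assumption made for illustration, since the text above does not spell out the definition. The function names and the whitespace tokenizer are likewise hypothetical.

```python
import math
from collections import Counter


def word_frequencies(text):
    """Relative word frequencies from lowercased, whitespace-split text.

    The tokenizer is deliberately naive; it is an illustrative assumption.
    """
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    if total == 0:
        raise ValueError("text contains no words")
    return {word: count / total for word, count in counts.items()}


def generalized_entropy(p, alpha):
    """Assumed order-alpha entropy: (sum_i p_i^alpha - 1) / (1 - alpha).

    At alpha = 1 the Shannon limit -sum_i p_i ln p_i is used instead.
    """
    probs = [x for x in p.values() if x > 0]
    if abs(alpha - 1.0) < 1e-12:
        return -sum(x * math.log(x) for x in probs)
    return (sum(x ** alpha for x in probs) - 1.0) / (1.0 - alpha)


def generalized_jsd(p, q, alpha):
    """Generalized Jensen-Shannon divergence between two frequency dicts.

    D_alpha(p, q) = H_alpha((p + q) / 2) - H_alpha(p) / 2 - H_alpha(q) / 2,
    with the equal-weight mixture taken over the union of both vocabularies.
    """
    vocab = set(p) | set(q)
    mixture = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    return (
        generalized_entropy(mixture, alpha)
        - 0.5 * generalized_entropy(p, alpha)
        - 0.5 * generalized_entropy(q, alpha)
    )


if __name__ == "__main__":
    p = word_frequencies("the cat sat on the mat")
    q = word_frequencies("the dog sat on the log")
    for alpha in (0.5, 1.0, 2.0):
        print(alpha, generalized_jsd(p, q, alpha))
```

Under this assumed form, each word contributes a term proportional to p_i^alpha, so larger values of alpha give relatively more weight to frequent words and smaller values give more weight to rare ones. Two identical texts give a divergence of zero, and two texts with disjoint vocabularies give ln 2 at alpha = 1.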