CL DATA-AN Jan 17, 2014

Entropy analysis of word-length series of natural language texts: Effects of text language and genre

Maria Kalimeri, Vassilios Constantoudis, Constantinos Papadimitriou, Kostantinos Karamanos, Fotis K. Diakonos, Haris Papageorgiou

arXiv:1401.4205v1 21 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses a domain-specific problem in computational linguistics, of interest to researchers analyzing text structure. It is incremental in that it builds on existing entropy methods.

The study analyzes word-length series of natural language texts by estimating their n-gram entropies. It finds that these entropies are sensitive to text language and genre, attributes this sensitivity to the probability distribution of single-word lengths, and uses shuffled data to show that word-length correlations also affect the estimates.

We estimate the $n$-gram entropies of natural language texts in word-length representation and find that these are sensitive to text language and genre. We attribute this sensitivity to changes in the probability distribution of the lengths of single words and emphasize the crucial role of the uniformity of probabilities of having words with length between five and ten. Furthermore, comparison with the entropies of shuffled data reveals the impact of word length correlations on the estimated $n$-gram entropies.
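The procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes overlapping (gliding) n-grams, a plain plug-in frequency estimator, and base-2 logarithms, none of which are specified on this page. The function names `word_lengths`, `ngram_entropy`, and `shuffled_entropy` are hypothetical, chosen only for this example.

```python
import math
import random
import re
from collections import Counter


def word_lengths(text):
    """Map a text to its word-length series (one integer per word).

    Assumption: a "word" is a maximal run of Unicode letters.
    """
    return [len(w) for w in re.findall(r"[^\W\d_]+", text)]


def ngram_entropy(series, n):
    """Plug-in Shannon entropy, in bits, of the overlapping n-grams of a series."""
    total = len(series) - n + 1
    if total <= 0:
        raise ValueError("series is shorter than n")
    counts = Counter(tuple(series[i:i + n]) for i in range(total))
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def shuffled_entropy(series, n, seed=0):
    """n-gram entropy after shuffling, which destroys word-length correlations
    while keeping the single-word length distribution unchanged."""
    shuffled = list(series)
    random.Random(seed).shuffle(shuffled)
    return ngram_entropy(shuffled, n)


if __name__ == "__main__":
    sample = "We estimate the entropies of natural language texts in word length representation"
    lengths = word_lengths(sample)
    for n in (1, 2, 3):
        print(n, ngram_entropy(lengths, n), shuffled_entropy(lengths, n))
```

Under these assumptions, the gap between the original and shuffled values for n greater than 1 is a rough indicator of word-length correlations, while the n = 1 value depends only on the length distribution. For real texts the plug-in estimator is biased when the series is short relative to the number of possible n-grams, so the sample above only demonstrates the mechanics.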
