CL | DATA-AN | Jul 10, 2012

Distinct word length frequencies: distributions and symbol entropies

arXiv:1207.2334v2 | 39 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses a linguistic modeling problem for researchers studying vocabulary structure, but it appears incremental as it builds on existing statistical and information-theoretic approaches.

The paper analyzes how distinct word lengths are distributed in languages. It derives empirical distributions and uses information theory to estimate frequency counts and entropies, yielding methods to compute mean word length, variance, and higher-order entropies.

The distribution of frequency counts of distinct words by length in a language's vocabulary will be analyzed using two methods. The first will look at the empirical distributions of several languages and derive a distribution that reasonably explains the number of distinct words as a function of length. We will be able to derive the frequency count, mean word length, and variance of word length based on the marginal probabilities of letters and spaces. The second, based on information theory, will demonstrate that conditional entropies can also be used to estimate the frequency of distinct words of a given length in a language. In addition, it will be shown how these techniques can be applied to estimate higher-order entropies using vocabulary word length.
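The empirical quantity the abstract starts from, the number of distinct words at each length together with the mean and variance of word length over the vocabulary, can be sketched as follows. This is a minimal illustration, not the paper's derived distribution or its entropy-based estimator: the function name `distinct_length_stats`, the lowercasing and alphabetic-only filtering, and the toy sentence are all assumptions made for the example.

```python
from collections import Counter


def distinct_length_stats(words):
    """Count distinct words by length; return (counts, mean, variance).

    Hypothetical helper: each distinct word is counted once, regardless of
    how often it occurs in the input (vocabulary counts, not token counts).
    """
    # Assumption: case-insensitive vocabulary, alphabetic words only.
    vocab = {w.lower() for w in words if w.isalpha()}
    if not vocab:
        raise ValueError("vocabulary is empty")

    # Number of distinct words at each length.
    counts = Counter(len(w) for w in vocab)
    total = sum(counts.values())

    # Mean and (population) variance of length over distinct words.
    mean = sum(n * c for n, c in counts.items()) / total
    var = sum(c * (n - mean) ** 2 for n, c in counts.items()) / total
    return dict(sorted(counts.items())), mean, var


# Toy input standing in for a real vocabulary list.
sample = "the cat sat on the mat and the dog ran to a red barn".split()
counts, mean, var = distinct_length_stats(sample)
print(counts)  # distinct-word counts keyed by word length
print(mean, var)
```

Applied to a full dictionary or corpus vocabulary for each language, the resulting length histogram is the kind of empirical distribution that a fitted model would then be compared against.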

