Entropy and type-token ratio in gigaword corpora
This work provides a quantitative analysis of lexical diversity across languages and genres. Its contribution is incremental: it builds on established statistical laws without introducing new methods.
The study investigates the relationship between entropy and type-token ratio as measures of lexical diversity across six gigaword corpora in English, Spanish, and Turkish. It reveals an empirical functional relation between the two measures and derives an analytical expression for it, based on Zipf's and Heaps' laws, that holds in the limit of large text lengths.
There are different ways of measuring diversity in complex systems. In language in particular, lexical diversity is characterized in terms of the type-token ratio and the word entropy. Here we investigate both diversity metrics in six massive linguistic datasets in English, Spanish, and Turkish, consisting of books, news articles, and tweets. These gigaword corpora correspond to languages with distinct morphological features and differ in register and genre, thus constituting a varied testbed for a quantitative approach to lexical diversity. We unveil an empirical functional relation between the entropy and the type-token ratio of texts from a given corpus and language, which is a consequence of the statistical laws observed in natural language. Further, in the limit of large text lengths, we find an analytical expression for this relation, relying on both Zipf's and Heaps' laws, that agrees with our empirical findings.
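To make the two diversity metrics concrete, the following minimal sketch computes the type-token ratio and the word entropy of a tokenized text. It is an illustration under stated assumptions, not the procedure used in the study: the whitespace tokenizer, the plug-in (relative-frequency) entropy estimator, the choice of bits as the unit, and the function names `tokenize`, `type_token_ratio`, and `word_entropy` are all hypothetical choices made here.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def type_token_ratio(tokens):
    """Number of distinct words (types) divided by the text length (tokens)."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    return len(set(tokens)) / len(tokens)


def word_entropy(tokens):
    """Plug-in Shannon entropy of the word frequency distribution, in bits.

    Assumption: probabilities are estimated as relative frequencies,
    which is known to be biased for short texts.
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")
    n = len(tokens)
    counts = Counter(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    sample = tokenize("the cat saw the dog and the dog saw the cat")
    print("type-token ratio:", type_token_ratio(sample))
    print("word entropy (bits):", word_entropy(sample))
```

Both quantities depend on text length, so comparing them across texts is only meaningful at matched lengths or through a relation such as the one studied here.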