Compression and the origins of Zipf's law of abbreviation
The paper offers a theoretical explanation for a pattern that appears universal in human language and recurs in other domains. The contribution is incremental, however, because it builds on existing information-theoretic concepts.
The paper seeks to explain the origins of Zipf's law of abbreviation, the tendency of more frequent words to be shorter. It generalizes an information-theoretic cost function and shows that minimizing this cost is closely tied to a negative correlation between the probability and the magnitude of types.
Languages across the world exhibit Zipf's law of abbreviation, namely that more frequent words tend to be shorter. The generalized version of the law, an inverse relationship between the frequency of a unit and its magnitude, also holds for the behaviours of other species and the genetic code. The apparent universality of this pattern in human language and its ubiquity in other domains call for a theoretical understanding of its origins. To this end, we generalize the information-theoretic concept of mean code length as a mean energetic cost function over the probability and the magnitude of the types of the repertoire. We show that the minimization of that cost function and a negative correlation between probability and the magnitude of types are intimately related.
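The link between cost minimization and a negative probability-magnitude correlation can be illustrated with a small numerical sketch. It assumes the simplest form of the cost, the mean code length sum(p_i * l_i), which may be narrower than the generalized cost function the paper defines. The function names, probabilities and magnitudes below are hypothetical and chosen only for illustration.

```python
from itertools import permutations


def mean_cost(probs, mags):
    """Mean cost under the assumed linear form: sum of p_i * l_i."""
    return sum(p * l for p, l in zip(probs, mags))


def best_assignment(probs, mags):
    """Brute-force search for the assignment of magnitudes to types
    that gives the smallest mean cost (feasible only for tiny repertoires)."""
    return min(permutations(mags), key=lambda perm: mean_cost(probs, perm))


def concordance(probs, mags):
    """Concordant minus discordant pairs, i.e. the numerator of Kendall's tau.
    A negative value means probability and magnitude are inversely ordered."""
    total = 0
    n = len(probs)
    for i in range(n):
        for j in range(i + 1, n):
            prod = (probs[i] - probs[j]) * (mags[i] - mags[j])
            total += (prod > 0) - (prod < 0)
    return total


# Hypothetical repertoire: four types, listed by decreasing probability.
probs = [0.5, 0.25, 0.15, 0.10]
# Hypothetical magnitudes (e.g. lengths) available to assign to the types.
mags = [3, 1, 4, 2]

best = best_assignment(probs, mags)
print(best, mean_cost(probs, best), concordance(probs, best))
```

In this toy case the search gives the smallest magnitude to the most probable type, the next smallest to the next most probable, and so on, so every pair of types is discordant. This is the rearrangement inequality at work, and it illustrates, without proving, the relationship the abstract describes.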