Corrections of Zipf's and Heaps' Laws Derived from Hapax Rate Models
This work contributes to statistical linguistics by refining two fundamental laws of text analysis, though the contribution is incremental, since it builds on existing models.
The paper corrects Zipf's and Heaps' laws by modeling the proportion of hapaxes (words that occur exactly once) in texts, and it shows that the logistic model fits best among the four functions tested.
The article introduces corrections to Zipf's and Heaps' laws based on systematic models of the proportion of hapaxes, i.e., words that occur once. The derivation rests on two assumptions. The first is the standard urn model, which predicts that marginal frequency distributions for shorter texts look as if word tokens were sampled blindly from a given longer text. The second posits that the hapax rate is a simple function of the text length. Four such functions are discussed: the constant model, the Davis model, the linear model, and the logistic model. It is shown that the logistic model yields the best fit.
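The urn-model assumption can be illustrated with a minimal Python sketch. It draws tokens blindly (without replacement) from a longer text and measures the hapax rate of each shorter sample. Two points are assumptions of this sketch rather than statements from the article: the hapax rate is taken to be the number of hapaxes divided by the number of distinct word types, and the function names `hapax_rate` and `urn_sample_hapax_rates` are hypothetical. The four hapax rate models themselves are not reproduced here, since their functional forms are not given above.

```python
import random
from collections import Counter


def hapax_rate(tokens):
    """Share of distinct word types that occur exactly once.

    Assumption: the rate is normalized by the number of types,
    not by the number of tokens.
    """
    counts = Counter(tokens)
    if not counts:
        return 0.0
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)


def urn_sample_hapax_rates(tokens, lengths, seed=0):
    """Hapax rate of urn samples of the given lengths.

    Each sample draws n tokens without replacement from the full
    token list, ignoring word order, as in the urn model.
    """
    rng = random.Random(seed)
    rates = {}
    for n in lengths:
        sample = rng.sample(tokens, n)  # blind draw without replacement
        rates[n] = hapax_rate(sample)
    return rates


# Toy "longer text": 13 tokens, 8 types, 5 of which occur once.
text = "the cat sat on the mat and the dog sat on the log".split()
full_rate = hapax_rate(text)  # 5 / 8 = 0.625
sampled = urn_sample_hapax_rates(text, lengths=[4, 8, 13])
```

On a real corpus, plotting `sampled` against the sample length gives the empirical hapax rate curve that a function of text length would then be fitted to. A single random draw per length is noisy; averaging over several seeds would be a natural refinement.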