CL, Dec 31, 2018
Types, Tokens, and Hapaxes: A New Heap's Law
Victor Davis
Heap's Law states that in a large enough text corpus, the number of types $N$ grows as a function of the number of tokens $M$ according to $N = KM^{\beta}$ for some free parameters $K, \beta$. Much has been written about how this result and various generalizations can be derived from Zipf's Law. Here we derive from first principles a completely novel expression of the type-token curve and prove its superior accuracy on real text. This expression naturally generalizes to equally accurate estimates for counting hapaxes and higher $n$-legomena.
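
To make the quantities in the abstract concrete, the following is a minimal sketch of the classical baseline only: it builds an empirical type-token curve, fits the two free parameters of $N = KM^{\beta}$ by least squares in log-log space, and counts $n$-legomena (types occurring exactly $n$ times, with hapaxes being the case $n = 1$). It does not implement the new expression announced here. The function names `type_token_curve`, `fit_heaps`, and `count_legomena` are hypothetical, and whitespace tokenization is an assumption made for illustration.

```python
import math
from collections import Counter


def type_token_curve(tokens):
    """Return (M, N) pairs: after reading M tokens, N distinct types were seen."""
    seen = set()
    curve = []
    for m, tok in enumerate(tokens, start=1):
        seen.add(tok)
        curve.append((m, len(seen)))
    return curve


def fit_heaps(curve):
    """Fit N = K * M**beta by ordinary least squares on log N vs. log M.

    This is the classical power-law fit, not the paper's new expression.
    Requires at least two points with distinct M.
    """
    xs = [math.log(m) for m, _ in curve]
    ys = [math.log(n) for _, n in curve]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta


def count_legomena(tokens, n=1):
    """Count types that occur exactly n times (n=1 gives hapaxes)."""
    freq = Counter(tokens)
    return sum(1 for c in freq.values() if c == n)


if __name__ == "__main__":
    # Toy input; a real corpus would be far larger (assumed whitespace tokens).
    tokens = "the cat sat on the mat the end".split()
    curve = type_token_curve(tokens)
    k, beta = fit_heaps(curve)
    print("tokens, types:", curve[-1])
    print("K = %.3f, beta = %.3f" % (k, beta))
    print("hapaxes:", count_legomena(tokens, 1))
```

On a corpus this small the fitted parameters carry little meaning; the sketch is only meant to show what is being measured when a type-token curve or a hapax count is compared against a model.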