CL · Dec 31, 2018

Types, Tokens, and Hapaxes: A New Heap's Law

arXiv:1901.00521v1 · 8 citations

Originality: Incremental advance

AI Analysis

This work provides a more accurate model for linguistic and computational text analysis, though it appears incremental as it builds upon existing laws like Heap's and Zipf's.

The authors model the type-token relationship in text corpora. They derive a new expression from first principles, report that it is more accurate than Heap's Law on real text, and show that it generalizes to estimating hapaxes and higher n-legomena.

Heap's Law states that in a large enough text corpus, the number of types as a function of tokens grows as $N=KM^{\beta}$ for some free parameters $K, \beta$. Much has been written about how this result and various generalizations can be derived from Zipf's Law. Here we derive from first principles a completely novel expression of the type-token curve and prove its superior accuracy on real text. This expression naturally generalizes to equally accurate estimates for counting hapaxes and higher $n$-legomena.
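
For orientation, here is a minimal sketch of the classical baseline the abstract refers to, not the paper's new expression. It builds an empirical type-token curve (with a running hapax count) and fits $K$ and $\beta$ by ordinary least squares in log-log space. The function names `type_token_curve` and `fit_heaps` are hypothetical, and the log-log fit is an assumed, common way to estimate the parameters; the paper may fit them differently.

```python
import math
from collections import Counter


def type_token_curve(tokens):
    """Return a list of (M, N, hapaxes) after each token.

    M = tokens seen so far, N = distinct types so far,
    hapaxes = types seen exactly once so far.
    """
    counts = Counter()
    hapaxes = 0
    curve = []
    for m, tok in enumerate(tokens, start=1):
        counts[tok] += 1
        if counts[tok] == 1:
            hapaxes += 1      # a new type starts out as a hapax
        elif counts[tok] == 2:
            hapaxes -= 1      # a second occurrence ends its hapax status
        curve.append((m, len(counts), hapaxes))
    return curve


def fit_heaps(curve):
    """Fit N = K * M**beta via least squares on log N = log K + beta * log M.

    Assumes the curve has at least two distinct M values.
    """
    xs = [math.log(m) for m, n, _ in curve]
    ys = [math.log(n) for m, n, _ in curve]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    K = math.exp(mean_y - beta * mean_x)
    return K, beta


if __name__ == "__main__":
    # Toy input for illustration only; real use needs a large corpus.
    tokens = "a b a c b a d".split()
    curve = type_token_curve(tokens)
    print(curve[-1])
    print(fit_heaps(curve))
```

On a real corpus, `tokens` would come from whatever tokenizer is in use; the fitted values depend heavily on that choice and on corpus size.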

