Verifying Heaps' law using Google Books Ngram data
This work provides empirical validation of a linguistic law for researchers in computational linguistics and text analysis, but it is incremental as it applies an existing method to new data.
The researchers verified Heaps' law, which describes how vocabulary size (the number of distinct words) grows with text size, using Google Books Ngram data for European languages, and found that the Heaps exponent varies significantly over intervals of 60-100 years.
This article verifies the empirical Heaps' law for European languages using Google Books Ngram corpus data. The connection between the word frequency distribution and the expected dependence of the number of distinct words on text size is analysed in terms of a simple probability model of text generation. It is shown that the Heaps exponent varies significantly within characteristic time intervals of 60-100 years.
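As an illustration only, Heaps' law is commonly written as V(n) ≈ K·n^β, where V(n) is the number of distinct words among the first n tokens and β is the Heaps exponent. The sketch below estimates β on synthetic data. It assumes tokens are drawn independently from a Zipf-like frequency distribution (one possible simple probability model of text generation) and uses a least-squares fit in log-log coordinates. The function names, the parameters (vocab_size, alpha, start) and the fitting procedure are hypothetical choices made for this example, not the authors' method, and no Google Books Ngram data is used.

```python
import math
import random


def vocabulary_growth(tokens):
    """Return V(n), the number of distinct tokens among the first n, for n = 1..len(tokens)."""
    seen = set()
    growth = []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth


def fit_heaps(growth, start=100):
    """Least-squares fit of log V(n) = log K + beta * log n for n >= start.

    Returns (K, beta). The cutoff `start` is an assumption of this sketch:
    it skips the earliest points, where V(n) is close to n.
    """
    ns = range(start, len(growth) + 1)
    xs = [math.log(n) for n in ns]
    ys = [math.log(growth[n - 1]) for n in ns]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta


def zipf_tokens(n_tokens, vocab_size=50_000, alpha=1.2, seed=0):
    """Draw tokens independently with P(rank r) proportional to r ** -alpha (assumed model)."""
    rng = random.Random(seed)
    weights = [1.0 / r ** alpha for r in range(1, vocab_size + 1)]
    return rng.choices(range(vocab_size), weights=weights, k=n_tokens)


# Synthetic "text": 20,000 tokens from the assumed Zipf-like model.
tokens = zipf_tokens(20_000)
growth = vocabulary_growth(tokens)
K, beta = fit_heaps(growth)
print(f"estimated Heaps exponent beta = {beta:.3f}, K = {K:.3f}")
```

With real corpus data, the same fit could in principle be repeated for each time window and the resulting exponents compared across windows. How vocabulary and text size should be derived from n-gram counts is left open here.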