CL SOC-PH Dec 29, 2016

Verifying Heaps' law using Google Books Ngram data

arXiv:1612.09213v1 (2 citations)
Originality: Synthesis-oriented
AI Analysis

This work provides empirical validation of a linguistic law for researchers in computational linguistics and text analysis, but it is incremental as it applies an existing method to new data.

The researchers verified Heaps' law, which describes how vocabulary size (the number of distinct words) grows with text size, using Google Books Ngram data for European languages. They found that the Heaps exponent varies significantly over characteristic intervals of 60-100 years.

This article is devoted to the verification of the empirical Heaps law in European languages using Google Books Ngram corpus data. The connection between the word frequency distribution and the expected dependence of the number of distinct words on text size is analysed in terms of a simple probability model of text generation. It is shown that the Heaps exponent varies significantly within characteristic time intervals of 60-100 years.
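
As a rough illustration of the quantities named in the abstract, Heaps' law is usually written V(n) ≈ K·n^β, where V(n) is the number of distinct words among the first n tokens and β is the Heaps exponent. The sketch below is not the paper's method. It assumes one common toy model of text generation (independent draws from a Zipf-like rank distribution) and estimates β by a least-squares fit in log-log space. The function names, the vocabulary size, the Zipf exponent `alpha`, and the cutoff `n_min` are all hypothetical choices made for this example.

```python
import math
import random
from itertools import accumulate


def zipf_weights(vocab_size, alpha):
    """Unnormalised Zipf weights: the word of rank r gets weight r**(-alpha)."""
    return [r ** -alpha for r in range(1, vocab_size + 1)]


def vocabulary_growth(tokens):
    """Return V(n) for n = 1..len(tokens): distinct words among the first n tokens."""
    seen = set()
    growth = []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth


def fit_heaps(growth, n_min=100):
    """Least-squares fit of log V = log K + beta * log n over n >= n_min."""
    ns = range(n_min, len(growth) + 1)
    xs = [math.log(n) for n in ns]
    ys = [math.log(growth[n - 1]) for n in ns]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    beta = sxy / sxx
    K = math.exp(my - beta * mx)
    return K, beta


# Assumed toy parameters; none of these values come from the paper.
VOCAB_SIZE = 50_000
ALPHA = 1.2
N_TOKENS = 100_000

rng = random.Random(0)  # fixed seed so the run is repeatable
cum = list(accumulate(zipf_weights(VOCAB_SIZE, ALPHA)))
tokens = rng.choices(range(VOCAB_SIZE), cum_weights=cum, k=N_TOKENS)

growth = vocabulary_growth(tokens)
K, beta = fit_heaps(growth)
print(f"K = {K:.2f}, beta = {beta:.3f}")
```

Applying the same fit to vocabulary-growth curves built from successive time windows of a corpus would give one exponent per window, which is the kind of time dependence the abstract describes. How the authors actually construct those curves from the Ngram data is not stated here.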

