CL | DATA-AN | Jan 7, 2020

Heaps' law and Heaps functions in tagged texts: Evidences of their linguistic relevance

arXiv:2001.02178v1 | 14 citations
AI Analysis

This work addresses the linguistic relevance of Heaps' law variants for researchers in computational linguistics, though it is incremental as it builds on existing statistical analyses without introducing new methods.

The study analyzed vocabulary growth in 75 English literary works, finding that while overall vocabulary size follows Heaps' law, the appearance of new words deviates systematically from random shufflings, with verbs and other tags showing distinct retardation patterns.

We study the relationship between vocabulary size and text length in a corpus of 75 literary works in English, authored by six writers, distinguishing between the contributions of three grammatical classes (or "tags," namely, "nouns", "verbs", and "others"), and analyze the progressive appearance of new words of each tag along each individual text. While the power-law relation prescribed by Heaps' law is satisfactorily fulfilled by total vocabulary sizes and text lengths, the appearance of new words in each text is on the whole well described by the average of random shufflings of the text, which does not obey a power law. Deviations from this average, however, are statistically significant and show a systematic trend across the corpus. Specifically, they reveal that the appearance of new words along each text is predominantly retarded with respect to the average of random shufflings. Moreover, different tags are shown to add systematically distinct contributions to this tendency, with "verbs" and "others" being respectively more and less retarded than the mean trend, and "nouns" following instead this overall mean. These statistical systematicities are likely to point to the existence of linguistically relevant information stored in the different variants of Heaps' law, a feature that is still in need of extensive assessment.
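The comparison described in the abstract, between the vocabulary growth of a text and the average over random shufflings of the same text, can be illustrated with a minimal Python sketch. It is not the authors' code. The function names, the number of shufflings, the toy sentence, and the sign convention (actual minus shuffled average, so negative values mean new words appear later than in the shuffled baseline) are assumptions made here for illustration, and the split by grammatical tag is omitted.

```python
import random


def vocabulary_growth(tokens):
    """Return V(n) for n = 1..len(tokens): distinct words among the first n tokens."""
    seen = set()
    growth = []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth


def shuffled_average(tokens, n_shuffles=200, seed=0):
    """Average the vocabulary growth curve over random shufflings of the text.

    n_shuffles and seed are illustrative choices, not values from the paper.
    """
    rng = random.Random(seed)
    work = list(tokens)
    totals = [0] * len(work)
    for _ in range(n_shuffles):
        rng.shuffle(work)
        for i, v in enumerate(vocabulary_growth(work)):
            totals[i] += v
    return [t / n_shuffles for t in totals]


def deviation(tokens, n_shuffles=200, seed=0):
    """Actual growth minus shuffled average at each position.

    Assumed convention: negative values indicate that new words appear
    later in the real text than in the shuffled baseline.
    """
    actual = vocabulary_growth(tokens)
    average = shuffled_average(tokens, n_shuffles, seed)
    return [a - m for a, m in zip(actual, average)]


if __name__ == "__main__":
    # Toy input; a real analysis would use a tokenized literary text.
    tokens = "the cat sat on the mat and the dog sat on the log".split()
    for n, d in enumerate(deviation(tokens), start=1):
        print(n, round(d, 3))
```

Both curves start at one word and end at the full vocabulary size, so the deviation is zero at the first and last positions by construction; any systematic signal lies in between.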