SOC-PH | CL | DATA-AN | Jun 17, 2014

Scaling laws and fluctuations in the statistics of word frequencies

arXiv:1406.4441v2 | 64 citations
AI Analysis

This work addresses fundamental statistical properties in natural language processing, with implications for measuring lexical richness, though it is incremental as it builds on known laws like Heaps' and Zipf's.

The authors explain scaling laws in word-frequency statistics by analyzing large text databases and developing simple stochastic models. They find that inhomogeneous word dissemination reduces the average vocabulary size, while correlations in word co-occurrence increase its variance, making vocabulary size a non-self-averaging quantity.

In this paper we combine statistical analysis of large text databases and simple stochastic models to explain the appearance of scaling laws in the statistics of word frequencies. Besides the sublinear scaling of the vocabulary size with database size (Heaps' law), here we report a new scaling of the fluctuations around this average (fluctuation scaling analysis). We explain both scaling laws by modeling the usage of words by simple stochastic processes in which the overall distribution of word frequencies is fat-tailed (Zipf's law) and the frequency of a single word is subject to fluctuations across documents (as in topic models). In this framework, the mean and the variance of the vocabulary size can be expressed as quenched averages, implying that: i) the inhomogeneous dissemination of words causes a reduction of the average vocabulary size in comparison to the homogeneous case, and ii) correlations in the co-occurrence of words lead to an increase in the variance, and the vocabulary size becomes a non-self-averaging quantity. We address the implications of these observations for the measurement of lexical richness. We test our results in three large text databases (Google-ngram, English Wikipedia, and a collection of scientific articles).
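The mechanism described in the abstract can be illustrated with a minimal numerical sketch. It is not the authors' code, and every modeling choice in it is an assumption made for illustration: word frequencies follow a pure power law in rank (a simple Zipf form), each word's count in a text of a given length is drawn independently from a Poisson distribution, and the document-to-document fluctuation of a word's frequency is modeled by a gamma-distributed factor with mean 1. The function names and the parameters `gamma` and `shape` are hypothetical. Under these assumptions the expected vocabulary grows sublinearly with text length, and the fluctuating frequencies lower the average vocabulary, in line with point i). Because the factors are drawn independently per word, the sketch does not reproduce the co-occurrence correlations behind point ii).

```python
import numpy as np


def zipf_frequencies(n_words, gamma=1.0):
    """Normalized frequencies f_r proportional to r**(-gamma) (assumed Zipf form)."""
    ranks = np.arange(1, n_words + 1, dtype=float)
    weights = ranks ** (-gamma)
    return weights / weights.sum()


def expected_vocabulary(freqs, n_tokens):
    """Expected number of distinct words if word r occurs Poisson(n_tokens * f_r) times.

    A word is present with probability 1 - exp(-n_tokens * f_r).
    """
    return float(np.sum(1.0 - np.exp(-n_tokens * freqs)))


def sample_vocabulary(freqs, n_tokens, rng, shape=None):
    """Draw one vocabulary size.

    If `shape` is given, each frequency is first multiplied by an independent
    gamma factor with mean 1 (an assumed stand-in for topic-like fluctuations
    of word usage across documents).
    """
    if shape is not None:
        freqs = freqs * rng.gamma(shape, 1.0 / shape, size=freqs.size)
    counts = rng.poisson(n_tokens * freqs)
    return int(np.count_nonzero(counts))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    freqs = zipf_frequencies(50_000)
    for n_tokens in (1_000, 10_000, 100_000):
        homogeneous = expected_vocabulary(freqs, n_tokens)
        inhomogeneous = np.mean(
            [sample_vocabulary(freqs, n_tokens, rng, shape=0.5) for _ in range(20)]
        )
        print(n_tokens, round(homogeneous), round(float(inhomogeneous)))
```

The reduction of the mean follows from the concavity of 1 - exp(-x): averaging over a fluctuating frequency can only lower the probability that a word appears, compared with using its mean frequency.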
