CLSTAT-MECHOct 20, 2017

Is space a word, too?

arXiv:1710.07729v11 citations
Originality Incremental advance
AI Analysis

This resolves a long-standing theoretical dispute in linguistics and statistical physics about word frequency distributions, though it appears incremental in refining existing models.

The paper tackles the unresolved universality and mechanistic generation of Zipf's law by showing that including non-word objects like spaces and punctuation connects the Zipf-Mandelbrot law to Simon's model, resolving it as a generalized rank-frequency law.

For words, rank-frequency distributions have long been heralded for adherence to a potentially-universal phenomenon known as Zipf's law. The hypothetical form of this empirical phenomenon was refined by Benîot Mandelbrot to that which is presently referred to as the Zipf-Mandelbrot law. Parallel to this, Herbet Simon proposed a selection model potentially explaining Zipf's law. However, a significant dispute between Simon and Mandelbrot, notable empirical exceptions, and the lack of a strong empirical connection between Simon's model and the Zipf-Mandelbrot law have left the questions of universality and mechanistic generation open. We offer a resolution to these issues by exhibiting how the dark matter of word segmentation, i.e., space, punctuation, etc., connect the Zipf-Mandelbrot law to Simon's mechanistic process. This explains Mandelbrot's refinement as no more than a fudge factor, accommodating the effects of the exclusion of the rank-frequency dark matter. Thus, integrating these non-word objects resolves a more-generalized rank-frequency law. Since this relies upon the integration of space, etc., we find support for the hypothesis that $all$ are generated by common processes, indicating from a physical perspective that space is a word, too.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes