CL, SOC-PH, OT | Mar 6, 2018

Co-occurrence of the Benford-like and Zipf Laws Arising from the Texts Representing Human and Artificial Languages

arXiv:1803.03667v1 | 0.23 citations
Originality: Synthesis-oriented
AI Analysis

This work provides insights into statistical patterns in language data, which could inform natural language processing and computational linguistics, but it is incremental as it applies known laws to new language types.

The study analyzed large texts from human (English, Russian, Ukrainian) and artificial (C++, Java) languages, finding that they exhibit patterns described by the Benford-like and Zipf laws, with artificial languages showing steeper slopes in double logarithmic plots compared to human languages.

We demonstrate that large texts representing human (English, Russian, Ukrainian) and artificial (C++, Java) languages display quantitative patterns characterized by the Benford-like and Zipf laws. Under the Zipf law, the frequency of a word is inversely proportional to its rank, whereas the total numbers of occurrences of individual words in the text generate an uneven Benford-like distribution of leading digits. Excluding the most frequent words substantially improves the correlation of the actual textual data with the Zipfian distribution, whereas the Benford-like distribution of leading digits (arising from the total number of occurrences of each word) is insensitive to the same elimination procedure. The calculated absolute values of the slopes of the double-logarithmic plots for the artificial languages (C++, Java) are markedly larger than those for the human ones.
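The two measurements described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the tokenizer, the function names (`word_counts`, `zipf_slope`, `leading_digit_distribution`, `benford_expected`), the least-squares fit, and the choice to keep original ranks after dropping the top words are all assumptions made for illustration. The classical Benford formula log10(1 + 1/d) is included only as a reference point, since the abstract speaks of a "Benford-like" distribution without giving its exact form.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count word tokens; the Unicode-letter tokenizer is an assumption."""
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def zipf_slope(counts, skip_top=0):
    """Least-squares slope of log(frequency) versus log(rank).

    skip_top drops the most frequent words before fitting; the remaining
    words keep their original ranks (an assumption of this sketch).
    """
    freqs = sorted(counts.values(), reverse=True)[skip_top:]
    if len(freqs) < 2:
        raise ValueError("need at least two words to fit a slope")
    xs = [math.log(skip_top + i) for i in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def leading_digit_distribution(counts):
    """Share of words whose occurrence count starts with each digit 1..9."""
    digits = Counter(int(str(c)[0]) for c in counts.values() if c > 0)
    total = sum(digits.values())
    return {d: digits.get(d, 0) / total for d in range(1, 10)}


def benford_expected(d):
    """Classical Benford probability of leading digit d (reference only)."""
    return math.log10(1 + 1 / d)
```

For a text loaded into a string `text`, `zipf_slope(word_counts(text), skip_top=10)` gives the slope of the double-logarithmic rank-frequency plot with the ten most frequent words excluded, and `leading_digit_distribution(word_counts(text))` can be compared digit by digit against `benford_expected`.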
