CL · Jan 7, 2020
Heaps' law and Heaps functions in tagged texts: Evidences of their linguistic relevance
Andrés Chacoma, Damián H. Zanette
We study the relationship between vocabulary size and text length in a corpus of 75 literary works in English, authored by six writers, distinguishing between the contributions of three grammatical classes (or "tags," namely, "nouns," "verbs," and "others"), and analyze the progressive appearance of new words of each tag along each individual text. While the power-law relation prescribed by Heaps' law is satisfactorily fulfilled by total vocabulary sizes and text lengths, the appearance of new words in each text is on the whole well described by the average of random shufflings of the text, which does not obey a power law. Deviations from this average, however, are statistically significant and show a systematic trend across the corpus. Specifically, they reveal that the appearance of new words along each text is predominantly retarded with respect to the average of random shufflings. Moreover, different tags are shown to add systematically distinct contributions to this tendency, with "verbs" and "others" being respectively more and less retarded than the mean trend, and "nouns" following instead this overall mean. These statistical systematicities are likely to point to the existence of linguistically relevant information stored in the different variants of Heaps' law, a feature that is still in need of extensive assessment.
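The comparison this abstract describes, between the vocabulary growth of a text and the average growth over random shufflings of the same text, can be illustrated with a minimal sketch. It is not the authors' code: the function names `heaps_curve` and `shuffled_average`, the toy sentence, the whitespace tokenization, and the Monte Carlo estimate with 200 shufflings are all assumptions made for illustration, and no grammatical tagging is attempted.

```python
import random


def heaps_curve(tokens):
    """Return V(n): the number of distinct words among the first n tokens."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve


def shuffled_average(tokens, n_shuffles=200, seed=0):
    """Estimate the mean of V(n) over random shufflings of the tokens.

    This is a Monte Carlo estimate (an assumption of this sketch);
    n_shuffles and seed are illustrative parameters.
    """
    rng = random.Random(seed)
    work = list(tokens)
    totals = [0] * len(work)
    for _ in range(n_shuffles):
        rng.shuffle(work)
        for i, v in enumerate(heaps_curve(work)):
            totals[i] += v
    return [t / n_shuffles for t in totals]


# Toy input, split on whitespace (hypothetical example text).
tokens = "the cat sat on the mat and the dog sat on the log".split()

real = heaps_curve(tokens)
avg = shuffled_average(tokens)

# Negative values mean new words appear later in the real text than in
# the shuffled average, i.e. their appearance is "retarded".
deviation = [r - a for r, a in zip(real, avg)]
```

Both curves necessarily end at the same value, the total vocabulary size, so any deviation lives in the interior of the text. Repeating the computation on the subsequence of tokens carrying a given tag would give a per-tag version of the same comparison.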
CL · Jul 30, 2015
Information-theoretical analysis of the statistical dependencies among three variables: Applications to written language
Damián G. Hernández, Damián H. Zanette, Inés Samengo
We develop the information-theoretical concepts required to study the statistical dependencies among three variables. Some of these dependencies are pure triple interactions, in the sense that they cannot be explained in terms of a combination of pairwise correlations. We derive bounds for triple dependencies, and characterize the shape of the joint probability distribution of three binary variables with high triple interaction. The analysis also allows us to quantify the amount of redundancy in the mutual information between pairs of variables, and to assess whether the information between two variables is or is not mediated by a third variable. These concepts are applied to the analysis of written texts. We find that the probability that a given word is found in a particular location within the text is modulated not only by the presence or absence of other nearby words, but also by the presence or absence of nearby pairs of words. We identify the words enclosing the key semantic concepts of the text, the triplets of words with high pairwise and triple interactions, and the words that mediate the pairwise interactions between other words.
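A standard textbook case of a dependency among three binary variables that no pairwise correlation reveals is the XOR distribution, sketched below. The quantity computed, I(X;Y|Z) - I(X;Y), is the familiar interaction information; it is used here as an assumption for illustration and may differ from the specific measure the authors define. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import log2


def entropy(joint, axes):
    """Shannon entropy (bits) of the marginal of `joint` over `axes`."""
    marginal = defaultdict(float)
    for outcome, p in joint.items():
        marginal[tuple(outcome[i] for i in axes)] += p
    return -sum(p * log2(p) for p in marginal.values() if p > 0)


def mutual_information(joint, a, b):
    """I(A;B) = H(A) + H(B) - H(A,B)."""
    return entropy(joint, [a]) + entropy(joint, [b]) - entropy(joint, [a, b])


def conditional_mutual_information(joint, a, b, c):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    return (entropy(joint, [a, c]) + entropy(joint, [b, c])
            - entropy(joint, [a, b, c]) - entropy(joint, [c]))


# X and Y are independent fair bits; Z = X XOR Y.
xor = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}

# Every pair of variables is independent on its own...
pairwise = [mutual_information(xor, a, b) for a, b in ((0, 1), (0, 2), (1, 2))]

# ...yet knowing Z makes X and Y fully dependent on each other.
triple = (conditional_mutual_information(xor, 0, 1, 2)
          - mutual_information(xor, 0, 1))
```

Here all three pairwise mutual informations vanish while the conditional term equals one bit, so the whole dependency sits at the level of the triplet.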
CL · Dec 10, 2014
Statistical Patterns in Written Language
Damián H. Zanette
Quantitative linguistics has found a place, in the last few decades, within the admittedly blurry boundaries of the field of complex systems. A growing host of applied mathematicians and statistical physicists devote their efforts to disclosing regularities, correlations, patterns, and structural properties of language streams, using techniques borrowed from statistics and information theory. Overall, results can still be categorized as modest, but the prospects are promising: medium- and long-range features in the organization of human language (which are beyond the scope of traditional linguistics) have already emerged from this kind of analysis and continue to be reported, contributing a new perspective to our understanding of this most complex communication system. This short book is intended to review some of these recent contributions.