On the origin of long-range correlations in texts
This work addresses a fundamental issue in understanding text complexity, of interest to researchers in linguistics and complex systems, though its contribution is incremental in that it builds on prior observations.
The paper tackled the problem of long-range correlations in literary texts by explaining how these correlations originate from structured linguistic levels and propagate to basic text elements, showing that correlations manifest as bursty sequences near semantically relevant topics.
The complexity of human interactions with social and natural phenomena is mirrored in the way we describe our experiences through natural language. In order to retain and convey such high-dimensional information, the statistical properties of our linguistic output have to be highly correlated in time. One example is the robust observation, still largely not understood, of correlations on arbitrarily long scales in literary texts. In this paper we explain how long-range correlations flow from highly structured linguistic levels down to the building blocks of a text (words, letters, etc.). By combining calculations and data analysis we show that correlations take the form of a bursty sequence of events once we approach the semantically relevant topics of the text. The mechanisms we identify are fairly general and can be equally applied to other hierarchical settings.
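To make the notion of a "bursty sequence of events" concrete, the following minimal sketch treats each occurrence of a chosen word as an event and summarizes the gaps between consecutive occurrences with a commonly used burstiness coefficient, B = (sigma - mu) / (sigma + mu), where mu and sigma are the mean and standard deviation of the gaps. This is an illustration under stated assumptions, not the analysis used in the paper: the function names `recurrence_intervals` and `burstiness`, the whitespace tokenization, and the choice of coefficient are all hypothetical choices made here.

```python
from statistics import mean, pstdev


def recurrence_intervals(tokens, target):
    """Return the gaps (in tokens) between consecutive occurrences of `target`."""
    positions = [i for i, tok in enumerate(tokens) if tok == target]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


def burstiness(intervals):
    """Burstiness coefficient B = (sigma - mu) / (sigma + mu) of the gaps.

    B is -1 for perfectly regular gaps, close to 0 for gaps resembling a
    Poisson process, and approaches 1 for strongly clustered occurrences.
    This coefficient is an illustrative choice, not the paper's measure.
    """
    if len(intervals) < 2:
        raise ValueError("need at least two intervals to estimate burstiness")
    mu = mean(intervals)
    sigma = pstdev(intervals)  # population standard deviation of the gaps
    return (sigma - mu) / (sigma + mu)


if __name__ == "__main__":
    # Toy sequences (assumed data): the same word spread evenly vs. clustered.
    regular = "a b a b a b a b".split()
    clustered = ["a"] * 3 + ["b"] * 17 + ["a"] * 3

    print(burstiness(recurrence_intervals(regular, "a")))
    print(burstiness(recurrence_intervals(clustered, "a")))
```

On a real text one would compare the value obtained for a topical word with the value for a function word of similar frequency; a clearly larger value for the topical word would be consistent with the clustering described above, although this sketch alone does not establish long-range correlations.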