Linguistic Structure from a Bottleneck on Sequential Information Processing
This work addresses the fundamental question of why language has its specific structure, with potential impact on linguistics and cognitive science, though it is incremental in that it builds on existing ideas about cognitive constraints.
The paper tackles the problem of explaining the systematic structure of human language. It shows that such structure arises in codes constrained by predictive information, a statistical measure of complexity, and finds that human languages have lower predictive information than baselines at multiple linguistic levels.
Human language has a distinct systematic structure, where utterances break into individually meaningful words which are combined to form phrases. We show that natural-language-like systematicity arises in codes that are constrained by a statistical measure of complexity called predictive information, also known as excess entropy. Predictive information is the mutual information between the past and future of a stochastic process. In simulations, we find that such codes break messages into groups of approximately independent features which are expressed systematically and locally, corresponding to words and phrases. Next, drawing on crosslinguistic text corpora, we find that actual human languages are structured in a way that reduces predictive information compared to baselines at the levels of phonology, morphology, syntax, and lexical semantics. Our results establish a link between the statistical and algebraic structure of language and reinforce the idea that these structures are shaped by communication under general cognitive constraints.
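To make the definition of predictive information concrete, here is a minimal sketch for a toy case. It assumes a stationary first-order Markov chain, where the mutual information between the whole past and the whole future reduces to the mutual information between two adjacent symbols. This is an illustration only, not the estimation method used in the paper, and the function names (`stationary_distribution`, `entropy`, `predictive_information_markov`) are hypothetical.

```python
import numpy as np


def stationary_distribution(P):
    """Stationary distribution of a row-stochastic transition matrix P."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    # Pick the left eigenvector whose eigenvalue is closest to 1.
    idx = np.argmin(np.abs(eigvals - 1.0))
    pi = np.real(eigvecs[:, idx])
    return pi / pi.sum()


def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


def predictive_information_markov(P):
    """Predictive information (excess entropy) in bits.

    Assumption: the process is a stationary first-order Markov chain,
    so I(past; future) equals I(X_t; X_{t+1}).
    """
    P = np.asarray(P, dtype=float)
    pi = stationary_distribution(P)
    joint = pi[:, None] * P  # p(x_t, x_{t+1})
    # I(X_t; X_{t+1}) = H(X_t) + H(X_{t+1}) - H(X_t, X_{t+1})
    return entropy(pi) + entropy(joint.sum(axis=0)) - entropy(joint.ravel())


iid = [[0.5, 0.5], [0.5, 0.5]]          # the past says nothing about the future
alternating = [[0.0, 1.0], [1.0, 0.0]]  # the past fully determines the future
pi_iid = predictive_information_markov(iid)
pi_alt = predictive_information_markov(alternating)
```

For the independent fair-coin source the result is approximately 0 bits, while strict alternation between two symbols gives 1 bit: a predictor must carry one bit about the past to predict the future. A code with lower predictive information places a smaller load on whatever mechanism processes the sequence symbol by symbol.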