STAT-MECH CL BIO-PH NC

Dec 31, 2025

Large language models and the entropy of English

Colin Scheibner, Lindsay M. Smith, William Bialek

Princeton

arXiv:2512.24969v1 | 1 citation | h-index: 2

Originality: Incremental advance

AI Analysis

This work uses LLMs to characterize long-ranged statistical structure in English text. Its results constrain statistical physics models of LLMs and of language itself, though the advance is incremental.

The study used large language models to analyze long-ranged structure in English texts from a variety of sources. In many cases the conditional entropy continues to decrease with context length out to at least $N\sim 10^4$ characters, implying direct dependencies, and small but significant correlations, between characters at these separations.

We use large language models (LLMs) to uncover long-ranged structure in English texts from a variety of sources. The conditional entropy or code length in many cases continues to decrease with context length at least to $N\sim 10^4$ characters, implying that there are direct dependencies or interactions across these distances. A corollary is that there are small but significant correlations between characters at these separations, as we show from the data independent of models. The distribution of code lengths reveals an emergent certainty about an increasing fraction of characters at large $N$. Over the course of model training, we observe different dynamics at long and short context lengths, suggesting that long-ranged structure is learned only gradually. Our results constrain efforts to build statistical physics models of LLMs or language itself.
