CL May 11, 2023

Autocorrelations Decay in Texts and Applicability Limits of Language Models

arXiv:2305.06615v1, 9 citations

Originality: Incremental advance

AI Analysis

This paper identifies a potential limitation of language models in applications involving long texts, such as analysis or generation.

The study found that word autocorrelations in texts decay via a power law, and this decay differs in generated versus literary texts, indicating that language models with Markov behavior may have limitations for long-text tasks.

We show that the laws of autocorrelations decay in texts are closely related to applicability limits of language models. Using distributional semantics we empirically demonstrate that autocorrelations of words in texts decay according to a power law. We show that distributional semantics provides coherent autocorrelations decay exponents for texts translated to multiple languages. The autocorrelations decay in generated texts is quantitatively and often qualitatively different from the literary texts. We conclude that language models exhibiting Markov behavior, including large autoregressive language models, may have limitations when applied to long texts, whether analysis or generation.
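
The measurement described in the abstract can be illustrated with a minimal sketch. It is not the paper's actual procedure: it assumes each word has already been mapped to a vector by some distributional-semantics model, defines the autocorrelation at lag tau as the average dot product between mean-centered vectors tau positions apart, and estimates the decay exponent with a straight-line fit in log-log space. The function names `autocorrelation` and `fit_power_law` are hypothetical.

```python
import numpy as np


def autocorrelation(vectors, max_lag):
    """Normalized autocorrelation C(tau), tau = 0..max_lag, of a (T, d) vector sequence.

    Assumption: C(tau) is the mean dot product of mean-centered vectors
    that are tau positions apart, divided by C(0). Requires max_lag < T.
    """
    x = np.asarray(vectors, dtype=float)
    x = x - x.mean(axis=0)  # remove the mean vector of the text
    c0 = np.mean(np.sum(x * x, axis=1))
    corr = np.empty(max_lag + 1)
    for tau in range(max_lag + 1):
        # pair every position t with position t + tau
        corr[tau] = np.mean(np.sum(x[: len(x) - tau] * x[tau:], axis=1))
    return corr / c0


def fit_power_law(lags, values):
    """Fit |C(tau)| ~ A * tau**(-alpha) by least squares in log-log space.

    Returns (alpha, A). Lags <= 0 and zero values are skipped,
    because their logarithm is undefined.
    """
    lags = np.asarray(lags, dtype=float)
    values = np.abs(np.asarray(values, dtype=float))
    mask = (lags > 0) & (values > 0)
    slope, intercept = np.polyfit(np.log(lags[mask]), np.log(values[mask]), 1)
    return -slope, np.exp(intercept)
```

Under these assumptions, a text whose correlations follow a power law gives an approximately straight line on a log-log plot of |C(tau)| against tau. An exponentially decaying (Markov-like) sequence bends downward instead, which is the contrast the abstract draws between literary and generated texts.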

