CL | STAT-MECH | AI | May 10, 2024

Correlation Dimension of Natural Language in a Statistical Manifold

arXiv:2405.06321v2 | 3 citations | h-index: 3 | Phys Rev Res
AI Analysis

This work provides a framework for analyzing complex discrete sequences such as language and music. It is incremental in that it extends an existing method to a new mathematical space.

The authors measured the correlation dimension of natural language by applying the Grassberger-Procaccia algorithm in a statistical manifold using Fisher-Rao distance, finding a universal dimension around 6.5, which indicates multifractal self-similarity driven by long memory.

The correlation dimension of natural language is measured by applying the Grassberger-Procaccia algorithm to high-dimensional sequences produced by a large-scale language model. This method, previously studied only in a Euclidean space, is reformulated in a statistical manifold via the Fisher-Rao distance. Language exhibits multifractal structure, with global self-similarity and a universal dimension around 6.5, which is smaller than those of simple discrete random sequences and larger than that of a Barabási-Albert process. Long memory is the key to producing self-similarity. Our method is applicable to any probabilistic model of real-world discrete sequences, and we show an application to music data.
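To make the procedure in the abstract concrete, here is a minimal Python sketch of a Grassberger-Procaccia estimate that uses the Fisher-Rao distance between categorical distributions, d(p, q) = 2 arccos(sum_i sqrt(p_i q_i)). It is an illustration under stated assumptions, not the authors' implementation. The function names are hypothetical, and random Dirichlet samples stand in for the next-token distributions that a language model would produce. The slope is fitted over every radius with a nonzero count, whereas a careful analysis would first identify a linear scaling region.

```python
import numpy as np

def fisher_rao_distance(p, q):
    """Fisher-Rao distance between two categorical distributions."""
    bc = np.sum(np.sqrt(p * q), axis=-1)  # Bhattacharyya coefficient
    return 2.0 * np.arccos(np.clip(bc, 0.0, 1.0))

def pairwise_fisher_rao(points):
    """Distances for all distinct pairs of rows (each row is a distribution)."""
    root = np.sqrt(points)
    bc = np.clip(root @ root.T, 0.0, 1.0)
    upper = np.triu_indices(len(points), k=1)
    return 2.0 * np.arccos(bc[upper])

def correlation_integral(dists, eps):
    """Fraction of pairs whose distance is below the radius eps."""
    return np.mean(dists < eps)

def correlation_dimension(dists, eps_values):
    """Slope of log C(eps) against log eps by least squares."""
    c = np.array([correlation_integral(dists, e) for e in eps_values])
    mask = c > 0  # log is undefined where no pair falls within eps
    slope, _ = np.polyfit(np.log(eps_values[mask]), np.log(c[mask]), 1)
    return slope

# Assumption: toy data only. Dirichlet samples replace model outputs.
rng = np.random.default_rng(0)
points = rng.dirichlet(np.ones(8), size=500)
dists = pairwise_fisher_rao(points)
eps_values = np.logspace(-0.5, 0.1, 8)
dim = correlation_dimension(dists, eps_values)
print(f"estimated correlation dimension: {dim:.2f}")
```

The pairwise distances are computed once and reused for every radius, since the correlation integral only needs the count of pairs below each threshold.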

