Łukasz Dębowski

IT
h-index1
7papers
31citations
Novelty32%
AI Score32

7 Papers

ITFeb 17, 2023
Multiperiodic Processes: Ergodic Sources with a Sublinear Entropy

Łukasz Dębowski

We construct multiperiodic processes -- a simple example of stationary ergodic (but not mixing) processes over natural numbers that enjoy the vanishing entropy rate under a mild condition. Multiperiodic processes are supported on randomly shifted deterministic sequences called multiperiodic sequences, which can be efficiently generated using an algorithm called the Infinite Clock. Under a suitable parameterization, multiperiodic sequences exhibit relative frequencies of particular numbers given by Zipf's law. Exactly in the same setting, the respective multiperiodic processes satisfy an asymptotic power-law growth of block entropy, called Hilberg's law. Hilberg's law is deemed to hold for statistical language models, in particular.

CLJul 24, 2023
Corrections of Zipf's and Heaps' Laws Derived from Hapax Rate Models

Łukasz Dębowski

The article introduces corrections to Zipf's and Heaps' laws based on systematic models of the proportion of hapaxes, i.e., words that occur once. The derivation rests on two assumptions: The first one is the standard urn model which predicts that marginal frequency distributions for shorter texts look as if word tokens were sampled blindly from a given longer text. The second assumption posits that the hapax rate is a simple function of the text length. Four such functions are discussed: the constant model, the Davis model, the linear model, and the logistic model. It is shown that the logistic model yields the best fit.

ITSep 27, 2022
Local Grammar-Based Coding Revisited

Łukasz Dębowski

In the setting of minimal local grammar-based coding, the input string is represented as a grammar with the minimal output length defined via simple symbol-by-symbol encoding. This paper discusses four contributions to this field. First, we invoke a simple harmonic bound on ranked probabilities, which reminds Zipf's law and simplifies universality proofs for minimal local grammar-based codes. Second, we refine known bounds on the vocabulary size, showing its partial power-law equivalence with mutual information and redundancy. These bounds are relevant for linking Zipf's law with the neural scaling law for large language models. Third, we develop a framework for universal codes with fixed infinite vocabularies, recasting universal coding as matching ranked patterns that are independent of empirical data. Finally, we analyze grammar-based codes with finite vocabularies being empirical rank lists, proving that that such codes are also universal. These results extend foundations of universal grammar-based coding and reaffirm previously stated connections to power laws for human language and language models.

ITDec 15, 2025
From Zipf's Law to Neural Scaling through Heaps' Law and Hilberg's Hypothesis

Łukasz Dębowski

We inspect the deductive connection between the neural scaling law and Zipf's law -- two statements discussed in machine learning and quantitative linguistics. The neural scaling law describes how the cross entropy rate of a foundation model -- such as a large language model -- changes with respect to the amount of training tokens, parameters, and compute. By contrast, Zipf's law posits that the distribution of tokens exhibits a power law tail. Whereas similar claims have been made in more specific settings, we show that the neural scaling law is a consequence of Zipf's law under certain broad assumptions that we reveal systematically. The derivation steps are as follows: We derive Heaps' law on the vocabulary growth from Zipf's law, Hilberg's hypothesis on the entropy scaling from Heaps' law, and the neural scaling from Hilberg's hypothesis. We illustrate these inference steps by a toy example of the Santa Fe process that satisfies all the four statistical laws.

ITJun 14, 2017
Is Natural Language a Perigraphic Process? The Theorem about Facts and Words Revisited

Łukasz Dębowski

As we discuss, a stationary stochastic process is nonergodic when a random persistent topic can be detected in the infinite random text sampled from the process, whereas we call the process strongly nonergodic when an infinite sequence of independent random bits, called probabilistic facts, is needed to describe this topic completely. Replacing probabilistic facts with an algorithmically random sequence of bits, called algorithmic facts, we adapt this property back to ergodic processes. Subsequently, we call a process perigraphic if the number of algorithmic facts which can be inferred from a finite text sampled from the process grows like a power of the text length. We present a simple example of such a process. Moreover, we demonstrate an assertion which we call the theorem about facts and words. This proposition states that the number of probabilistic or algorithmic facts which can be inferred from a text drawn from a process must be roughly smaller than the number of distinct word-like strings detected in this text by means of the PPM compression algorithm. We also observe that the number of the word-like strings for a sample of plays by Shakespeare follows an empirical stepwise power law, in a stark contrast to Markov processes. Hence we suppose that natural language considered as a process is not only non-Markov but also perigraphic.

ITOct 31, 2013
A Preadapted Universal Switch Distribution for Testing Hilberg's Conjecture

Łukasz Dębowski

Hilberg's conjecture about natural language states that the mutual information between two adjacent long blocks of text grows like a power of the block length. The exponent in this statement can be upper bounded using the pointwise mutual information estimate computed for a carefully chosen code. The bound is the better, the lower the compression rate is but there is a requirement that the code be universal. So as to improve a received upper bound for Hilberg's exponent, in this paper, we introduce two novel universal codes, called the plain switch distribution and the preadapted switch distribution. Generally speaking, switch distributions are certain mixtures of adaptive Markov chains of varying orders with some additional communication to avoid so called catch-up phenomenon. The advantage of these distributions is that they both achieve a low compression rate and are guaranteed to be universal. Using the switch distributions we obtain that a sample of a text in English is non-Markovian with Hilberg's exponent being $\le 0.83$, which improves over the previous bound $\le 0.94$ obtained using the Lempel-Ziv code.

STAT-MECHApr 27, 2013
Constant conditional entropy and related hypotheses

Ramon Ferrer-i-Cancho, Łukasz Dębowski, Fermín Moscoso del Prado Martín

Constant entropy rate (conditional entropies must remain constant as the sequence length increases) and uniform information density (conditional probabilities must remain constant as the sequence length increases) are two information theoretic principles that are argued to underlie a wide range of linguistic phenomena. Here we revise the predictions of these principles to the light of Hilberg's law on the scaling of conditional entropy in language and related laws. We show that constant entropy rate (CER) and two interpretations for uniform information density (UID), full UID and strong UID, are inconsistent with these laws. Strong UID implies CER but the reverse is not true. Full UID, a particular case of UID, leads to costly uncorrelated sequences that are totally unrealistic. We conclude that CER and its particular cases are incomplete hypotheses about the scaling of conditional entropies.