IT, CL | Oct 31, 2013

A Preadapted Universal Switch Distribution for Testing Hilberg's Conjecture

arXiv:1310.8511v2 | 8 citations

Originality: Incremental advance

AI Analysis

This work addresses a theoretical problem in linguistics and information theory by providing a tighter bound for Hilberg's conjecture, though it is incremental as it builds on prior methods.

The paper tackled the problem of improving the upper bound for Hilberg's exponent in natural language by introducing two novel universal codes, the plain and preadapted switch distributions, achieving a bound of ≤0.83 compared to the previous ≤0.94 using Lempel-Ziv code.

Hilberg's conjecture about natural language states that the mutual information between two adjacent long blocks of text grows like a power of the block length. The exponent in this statement can be upper bounded using the pointwise mutual information estimate computed for a carefully chosen code. The lower the compression rate of the code, the better the bound, but the code is required to be universal. To improve a previously known upper bound for Hilberg's exponent, in this paper we introduce two novel universal codes, called the plain switch distribution and the preadapted switch distribution. Generally speaking, switch distributions are certain mixtures of adaptive Markov chains of varying orders with some additional communication to avoid the so-called catch-up phenomenon. The advantage of these distributions is that they both achieve a low compression rate and are guaranteed to be universal. Using the switch distributions, we obtain that a sample of English text is non-Markovian, with Hilberg's exponent being $\le 0.83$, which improves on the previous bound of $\le 0.94$ obtained using the Lempel-Ziv code.
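To make the idea of a pointwise mutual information estimate concrete, the following is a minimal Python sketch under stated assumptions. It uses a plain uniform Bayesian mixture of adaptive Markov chains of orders 0 to `max_order` with add-one smoothing. This is not the paper's plain or preadapted switch distribution and has no mechanism against the catch-up phenomenon. The function names, the smoothing rule, and the parameters `max_order` and `alphabet_size` are hypothetical choices made for illustration only.

```python
import math
from collections import defaultdict


def markov_log2_probs(text, order, alphabet_size):
    """Per-symbol log2 probabilities under an adaptive Markov chain of the
    given order, using add-one (Laplace) smoothing (an assumed estimator)."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    out = []
    for i, ch in enumerate(text):
        # Context is the previous `order` symbols (shorter at the start).
        ctx = text[max(0, i - order):i]
        p = (counts[ctx][ch] + 1) / (totals[ctx] + alphabet_size)
        out.append(math.log2(p))
        counts[ctx][ch] += 1
        totals[ctx] += 1
    return out


def mixture_code_length(text, max_order, alphabet_size):
    """Code length in bits: -log2 of a uniform mixture over orders 0..max_order."""
    if not text:
        return 0.0
    totals = [sum(markov_log2_probs(text, k, alphabet_size))
              for k in range(max_order + 1)]
    m = max(totals)  # factor out the largest term for numerical stability
    log2_mix = m + math.log2(sum(2.0 ** (t - m) for t in totals))
    log2_mix -= math.log2(len(totals))  # uniform prior over the orders
    return -log2_mix


def pointwise_mutual_information(text, n, max_order=3, alphabet_size=256):
    """Estimate L(left) + L(right) - L(left + right) for two adjacent
    blocks of length n, where L is the mixture code length in bits."""
    left, right = text[:n], text[n:2 * n]

    def length(s):
        return mixture_code_length(s, max_order, alphabet_size)

    return length(left) + length(right) - length(left + right)
```

Evaluating `pointwise_mutual_information` over a range of block lengths `n` and inspecting its growth on a log-log scale would give a rough empirical counterpart of the exponent discussed above. Note that `alphabet_size` must be at least the number of distinct symbols in the text for the smoothed estimates to define a valid code.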
