CL · Mar 18, 2025

The KoLMogorov Test: Compression by Code Generation

arXiv:2503.13992v1 · 4 citations · h-index: 16 · ICLR
Originality: Incremental advance
AI Analysis

This work assesses intelligence in LLMs through compression. The advance is incremental: it builds on existing compression and code generation methods but introduces a new test framework.

The paper tackles the problem of evaluating and training code-generating LLMs for compression. It introduces the KoLMogorov-Test (KT), which requires a model to generate the shortest program that outputs a given sequence. Current flagship models such as GPT4-o and Llama-3.1-405B perform poorly on both natural and synthetic data. Models trained on synthetic data achieve lower compression rates there than previous approaches, but those gains generalize poorly to real data.

Compression is at the heart of intelligence. A theoretically optimal way to compress any sequence of data is to find the shortest program that outputs that sequence and then halts. However, such 'Kolmogorov compression' is uncomputable, and code generating LLMs struggle to approximate this theoretical ideal, as it requires reasoning, planning and search capabilities beyond those of current models. In this work, we introduce the KoLMogorov-Test (KT), a compression-as-intelligence test for code generating LLMs. In KT a model is presented with a sequence of data at inference time, and asked to generate the shortest program that produces the sequence. We identify several benefits of KT for both evaluation and training: an essentially infinite number of problem instances of varying difficulty is readily available, strong baselines already exist, the evaluation metric (compression) cannot be gamed, and pretraining data contamination is highly unlikely. To evaluate current models, we use audio, text, and DNA data, as well as sequences produced by random synthetic programs. Current flagship models perform poorly - both GPT4-o and Llama-3.1-405B struggle on our natural and synthetic sequences. On our synthetic distribution, we are able to train code generation models with lower compression rates than previous approaches. Moreover, we show that gains on synthetic data generalize poorly to real data, suggesting that new innovations are necessary for additional gains on KT.
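The KT setup described above can be illustrated with a minimal sketch. It is not the paper's evaluation harness: the convention that a candidate program stores its result in a variable named `output`, the helper names `run_program` and `compression_rate`, and the definition of rate as compressed size divided by original size are all assumptions made for illustration. The sketch checks that a candidate program reproduces the sequence exactly, then compares its length against a gzip baseline.

```python
import gzip


def run_program(source: str) -> bytes:
    """Run a candidate program and return the bytes it stores in `output`.

    Assumption: programs communicate their result through a variable
    named `output`. Real use would need a sandbox; exec() is unsafe
    for untrusted, model-generated code.
    """
    namespace: dict = {}
    exec(source, namespace)
    return bytes(namespace["output"])


def compression_rate(compressed_size: int, original_size: int) -> float:
    """Compressed size over original size (assumed metric; lower is better)."""
    return compressed_size / original_size


# A toy sequence with obvious structure: 0..6 repeated 100 times.
sequence = bytes(i % 7 for i in range(700))

# A hypothetical model-generated program that should reproduce the sequence.
candidate = "output = [i % 7 for i in range(700)]"

# A program only counts if it reproduces the sequence exactly.
is_correct = run_program(candidate) == sequence

program_rate = compression_rate(len(candidate.encode("utf-8")), len(sequence))
gzip_rate = compression_rate(len(gzip.compress(sequence)), len(sequence))

print(f"correct={is_correct} program_rate={program_rate:.3f} gzip_rate={gzip_rate:.3f}")
```

On highly structured toy data like this, a short program is easy to find. The abstract's point is that natural audio, text, and DNA sequences are far harder, so generated programs there must still compete with strong existing baselines.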
