IT CL LG Jun 6, 2023

LLMZip: Lossless Text Compression using Large Language Models

Chandra Shekhara Kaushik Valmeekam, Krishna Narayanan, Dileep Kalathil, Jean-Francois Chamberland, Srinivas Shakkottai

arXiv:2306.04050v2 20.058 citations h-index: 39 Has Code

Originality: Incremental advance

AI Analysis

This work addresses text compression for data storage and transmission, but it appears incremental as it builds on existing methods with a new model.

The paper estimates the entropy of English and performs lossless text compression using LLaMA-7B as a next-token predictor. It reports an asymptotic upper bound on entropy that is significantly smaller than earlier estimates, and preliminary results suggest the scheme outperforms state-of-the-art compressors such as BSC, ZPAQ, and paq8h.

We provide new estimates of an asymptotic upper bound on the entropy of English using the large language model LLaMA-7B as a predictor for the next token given a window of past tokens. This estimate is significantly smaller than currently available estimates in \cite{cover1978convergent}, \cite{lutati2023focus}. A natural byproduct is an algorithm for lossless compression of English text which combines the prediction from the large language model with a lossless compression scheme. Preliminary results from limited experiments suggest that our scheme outperforms state-of-the-art text compression schemes such as BSC, ZPAQ, and paq8h.
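The idea of pairing a next-symbol predictor with a lossless compressor can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the abstract: a toy adaptive character model (`ToyPredictor`) stands in for LLaMA-7B, the predictor's output is turned into a rank for each symbol, and `zlib` serves as the lossless back end. The helper names `encode` and `decode` are hypothetical. The average of -log2 p(symbol | context) that the sketch accumulates is the kind of quantity that serves as an empirical upper-bound estimate on entropy, here in bits per character.

```python
import math
import zlib
from collections import Counter


class ToyPredictor:
    """Stand-in for the language model (assumption): adaptive order-1 character counts."""

    def __init__(self, alphabet):
        self.alphabet = sorted(alphabet)
        self.counts = {}  # previous character -> Counter of next characters

    def ranking(self, context):
        # Most likely symbol first; ties broken alphabetically so that the
        # encoder and decoder always produce the same ordering.
        c = self.counts.get(context, Counter())
        return sorted(self.alphabet, key=lambda s: (-c[s], s))

    def prob(self, context, symbol):
        # Add-one smoothing keeps every probability strictly positive.
        c = self.counts.get(context, Counter())
        return (c[symbol] + 1) / (sum(c.values()) + len(self.alphabet))

    def update(self, context, symbol):
        self.counts.setdefault(context, Counter())[symbol] += 1


def encode(text, alphabet):
    """Return (compressed rank stream, average -log2 p in bits per character)."""
    model = ToyPredictor(alphabet)  # alphabet must have at most 256 symbols
    ranks, bits, prev = [], 0.0, ""
    for ch in text:
        ranks.append(model.ranking(prev).index(ch))  # rank 0 = top prediction
        bits += -math.log2(model.prob(prev, ch))
        model.update(prev, ch)
        prev = ch
    return zlib.compress(bytes(ranks), 9), bits / len(text)


def decode(payload, alphabet):
    """Invert encode() by replaying the same predictor on the decoded prefix."""
    model = ToyPredictor(alphabet)
    out, prev = [], ""
    for r in zlib.decompress(payload):
        ch = model.ranking(prev)[r]
        model.update(prev, ch)
        out.append(ch)
        prev = ch
    return "".join(out)
```

Decoding works because the decoder rebuilds the same predictor state from the symbols it has already recovered, so no model parameters travel with the payload. A better predictor pushes more ranks toward zero, which makes the rank stream more compressible and lowers the bits-per-character estimate.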
