Frequency-Ordered Tokenization for Better Text Compression
This work provides a simple yet effective preprocessing method to improve text compression ratios and speed for anyone using standard compression algorithms on natural language text, offering an incremental but practical gain.
This paper introduces frequency-ordered tokenization, a preprocessing technique that reorders BPE vocabulary based on token frequency and encodes them with variable-length integers. This method significantly improves lossless text compression, yielding gains of 7.08 percentage points for zlib, 1.69 pp for LZMA, and 0.76 pp for zstd on enwik8, and also accelerates compression for computationally expensive algorithms by making the input smaller.
We present frequency-ordered tokenization, a simple preprocessing technique that improves lossless text compression by exploiting the power-law frequency distribution of natural language tokens (Zipf's law). The method tokenizes text with Byte Pair Encoding (BPE), reorders the vocabulary so that frequent tokens receive small integer identifiers, and encodes the result with variable-length integers before passing it to any standard compressor. On enwik8 (100 MB Wikipedia), this yields improvements of 7.08 percentage points (pp) for zlib, 1.69 pp for LZMA, and 0.76 pp for zstd (all including vocabulary overhead), outperforming the classical Word Replacing Transform. Gains are consistent at 1 GB scale (enwik9) and across Chinese and Arabic text. We further show that preprocessing accelerates compression for computationally expensive algorithms: the total wall-clock time including preprocessing is 3.1x faster than raw zstd-22 and 2.4x faster than raw LZMA, because the preprocessed input is substantially smaller. The method can be implemented in under 50 lines of code.