The Lossy Horizon: Error-Bounded Predictive Coding for Lossy Text Compression (Episode I)
This paper addresses the problem of efficient lossy text compression for applications that require high compression ratios with controlled reconstruction error. It represents an incremental advance in leveraging LLMs for compression tasks.
The paper tackles lossy text compression by introducing Error-Bounded Predictive Coding (EPC), which uses a Masked Language Model to predict masked content and stores minimal, rank-based corrections only where the model's top prediction is wrong, achieving higher fidelity at a significantly lower bit rate than the Predictive Masking (PM) baseline.
Large Language Models (LLMs) can achieve near-optimal lossless compression by acting as powerful probability models. We investigate their use in the lossy domain, where reconstruction fidelity is traded for higher compression ratios. This paper introduces Error-Bounded Predictive Coding (EPC), a lossy text codec that leverages a Masked Language Model (MLM) as a decompressor. Instead of storing a subset of original tokens, EPC allows the model to predict masked content and stores minimal, rank-based corrections only when the model's top prediction is incorrect. This creates a residual channel that offers continuous rate-distortion control. We compare EPC to a simpler Predictive Masking (PM) baseline and a transform-based Vector Quantisation with a Residual Patch (VQ+RE) approach. Through an evaluation that includes precise bit accounting and rate-distortion analysis, we demonstrate that EPC consistently dominates PM, offering superior fidelity at a significantly lower bit rate by more efficiently utilising the model's intrinsic knowledge.
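The rank-based residual channel described above can be illustrated with a minimal sketch. This is not the authors' implementation: a toy bigram ranker stands in for the MLM, every second token is masked, and all names (`build_ranker`, `encode`, `decode`) are hypothetical. The sketch assumes every token in the message appears in the ranker's vocabulary and that encoder and decoder share the same deterministic ranking.

```python
from collections import Counter, defaultdict


def build_ranker(corpus_tokens):
    """Toy stand-in for an MLM: rank candidates given the left neighbour."""
    vocab = sorted(set(corpus_tokens))
    followers = defaultdict(Counter)
    for prev, cur in zip(corpus_tokens, corpus_tokens[1:]):
        followers[prev][cur] += 1

    def rank_candidates(prev_token):
        counts = followers[prev_token]
        # Most frequent follower first; ties broken alphabetically so the
        # encoder and decoder always agree on the ordering.
        return sorted(vocab, key=lambda tok: (-counts[tok], tok))

    return rank_candidates


def encode(tokens, rank_candidates):
    """Keep even positions verbatim; store a rank only where the top guess fails."""
    kept = tokens[0::2]
    corrections = {}  # masked position -> rank of the true token
    for i in range(1, len(tokens), 2):
        rank = rank_candidates(tokens[i - 1]).index(tokens[i])
        if rank != 0:  # top prediction wrong, so a correction is needed
            corrections[i] = rank
    return kept, corrections, len(tokens)


def decode(kept, corrections, length, rank_candidates):
    """Rebuild the sequence from kept tokens, predictions, and corrections."""
    out = []
    for i in range(length):
        if i % 2 == 0:
            out.append(kept[i // 2])
        else:
            ranked = rank_candidates(out[i - 1])
            # Without a stored correction, fall back to the top prediction.
            out.append(ranked[corrections.get(i, 0)])
    return out


corpus = "the cat sat on the mat the cat ate the fish".split()
ranker = build_ranker(corpus)
message = "the mat sat on the cat".split()

kept, corrections, length = encode(message, ranker)
restored = decode(kept, corrections, length, ranker)
```

In this toy run only one of the three masked positions needs a correction (`corrections` is `{1: 2}`); the other two are recovered from the ranker's top prediction at no extra cost. Dropping entries from `corrections` would save bits at the price of errors at those positions, which gives a rough intuition for how a residual channel can trade rate against distortion.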