LG AI NE | Aug 20, 2025

Lossless Compression of Neural Network Components: Weights, Checkpoints, and K/V Caches in Low-Precision Formats

arXiv:2508.19263v1 | 4 citations
Originality: Incremental advance
AI Analysis

This work addresses storage and memory efficiency for deploying large deep learning models, but it is incremental as it builds on prior compression methods.

The paper tackles the cost of storing and transmitting neural network components by extending lossless compression to low-precision formats such as FP8 and FP4. It reports compression ratios of up to 62% for BF16 and 83% for FP8, and finds that key-value caches in LLMs are also compressible, which yields memory savings.

As deep learning models grow and deployment becomes more widespread, reducing the storage and transmission costs of neural network weights has become increasingly important. While prior work such as ZipNN has shown that lossless compression methods, particularly those based on Huffman-encoding floating-point exponents, can significantly reduce model sizes, these techniques have primarily been applied to higher-precision formats such as FP32 and BF16. In this work, we extend the ZipNN approach to lower-precision floating-point formats, specifically FP8 and FP4, which are gaining popularity for efficient inference. We design a compression method that separates the exponent and mantissa components and compresses each independently using entropy coding. Our evaluation shows compression ratios up to 62% for BF16 and 83% for FP8. We also investigate the compressibility of key-value (K/V) cache tensors used in large language models (LLMs), finding that they, too, exhibit compressible patterns, enabling memory savings during deployment.
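The exponent/mantissa separation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes synthetic Gaussian weights, BF16 values obtained by truncating float32, the sign bit grouped with the mantissa, and a size estimate taken from Huffman code lengths rather than an actual encoded bitstream. All function names are hypothetical.

```python
import heapq
import random
import struct
from collections import Counter


def float_to_bf16_bits(x: float) -> int:
    """Truncate a float32 to BF16 by keeping its top 16 bits (no rounding)."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16


def split_bf16(bits: int) -> tuple[int, int]:
    """Split a BF16 pattern into (exponent byte, sign+mantissa byte)."""
    exponent = (bits >> 7) & 0xFF   # 8 exponent bits
    sign = bits >> 15               # 1 sign bit
    mantissa = bits & 0x7F          # 7 mantissa bits
    return exponent, (sign << 7) | mantissa


def join_bf16(exponent: int, rest: int) -> int:
    """Inverse of split_bf16; shows that the split loses no information."""
    return ((rest >> 7) << 15) | (exponent << 7) | (rest & 0x7F)


def huffman_code_lengths(symbols) -> dict[int, int]:
    """Code length in bits for each distinct symbol under a Huffman code."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tiebreak, {symbol: depth in subtree}).
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


def huffman_bits(symbols: list[int]) -> int:
    """Total payload size in bits if each symbol used its Huffman code."""
    lengths = huffman_code_lengths(symbols)
    return sum(lengths[s] for s in symbols)


# Synthetic stand-in for model weights (an assumption, not the paper's data).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(20000)]
patterns = [float_to_bf16_bits(w) for w in weights]
pairs = [split_bf16(p) for p in patterns]
exponents = [e for e, _ in pairs]
rest = [r for _, r in pairs]

exp_bits = huffman_bits(exponents)
rest_bits = huffman_bits(rest)
raw_bits = 16 * len(weights)

print(f"exponent stream:      {exp_bits / len(weights):.2f} bits/value (raw 8)")
print(f"sign+mantissa stream: {rest_bits / len(weights):.2f} bits/value (raw 8)")
print(f"estimated size:       {(exp_bits + rest_bits) / raw_bits:.1%} of raw")
```

On data like this, the exponent stream should take far fewer than 8 bits per value because the exponents cluster in a narrow range, while the sign and mantissa stream stays close to 8 bits. The estimate ignores the cost of storing the code tables, which a real codec would have to include.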
