IT · Mar 26

Investigating the Fundamental Limit: A Feasibility Study of Hybrid-Neural Archival

Marcus Armstrong, ZiWei Qiu, Huy Q. Vo, Arjun Mukherjee

arXiv:2603.25526 · 20.6 · h-index: 3

AI Analysis

This work addresses the problem of semantic redundancy in archival storage for researchers. It is incremental, however, in that its main contribution is a baseline for future studies.

The study investigates the feasibility of using Large Language Models (LLMs) for lossless compression. It identifies hardware non-determinism as a critical barrier and resolves it with a novel logit quantization protocol, which allows compression rates to be measured: 0.39 BPC on memorized data and 0.75 BPC on unseen data. Inference is about 2600× slower than Zstd.
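To make the BPC (bits per character) figures concrete, here is a minimal sketch of how such a rate is typically computed for a model-driven entropy coder: the ideal code length is the sum of -log2(p) over the probabilities the model assigned to the symbols that actually occurred, divided by the number of characters. The function names and the 8-bit raw-character baseline are assumptions for illustration, not details taken from the paper.

```python
import math


def bits_per_character(symbol_probs, num_chars):
    """Ideal entropy-coded length, in bits per character.

    symbol_probs: probability the model gave to each symbol that occurred.
    num_chars: number of characters in the original text.
    """
    total_bits = sum(-math.log2(p) for p in symbol_probs)
    return total_bits / num_chars


def size_fraction(bpc, bits_per_raw_char=8.0):
    """Compressed size as a fraction of the raw size.

    Assumes each raw character occupies bits_per_raw_char bits
    (8 for single-byte text); this baseline is an assumption.
    """
    return bpc / bits_per_raw_char


# Two tokens with probabilities 0.5 and 0.25 cost 1 + 2 = 3 bits;
# if they span 4 characters, the rate is 0.75 BPC.
example_bpc = bits_per_character([0.5, 0.25], num_chars=4)
```

Under the single-byte assumption, `size_fraction(0.39)` and `size_fraction(0.75)` give the compressed size as a fraction of the raw text for the two reported rates.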

Large Language Models (LLMs) possess a theoretical capability to model information density far beyond the limits of classical statistical methods (e.g., Lempel-Ziv). However, utilizing this capability for lossless compression involves navigating severe system constraints, including non-deterministic hardware and prohibitive computational costs. In this work, we present an exploratory study into the feasibility of LLM-based archival systems. We introduce Hybrid-LLM, a proof-of-concept architecture designed to investigate the "entropic capacity" of foundation models in a storage context. We identify a critical barrier to deployment: the "GPU Butterfly Effect," where microscopic hardware non-determinism precludes data recovery. We resolve this via a novel logit quantization protocol, enabling the rigorous measurement of neural compression rates on real-world data. Our experiments reveal a distinct divergence between "retrieval-based" density (0.39 BPC on memorized literature) and "predictive" density (0.75 BPC on unseen news). While current inference latency (≈2600× slower than Zstd) limits immediate deployment to ultra-cold storage, our findings demonstrate that LLMs successfully capture semantic redundancy inaccessible to classical algorithms, establishing a baseline for future research into semantic file systems.
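The abstract does not describe the quantization protocol itself, so the following is only a hypothetical sketch of the general idea. An entropy coder needs the encoder and decoder to build bit-identical probability tables, so one option is to snap floating-point logits to a coarse integer grid before deriving integer frequencies. The function name, the grid step, and the table size are all assumptions made for illustration.

```python
import math


def quantized_frequencies(logits, step=0.01, total=1 << 16):
    """Turn float logits into an integer frequency table (hypothetical).

    Logits are first rounded to integer multiples of `step`, so that tiny
    floating-point differences between runs collapse to the same integers,
    unless a value happens to sit right at a rounding boundary.
    """
    snapped = [round(x / step) for x in logits]  # integers from here on
    top = max(snapped)
    # Weights depend only on the snapped integers, not on the raw floats.
    weights = [math.exp((s - top) * step) for s in snapped]
    norm = sum(weights)
    # Every symbol keeps a frequency of at least 1 so it stays encodable.
    return [max(1, int(w / norm * total)) for w in weights]


# Simulate run-to-run noise of about 1e-6 in the logits.
reference = [2.003, 0.501, -1.248]
perturbed = [x + 1e-6 for x in reference]

table_a = quantized_frequencies(reference)
table_b = quantized_frequencies(perturbed)
tables_match = table_a == table_b
```

Rounding alone does not remove the problem: a logit that lies within the noise margin of a grid boundary can still round differently on two runs, so a practical protocol would need an additional mechanism to handle those cases. The sketch also assumes that the integer-to-frequency step runs on the same deterministic code path for both encoder and decoder.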
