CL, LG | Feb 4

Proxy Compression for Language Modeling

arXiv:2602.04289v1 | 3 citations | h-index: 7
Originality: Incremental advance
AI Analysis

This paper addresses the efficiency and flexibility limitations of tokenizer-based training for language models, particularly in code domains, though it is an incremental improvement over existing byte-level methods.

The paper tackles the problem of coupling language models to fixed tokenizers by introducing proxy compression, which trains models on both raw bytes and compressed inputs to enable efficient training and strong performance at inference on raw bytes. Experiments on code language modeling show substantial efficiency gains and performance that matches or rivals tokenizer approaches as model scale increases.

Modern language models are trained almost exclusively on token sequences produced by a fixed tokenizer, an external lossless compressor that typically operates over UTF-8 byte sequences, which couples the model to that compressor. This work introduces proxy compression, an alternative training scheme that preserves the efficiency benefits of compressed inputs while providing an end-to-end, raw-byte interface at inference time. During training, a single language model is trained jointly on raw byte sequences and on compressed views generated by external compressors; in the process, the model learns to internally align compressed sequences with raw bytes. This alignment enables strong transfer between the two formats, even when training predominantly on compressed inputs, which are discarded at inference. Extensive experiments on code language modeling demonstrate that proxy compression substantially improves training efficiency and significantly outperforms pure byte-level baselines under fixed compute budgets. As model scale increases, these gains become more pronounced, and proxy-trained models eventually match or rival tokenizer-based approaches, all while operating solely on raw bytes and retaining the inherent robustness of byte-level modeling.
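
The joint training setup described above can be illustrated with a minimal data-preparation sketch. It is not the authors' implementation. The vocabulary layout, the format tags, the `p_compressed` mixing probability, and every function name are hypothetical. `zlib` is only a stand-in for the paper's external compressors, which the abstract does not specify. The sketch shows one way a single model could receive the same text either as raw UTF-8 bytes or as a shorter compressed view.

```python
import random
import zlib

# Hypothetical vocabulary layout (an assumption, not from the paper):
# ids 0..255 are raw bytes, ids 256..511 are bytes of the compressed view,
# and two sentinel ids mark which format a sequence uses.
RAW_TAG = 512
COMPRESSED_TAG = 513
VOCAB_SIZE = 514


def raw_view(text: str) -> list[int]:
    """Encode text as a tagged sequence of UTF-8 byte ids."""
    return [RAW_TAG] + list(text.encode("utf-8"))


def compressed_view(text: str) -> list[int]:
    """Encode text as a tagged sequence of zlib-compressed byte ids.

    zlib is only a stand-in for whatever external compressor is used.
    """
    payload = zlib.compress(text.encode("utf-8"))
    return [COMPRESSED_TAG] + [256 + b for b in payload]


def sample_view(text: str, p_compressed: float, rng: random.Random) -> list[int]:
    """Pick the compressed view with probability p_compressed, else raw bytes."""
    if rng.random() < p_compressed:
        return compressed_view(text)
    return raw_view(text)


def decode_compressed(ids: list[int]) -> str:
    """Invert compressed_view, confirming that the compressor is lossless."""
    payload = bytes(i - 256 for i in ids[1:])
    return zlib.decompress(payload).decode("utf-8")
```

In this sketch, setting `p_compressed` close to 1 corresponds to training predominantly on compressed inputs, while inference would use only `raw_view`, so the compressor is not needed at that point.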
