Large Language Model as Token Compressor and Decompressor
This work addresses token efficiency for long-context reasoning in NLP applications, though the contribution is incremental: it builds on existing LLM and LoRA techniques.
The paper tackles the problem of token inefficiency in long texts by using an off-the-shelf LLM as a compressor and decompressor, achieving up to 18 times token reduction on datasets like Wikipedia and CNN/DailyMail while maintaining reconstruction fidelity and downstream performance.
In this paper, we establish the novel insight that an off-the-shelf LLM can function as an excellent token compressor and decompressor. To demonstrate this, we design a self-expressive autoencoding learning framework that fine-tunes a pretrained LLM to translate long texts into a compact internal language of discrete, variable-length latent codes, termed Z-tokens, and to reconstruct the original text exactly from them. The resulting representation is content-adaptive: through lightweight LoRA-based adapter heads, semantically dense segments receive more Z-tokens, while redundant or predictable regions are aggressively compressed. Empirically, our method achieves up to 18 times token reduction on Wikipedia, CNN/DailyMail, HotpotQA, and Qulac-style long-query datasets, while preserving reconstruction fidelity and downstream performance. This simple yet effective design supports applications including prompt compression and autoregressive generation directly in the Z-token space, offering a potential pathway toward token-efficient long-context reasoning.
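As a rough illustration of what an "18 times token reduction" means for a token budget, the sketch below computes a compression ratio and the number of Z-tokens implied by a given ratio. The function names and the 1,800-token example are assumptions made for illustration; they are not taken from the paper, and the paper's actual allocation of Z-tokens is content-adaptive rather than a fixed ratio.

```python
import math


def compression_ratio(num_text_tokens: int, num_z_tokens: int) -> float:
    """Ratio of original text tokens to Z-tokens (hypothetical helper)."""
    if num_text_tokens < 0:
        raise ValueError("num_text_tokens must be non-negative")
    if num_z_tokens <= 0:
        raise ValueError("num_z_tokens must be positive")
    return num_text_tokens / num_z_tokens


def z_token_budget(num_text_tokens: int, ratio: float) -> int:
    """Z-tokens needed for a text at a given average ratio, rounded up.

    This assumes a uniform ratio, which is a simplification: a
    content-adaptive compressor spends more Z-tokens on dense passages
    and fewer on redundant ones.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return math.ceil(num_text_tokens / ratio)


# Illustrative numbers only: an 1,800-token passage compressed to 100 Z-tokens.
example_ratio = compression_ratio(1800, 100)
example_budget = z_token_budget(1800, 18)
```

With these illustrative numbers, `example_ratio` is 18.0 and `example_budget` is 100. Since the reported figure is an upper bound ("up to 18 times"), typical ratios on a given corpus may be lower.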