Sentence-Anchored Gist Compression for Long-Context LLMs
The work addresses efficiency problems faced by users of long-context LLMs, but the contribution is incremental because it builds on existing compression techniques.
The paper tackles the memory and computational demands of LLMs by fine-tuning them to compress their context by 2x to 8x without significant performance loss. On a 3-billion-parameter LLaMA model, the method matches other compression techniques while attaining higher compression ratios.
This work investigates context compression for Large Language Models (LLMs) using learned compression tokens to reduce the memory and computational demands of processing long sequences. We demonstrate that pre-trained LLMs can be fine-tuned to compress their context by factors of 2x to 8x without significant performance degradation, as evaluated on both short-context and long-context benchmarks. Furthermore, in experiments on a 3-billion-parameter LLaMA model, our method achieves results on par with alternative compression techniques while attaining higher compression ratios.
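To give a rough sense of what 2x to 8x context compression could mean for memory, the sketch below estimates key-value cache size before and after compression. It is back-of-the-envelope arithmetic, not the paper's method: the layer count, number of key-value heads, head dimension, 16-bit storage, and 32,768-token context are hypothetical values chosen for illustration, and the sketch assumes that only the compressed positions stay in the cache.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens positions."""
    # Factor of 2: one key vector and one value vector per head, per layer.
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_value


def compressed_length(num_tokens, ratio):
    """Cached positions left after compressing by `ratio` (ceiling division)."""
    return -(-num_tokens // ratio)


# Hypothetical configuration for a model of roughly 3B parameters.
# These values are assumptions for illustration, not taken from the paper.
CONFIG = dict(num_layers=28, num_kv_heads=8, head_dim=128)
CONTEXT_TOKENS = 32768  # assumed context length

baseline = kv_cache_bytes(CONTEXT_TOKENS, **CONFIG)
print(f"uncompressed: {baseline / 2**30:.2f} GiB")

for ratio in (2, 4, 8):
    kept = compressed_length(CONTEXT_TOKENS, ratio)
    size = kv_cache_bytes(kept, **CONFIG)
    print(f"{ratio}x compression: {kept} positions, {size / 2**30:.2f} GiB")
```

Under these assumptions the cache shrinks linearly with the compression ratio. A real system may keep some recent tokens uncompressed, so actual savings could be smaller than this estimate suggests.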