Simple Context Compression: Mean-Pooling and Multi-Ratio Training
This work addresses the computational cost of retrieval-augmented generation (RAG) with LLMs, but its contribution is incremental: it builds on existing soft compression methods.
The paper tackles the computational cost of long contexts in retrieval-augmented generation by developing a simple mean-pooling approach to soft context compression. The approach consistently outperforms the widely used compression-tokens architecture and loses relatively little performance when a single compressor is trained for multiple compression ratios.
A common strategy to reduce the computational costs of using long contexts in retrieval-augmented generation (RAG) with large language models (LLMs) is soft context compression, where the input sequence is transformed into a shorter continuous representation. We develop a lightweight and simple mean-pooling approach that consistently outperforms the widely used compression-tokens architecture, and study training the same compressor to output multiple compression ratios. We conduct extensive experiments across in-domain and out-of-domain QA datasets, as well as across model families, scales, and compression ratios. Overall, our simple mean-pooling approach achieves the strongest performance, with a relatively small drop when training for multiple compression ratios. More broadly though, across architectures and training regimes the trade-offs are more nuanced, illustrating the complex landscape of compression methods.
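To make the core idea concrete, the following is a minimal sketch of mean-pooling compression. It assumes that compression at ratio r means averaging each run of r consecutive token vectors into one vector. The abstract does not specify the encoder, the windowing scheme, or how a trailing partial window is handled, so the function name `mean_pool`, the toy values, and the choice to average a shorter final window are all illustrative assumptions, not the authors' implementation.

```python
from typing import List

Vector = List[float]


def mean_pool(hidden_states: List[Vector], ratio: int) -> List[Vector]:
    """Average each run of `ratio` consecutive vectors into one vector.

    Assumption: a trailing window shorter than `ratio` is averaged over
    the vectors it actually contains.
    """
    if ratio < 1:
        raise ValueError("ratio must be a positive integer")
    pooled: List[Vector] = []
    for start in range(0, len(hidden_states), ratio):
        window = hidden_states[start:start + ratio]
        dim = len(window[0])
        pooled.append(
            [sum(vec[d] for vec in window) / len(window) for d in range(dim)]
        )
    return pooled


# Toy stand-in for encoder outputs: 6 token vectors of dimension 2
# (hypothetical values, not taken from the paper).
states = [
    [1.0, 0.0], [3.0, 2.0], [5.0, 4.0],
    [7.0, 6.0], [9.0, 8.0], [11.0, 10.0],
]

# The same pooling function serves several compression ratios.
for ratio in (2, 3, 6):
    print(ratio, mean_pool(states, ratio))
```

Because the pooling step itself has no learned parameters, the same function can be applied at any ratio. That is one plausible reason a single compressor could be trained for multiple ratios, although the training procedure itself is not described in the abstract.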