CL, Oct 9, 2023

Compressing Context to Enhance Inference Efficiency of Large Language Models

arXiv:2310.06201v1, 228 citations, h-index: 15
Originality: Incremental advance
AI Analysis

This work addresses the efficiency challenges that users of large language models face when handling long documents and conversations. It is an incremental advance that builds on existing context compression techniques.

The paper tackles two problems that large language models face on long inputs: high computational cost and context truncation. It proposes Selective Context, a method that compresses the context by pruning redundant information. Experimental results show that a 50% reduction in context cost yields 36% lower inference memory usage and 32% lower inference time, with only minor performance drops (0.023 in BERTScore and 0.038 in faithfulness) on tasks such as summarization and question answering.

Large language models (LLMs) have achieved remarkable performance across various tasks. However, they face challenges in managing long documents and extended conversations, due to significantly increased computational requirements, both in memory and inference time, and potential context truncation when the input exceeds the LLM's fixed context length. This paper proposes a method called Selective Context that enhances the inference efficiency of LLMs by identifying and pruning redundancy in the input context to make the input more compact. We test our approach using common data sources requiring long context processing: arXiv papers, news articles, and long conversations, on tasks of summarisation, question answering, and response generation. Experimental results show that Selective Context significantly reduces memory cost and decreases generation latency while maintaining performance comparable to that achieved with the full context. Specifically, we achieve a 50% reduction in context cost, resulting in a 36% reduction in inference memory usage and a 32% reduction in inference time, while observing only a minor drop of 0.023 in BERTScore and 0.038 in faithfulness on four downstream applications, indicating that our method strikes a good balance between efficiency and performance.
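The pruning idea in the abstract can be illustrated with a small sketch. The abstract does not spell out how redundancy is scored, so this example assumes self-information (negative log probability) as the redundancy measure and estimates it with a toy unigram model built from the input itself, rather than with a language model. The names `self_information` and `compress_context` and the `reduction` parameter are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def self_information(tokens):
    """Return -log2 p(token) for each token.

    Assumption: p is a unigram estimate from the input itself, a toy
    stand-in for the probabilities a language model would assign.
    """
    counts = Counter(tokens)
    total = len(tokens)
    return [-math.log2(counts[t] / total) for t in tokens]


def compress_context(text, reduction=0.5):
    """Drop about `reduction` of the tokens, lowest self-information first.

    Surviving tokens keep their original order.
    """
    tokens = text.split()
    if not tokens:
        return ""
    scores = self_information(tokens)
    n_keep = max(1, math.ceil(len(tokens) * (1 - reduction)))
    # Rank positions by score, highest first; ties favour earlier positions.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    keep = sorted(ranked[:n_keep])
    return " ".join(tokens[i] for i in keep)


context = "the cat sat on the mat and the dog sat on the rug"
compressed = compress_context(context, reduction=0.5)
print(compressed)
```

In this toy input the frequent word "the" carries the least information under the unigram estimate, so it is pruned first. A realistic setup would score tokens or larger units with a language model and would need to check that the compressed context still supports the downstream task.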

Code Implementations: 2 repos