CLIR · Jul 12, 2024

Context Embeddings for Efficient Answer Generation in RAG

arXiv:2407.09252v3 · 34 citations · h-index: 26
Originality: Incremental advance
AI Analysis

The paper addresses the efficiency bottleneck in RAG systems, where users wait on slow answer generation, though it appears incremental because it builds on prior context compression techniques.

The paper tackles slow decoding in Retrieval-Augmented Generation (RAG) caused by long contextual inputs. It introduces COCOM, a context compression method that reduces contexts to a few embeddings, achieving a speed-up of up to 5.69× while improving performance over existing methods.

Retrieval-Augmented Generation (RAG) overcomes the limited knowledge of LLMs by extending the input with external information. As a consequence, the contextual inputs to the model become much longer, which slows down decoding and directly increases the time a user has to wait for an answer. We address this challenge by presenting COCOM, an effective context compression method that reduces long contexts to only a handful of Context Embeddings, speeding up generation by a large margin. Our method supports different compression rates, trading off decoding time against answer quality. Compared to earlier methods, COCOM handles multiple contexts more effectively, significantly reducing decoding time for long inputs. Our method demonstrates a speed-up of up to 5.69× while achieving higher performance than existing efficient context compression methods.
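
The effect of the compression rate on input length can be illustrated with some simple arithmetic. The sketch below is not the COCOM implementation. It assumes that a context of n tokens is replaced by ceil(n / rate) embeddings and that the question is left uncompressed. The function names, context lengths, and rates are hypothetical values chosen for illustration, and the sketch says nothing about answer quality or measured speed-up.

```python
import math


def compressed_length(context_tokens: int, compression_rate: int) -> int:
    """Embeddings left after compressing one context.

    Assumption (not taken from the paper): ceil(tokens / rate).
    """
    if context_tokens < 0 or compression_rate < 1:
        raise ValueError("need context_tokens >= 0 and compression_rate >= 1")
    return math.ceil(context_tokens / compression_rate)


def decoder_input_length(context_lengths, question_tokens, compression_rate):
    """Total positions the decoder sees: question plus all compressed contexts."""
    return question_tokens + sum(
        compressed_length(n, compression_rate) for n in context_lengths
    )


if __name__ == "__main__":
    # Hypothetical setup: 5 retrieved contexts of 128 tokens, a 20-token question.
    contexts = [128] * 5
    question = 20
    for rate in (1, 4, 16, 128):  # rate 1 means no compression
        total = decoder_input_length(contexts, question, rate)
        print(f"rate={rate:>3}  decoder input positions={total}")
```

Under these assumptions, a higher rate shrinks the decoder input sharply when several contexts are retrieved, which is the lever the abstract describes for trading decoding time against answer quality.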

