CL, AI, IR | Sep 3, 2024

You Only Use Reactive Attention Slice For Long Context Retrieval

arXiv:2409.13695v1 | 1 citation | h-index: 11 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses retrieval bottlenecks in long-context LLM applications. The advance is incremental, since it builds on existing RAG and attention mechanisms.

The paper tackles the problem of inefficient retrieval in long-context LLMs by proposing YOURA, an attention-based retrieval technique that achieves up to 30% inference throughput improvement while maintaining similar quality to baseline methods.

Supporting longer context for Large Language Models (LLMs) is a promising direction to advance LLMs. As training a model for a longer context window is computationally expensive, many alternative solutions, such as Retrieval Augmented Generation (RAG), have been used. However, most existing RAG methods adopt embedding-based retrieval that falls short on long contexts. To address such challenges, we propose an attention-based retrieval technique, You Only Use Reactive Attention slice (YOURA). YOURA leverages a novel retrieval heuristic called reaction score to rank the relevance of each sentence in the input context to the query sentence. Intuitively, we measure how the per-token attention score "reacts" to the query and greedily retrieve the most reactive sentences. Internally, YOURA generates a token-indexed vector (called the reaction vector) for the whole input context. To map each sentence to the token-indexed vector, we propose Embedding-Agnostic Sentence Yield (EASY), a best-effort token wiggling algorithm. We evaluate our retrieval technique on three open-source pre-trained LLMs across six LongBench QA datasets. Our technique achieves up to 30% vLLM inference throughput improvement for serving long-context queries with a nearly identical quality score to the simple yet effective truncate-middle approach.
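
The abstract describes the retrieval flow only at a high level, so the following is a minimal sketch of the general idea rather than the paper's implementation. It assumes, purely for illustration, that the reaction vector is the per-token difference between attention scores measured with and without the query, that a sentence's reaction score is the mean of that vector over the sentence's tokens, and that sentence-to-token spans are already known (the step EASY is meant to handle). The function names, the difference-based score, and the token budget parameter are all hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # [start, end) token indices of one sentence


def reaction_vector(attn_without_query: List[float],
                    attn_with_query: List[float]) -> List[float]:
    """Per-token change in attention once the query is present.

    ASSUMPTION: a plain difference; the paper's exact definition of the
    reaction vector is not given in the abstract.
    """
    return [w - wo for wo, w in zip(attn_without_query, attn_with_query)]


def sentence_scores(reaction: List[float], spans: List[Span]) -> List[float]:
    """Mean reaction over each sentence's token span (assumed aggregation).

    Every span must be non-empty.
    """
    return [sum(reaction[s:e]) / (e - s) for s, e in spans]


def greedy_retrieve(scores: List[float], spans: List[Span],
                    token_budget: int) -> List[int]:
    """Greedily keep the most reactive sentences that fit the token budget.

    Returns the chosen sentence indices in document order.
    """
    chosen: List[int] = []
    used = 0
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for i in ranked:
        length = spans[i][1] - spans[i][0]
        if used + length <= token_budget:
            chosen.append(i)
            used += length
    return sorted(chosen)


# Toy example: three two-token sentences; the query shifts attention
# mostly onto the second sentence.
attn_without = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
attn_with = [0.1, 0.1, 0.4, 0.5, 0.1, 0.2]
spans = [(0, 2), (2, 4), (4, 6)]

scores = sentence_scores(reaction_vector(attn_without, attn_with), spans)
print(greedy_retrieve(scores, spans, token_budget=2))
```

In this toy run, a budget of two tokens leaves room for only one sentence, so the sentence whose tokens gained the most attention is retrieved. A real pipeline would obtain the attention scores from the model itself and would need a robust sentence-to-token mapping, which is the role the abstract assigns to EASY.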

Code Implementations: 1 repo