QuickLLaMA: Query-aware Inference Acceleration for Large Language Models
This work addresses the challenge of efficient, accurate long-context processing in LLMs for applications that require deep semantic understanding, though it appears incremental: it builds on existing LLM frameworks and involves no new training.
The paper tackles the problem of LLMs struggling with long-distance dependencies in sequences by introducing Query-aware Inference for LLMs (Q-LLM). On the $\infty$-bench benchmark, Q-LLM improves on the prior state of the art by 7.17% with LLaMA3 and by 3.26% with Mistral, and the LLaMA3 variant (QuickLLaMA) can read Harry Potter within 30 seconds.
The capacity of Large Language Models (LLMs) to comprehend and reason over long contexts is pivotal for advancements in diverse fields. Yet, they still struggle to capture long-distance dependencies within sequences, which limits how deeply they understand semantics. To address this issue, we introduce Query-aware Inference for LLMs (Q-LLM), a system designed to process extensive sequences akin to human cognition. By focusing on memory data relevant to a given query, Q-LLM can accurately capture pertinent information within a fixed window size and provide precise answers to queries. It requires no extra training and can be seamlessly integrated with any LLM. Q-LLM using LLaMA3 (QuickLLaMA) can read Harry Potter within 30s and accurately answer questions about it. On widely recognized benchmarks, Q-LLM improved by 7.17% compared to the current state-of-the-art on LLaMA3, and by 3.26% on Mistral, on the $\infty$-bench. On the Needle-in-a-Haystack and BABILong tasks, Q-LLM improved upon the current SOTA by 7.0% and 6.1%, respectively. Our code can be found at https://github.com/dvlab-research/Q-LLM.
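To make the idea of "focusing on memory data relevant to a given query ... within a fixed window size" concrete, here is a minimal sketch. It is an illustrative assumption, not the Q-LLM implementation: the function names (`dot`, `select_blocks`), the use of one representative vector per memory block, and the dot-product relevance score are all hypothetical choices made for this example.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product, used here as a stand-in relevance score."""
    return sum(x * y for x, y in zip(a, b))


def select_blocks(
    query_vec: Sequence[float],
    block_vecs: Sequence[Sequence[float]],
    window_blocks: int,
) -> List[int]:
    """Pick the memory blocks most relevant to the query.

    Assumption: each memory block is summarized by a single vector.
    Returns at most `window_blocks` block indices, restored to their
    original order so the kept context stays in sequence order.
    """
    scores = [dot(query_vec, v) for v in block_vecs]
    # Rank block indices from most to least relevant to the query.
    ranked = sorted(range(len(block_vecs)), key=lambda i: scores[i], reverse=True)
    # Keep only as many blocks as fit in the fixed window.
    return sorted(ranked[:window_blocks])


if __name__ == "__main__":
    query = [1.0, 0.0]
    blocks = [[0.1, 0.9], [0.8, 0.2], [0.0, 1.0], [0.9, 0.1]]
    print(select_blocks(query, blocks, window_blocks=2))  # prints [1, 3]
```

The window budget stays fixed regardless of how long the full sequence is; only the choice of which blocks occupy it depends on the query.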