CL | Sep 5, 2024

Sirius: Contextual Sparsity with Correction for Efficient LLMs

arXiv:2409.03856v1 | 4 citations | h-index: 16 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses inference efficiency for LLMs, particularly in complex generation tasks, but it is incremental because it builds on existing Contextual Sparsity (CS) methods.

The paper tackles the performance degradation of Contextual Sparsity (CS) methods in reasoning, deduction, and knowledge-based tasks for large language models (LLMs), and introduces Sirius, a correction mechanism that recovers model quality while maintaining efficiency gains, achieving up to 35% latency reduction.

With the rise of large language models (LLMs), inference efficiency becomes increasingly important. Various approximation methods have been proposed to reduce the cost at inference time. Contextual Sparsity (CS) is appealing for its training-free nature and its ability to reach a higher compression ratio seemingly without quality degradation. However, after a comprehensive evaluation of contextual sparsity methods on various complex generation tasks, we find that although CS succeeds in prompt-understanding tasks, it significantly degrades model performance on reasoning, deduction, and knowledge-based tasks. Despite the gap in end-to-end accuracy, we observe that sparse models often share the general problem-solving logic of the original model and require only a few token corrections to recover the original model's performance. This paper introduces Sirius, an efficient correction mechanism that significantly recovers the quality of CS models on reasoning tasks while maintaining their efficiency gain. Sirius is evaluated on 6 models with 8 difficult generation tasks in reasoning, math, and coding, and shows consistent effectiveness and efficiency. We also carefully develop a system implementation for Sirius and show that it achieves roughly a 20% reduction in latency for an 8B model on-chip and a 35% reduction for a 70B model with offloading. We open-source our implementation of Sirius at https://github.com/Infini-AI-Lab/Sirius.git.
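The observation that a sparse model needs "only a few token corrections" can be illustrated with a toy loop. The sketch below is not the Sirius implementation. It is a generic greedy draft-then-verify loop (in the style of speculative decoding) under stated assumptions: both models are stand-in Python functions over integer tokens, and the names `generate_with_correction`, `toy_full_next`, `toy_sparse_next`, and the `check_every` window size are hypothetical, chosen only for illustration.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def generate_with_correction(
    prompt: List[Token],
    sparse_next: NextToken,
    full_next: NextToken,
    max_new_tokens: int,
    check_every: int = 4,
) -> Tuple[List[Token], int]:
    """Draft with the cheap model, periodically verify with the full model.

    Returns the final token sequence and the number of corrections made.
    """
    seq = list(prompt)
    corrections = 0
    target_len = len(prompt) + max_new_tokens

    while len(seq) < target_len:
        start = len(seq)
        # 1) Draft a short window of tokens with the sparse (cheap) model.
        window = min(check_every, target_len - start)
        for _ in range(window):
            seq.append(sparse_next(seq))

        # 2) Verify the window with the full model, position by position.
        #    A real system would batch this check; here it is a plain loop.
        for pos in range(start, len(seq)):
            expected = full_next(seq[:pos])
            if seq[pos] != expected:
                # 3) Roll back to the first mismatch and take the full
                #    model's token, then resume sparse drafting from there.
                del seq[pos:]
                seq.append(expected)
                corrections += 1
                break

    return seq, corrections


# Hypothetical toy models: the "full" model counts upward modulo 10,
# and the "sparse" model makes a mistake at every fifth position.
def toy_full_next(seq: List[Token]) -> Token:
    return (seq[-1] + 1) % 10


def toy_sparse_next(seq: List[Token]) -> Token:
    step = 2 if len(seq) % 5 == 0 else 1
    return (seq[-1] + step) % 10


tokens, num_corrections = generate_with_correction(
    [0], toy_sparse_next, toy_full_next, max_new_tokens=8
)
print(tokens, num_corrections)
```

In this toy setting the corrected output matches what the full model alone would generate, while the full model only has to intervene at the positions where the sparse model drifts. How Sirius decides when to check and which tokens to correct is described in the paper and its repository, not here.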

Code Implementations: 1 repo
