LG | CL | Jun 10, 2024

Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters

arXiv:2406.05955v2 | 41 citations | Has Code
AI Analysis

This work addresses efficiency bottlenecks for deploying LLMs on resource-constrained devices like mobile phones, representing a strong domain-specific advancement.

The paper tackles slow large language model (LLM) inference by increasing activation sparsity. It proposes a novel dReLU function and a training data mixture that achieve state-of-the-art performance with minimal activated parameters, yielding a 2-5x decoding speedup and 11 tokens per second on mobile phones.

Exploiting activation sparsity is a promising approach to significantly accelerating the inference process of large language models (LLMs) without compromising performance. However, activation sparsity is determined by activation functions, and commonly used ones like SwiGLU and GeGLU exhibit limited sparsity. Simply replacing these functions with ReLU fails to achieve sufficient sparsity. Moreover, inadequate training data can further increase the risk of performance degradation. To address these challenges, we propose a novel dReLU function, which is designed to improve LLM activation sparsity, along with a high-quality training data mixture ratio to facilitate effective sparsification. Additionally, we leverage sparse activation patterns within the Feed-Forward Network (FFN) experts of Mixture-of-Experts (MoE) models to further boost efficiency. By applying our neuron sparsification method to the Mistral and Mixtral models, only 2.5 billion and 4.3 billion parameters are activated per inference iteration, respectively, while achieving even more powerful model performance. Evaluation results demonstrate that this sparsity achieves a 2-5x decoding speedup. Remarkably, on mobile phones, our TurboSparse-Mixtral-47B achieves an inference speed of 11 tokens per second. Our models are available at https://huggingface.co/PowerInfer.
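
As a rough illustration of the idea in the abstract, the sketch below assumes dReLU takes the form reported in the paper: ReLU applied to both the gate and the up projection of a gated FFN, with the two results multiplied elementwise. The abstract itself does not define the function, so this form, the helper names (`matvec`, `drelu_ffn`), and the toy dimensions are assumptions for illustration, not the authors' implementation.

```python
import random


def relu(v):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def drelu_ffn(x, W_gate, W_up, W_down):
    """Gated FFN with ReLU on both branches (assumed dReLU form).

    W_gate, W_up: d_ff rows of length d_model.
    W_down: d_model rows of length d_ff.
    Returns (output vector, fraction of hidden neurons that are zero).
    """
    gate = relu(matvec(W_gate, x))
    up = relu(matvec(W_up, x))
    # A hidden neuron is non-zero only if BOTH branches are positive.
    hidden = [g * u for g, u in zip(gate, up)]
    active = [i for i, h in enumerate(hidden) if h != 0.0]
    # Sparse down-projection: skip every column whose neuron is zero.
    out = [sum(row[i] * hidden[i] for i in active) for row in W_down]
    sparsity = 1.0 - len(active) / len(hidden)
    return out, sparsity


# Toy demo with random weights (hypothetical sizes, not from the paper).
random.seed(0)
d_model, d_ff = 16, 256
rand_matrix = lambda rows, cols: [
    [random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)
]
x = [random.gauss(0.0, 1.0) for _ in range(d_model)]
y, s = drelu_ffn(
    x, rand_matrix(d_ff, d_model), rand_matrix(d_ff, d_model), rand_matrix(d_model, d_ff)
)
print(f"fraction of inactive neurons: {s:.2f}")
```

With random symmetric weights, each ReLU zeroes roughly half of its inputs, so the product is zero for a large share of neurons even before any training. The sparsity of a trained model depends on the data and training recipe and is not reproduced by this toy example. The sparse down-projection shows where a speedup could come from: work scales with the number of active neurons rather than with `d_ff`.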

Code Implementations: 1 repo
