LG | AI | CL | Oct 15, 2024

Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix

arXiv:2410.11261v2 | 32 citations | h-index: 21
Originality: Highly original
AI Analysis

This work addresses the challenge of memory and computational constraints for deploying LLMs on resource-constrained devices, representing a foundational advancement in pruning algorithm design.

The paper tackles the deployment of large language models on edge devices by introducing a pruning approach that directly optimizes the approximation of the attention matrix and accounts for the non-linearity of Softmax attention. It reports computational cost reductions that exceed those of state-of-the-art methods such as SparseGPT and Wanda.

Large Language Models (LLMs) have shown immense potential in enhancing various aspects of our daily lives, from conversational AI to search and AI assistants. However, their growing capabilities come at the cost of extremely large model sizes, making deployment on edge devices challenging due to memory and computational constraints. This paper introduces a novel approach to LLM weight pruning that directly optimizes for approximating the attention matrix, a core component of transformer architectures. Unlike existing methods that focus on linear approximations, our approach accounts for the non-linear nature of the Softmax attention mechanism. We provide theoretical guarantees for the convergence of our Gradient Descent-based optimization method to a near-optimal pruning mask solution. Our empirical results demonstrate that our non-linear pruning approach maintains model performance while significantly reducing computational costs, outperforming the current state-of-the-art methods, SparseGPT and Wanda, by a large margin. This work establishes a new theoretical foundation for pruning algorithm design in LLMs, potentially paving the way for more efficient LLM inference on resource-constrained devices.
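To make the idea of optimizing a pruning mask against the Softmax attention matrix concrete, here is a minimal NumPy sketch. It is not the paper's algorithm. The objective (squared Frobenius error between pruned and dense attention plus an L2 penalty on a relaxed mask in [0, 1]), the merged weight matrix `W` standing in for the query-key product, the hand-derived gradient, the top-k rounding step, and every name and hyperparameter (`prune_attention_weights`, `keep_ratio`, `lam`, `lr`, `steps`) are illustrative assumptions.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax with max-subtraction for numerical stability."""
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def loss_and_grad(M, W, X, A_ref, lam):
    """Assumed objective: 0.5*||softmax(X (M*W) X^T) - A_ref||_F^2 + 0.5*lam*||M||_F^2."""
    A = softmax_rows(X @ (M * W) @ X.T)
    G = A - A_ref                                  # dL/dA
    loss = 0.5 * np.sum(G ** 2) + 0.5 * lam * np.sum(M ** 2)
    # Backpropagate through the row-wise softmax (the non-linear part).
    dS = A * (G - np.sum(G * A, axis=1, keepdims=True))
    # Chain rule through S = X (M*W) X^T, then through the elementwise mask.
    grad = W * (X.T @ dS @ X) + lam * M
    return loss, grad

def prune_attention_weights(W, X, keep_ratio=0.5, lam=1e-3, lr=0.5, steps=200):
    """Projected gradient descent on a relaxed mask, then top-k rounding (assumed)."""
    A_ref = softmax_rows(X @ W @ X.T)              # dense attention to approximate
    M = np.ones_like(W)                            # start from the unpruned model
    for _ in range(steps):
        _, grad = loss_and_grad(M, W, X, A_ref, lam)
        M = np.clip(M - lr * grad, 0.0, 1.0)       # project back onto [0, 1]
    k = int(round(keep_ratio * M.size))
    keep = np.argsort(M, axis=None)[::-1][:k]      # indices of the k largest scores
    mask = np.zeros(M.size)
    mask[keep] = 1.0
    return mask.reshape(M.shape), M

# Toy usage on random data (sizes chosen only for illustration).
rng = np.random.default_rng(0)
n, d = 16, 8
X = rng.standard_normal((n, d))
W = rng.standard_normal((d, d)) / np.sqrt(d)
mask, M_relaxed = prune_attention_weights(W, X, keep_ratio=0.5)
```

The gradient passes through the Softmax Jacobian rather than treating the layer output as linear in the weights. That is the distinction the abstract draws against linear-approximation methods. The final top-k rounding is one simple way to obtain a binary mask and may differ from what the paper does.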

