CL · AI · Mar 28, 2024

Streamlining Redundant Layers to Compress Large Language Models

Xiaodong Chen, Yuxuan Hu, Jing Zhang, Yanling Wang, Cuiping Li, Hong Chen

arXiv:2403.19135v5 · 15.430 citations · h-index: 27 · Has Code

Originality: Highly original

AI Analysis

This work addresses model compression for large language models, offering a novel pruning approach that yields incremental gains in efficiency and performance.

The paper tackles the problem of compressing large language models by pruning redundant layers. It introduces LLM-Streamline, which removes less important layers and replaces them with a lightweight network to reduce performance loss. Experiments show that it outperforms state-of-the-art pruning methods in both performance and training efficiency.

This paper introduces LLM-Streamline, a pioneering work on layer pruning for large language models (LLMs). It is based on the observation that different layers have varying impacts on hidden states, which makes it possible to identify less important layers to prune. LLM-Streamline comprises two parts: layer pruning, which removes consecutive layers with the lowest importance based on the target sparsity, and layer replacement, a novel module that trains a lightweight network to replace the pruned layers and mitigate performance loss. Additionally, a new metric called stability is proposed to address the limitations of the widely used accuracy metric in evaluating model compression. Experiments show that LLM-Streamline outperforms both previous and concurrent state-of-the-art pruning methods in terms of both performance and training efficiency. Our code is available at https://github.com/RUCKBReasoning/LLM-Streamline
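The layer-pruning step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a layer's importance is scored as one minus the cosine similarity between its input and output hidden states, and that the number of layers to prune (`num_pruned`) has already been derived from the target sparsity. The function names and the toy hidden states are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def layer_importance(hidden_states):
    """Score each layer by how much it changes its input.

    hidden_states[i] is the input to layer i and hidden_states[i + 1]
    is its output. Assumption: importance = 1 - cosine similarity.
    """
    return [
        1.0 - cosine_similarity(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ]

def least_important_window(importance, num_pruned):
    """Return the half-open range [start, end) of the consecutive
    layers whose summed importance is lowest."""
    best_start, best_score = 0, float("inf")
    for start in range(len(importance) - num_pruned + 1):
        score = sum(importance[start:start + num_pruned])
        if score < best_score:
            best_start, best_score = start, score
    return best_start, best_start + num_pruned

# Toy hidden states for a 4-layer model (illustrative values only).
states = [
    [1.0, 0.0],  # input to layer 0
    [0.0, 1.0],  # layer 0 rotates the state: large change
    [0.0, 2.0],  # layer 1 only rescales: no change in direction
    [0.1, 2.0],  # layer 2 nudges the state slightly
    [2.0, 0.1],  # layer 3 changes the direction substantially
]
scores = layer_importance(states)
start, end = least_important_window(scores, num_pruned=2)
print(f"prune layers {start}..{end - 1}")
```

In this toy case, layers 1 and 2 barely alter the direction of the hidden state, so they form the window selected for pruning. In the method described above, a lightweight network would then be trained to stand in for the removed window.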
