CV · Feb 26, 2025

A Sliding Layer Merging Method for Efficient Depth-Wise Pruning in LLMs

arXiv:2502.19159v3 · 5 citations · h-index: 3 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses efficient LLM inference in resource-constrained scenarios, offering an incremental improvement over existing pruning techniques.

The paper tackles a weakness of depth-wise pruning in large language models: discarding entire layers indiscriminately degrades performance. It proposes a sliding layer merging method that dynamically fuses consecutive layers based on their similarity, simplifying the model while maintaining performance. With 35% pruning on Vicuna-7B, the method improves average zero-shot task performance by 1.654% over existing methods.
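The summary above describes fusing consecutive layers based on their similarity. The following is a minimal sketch of that idea under assumptions not taken from the paper: layers are toy NumPy parameter arrays, similarity is plain cosine similarity between the anchor layer's calibration output and each candidate layer's output, and a window is fused by adding each member's parameter difference from the anchor. The names `sliding_layer_merge`, `merge_window`, and `cosine` are hypothetical; the authors' repository holds their actual procedure.

```python
import numpy as np

def merge_window(weights, anchor, members):
    """Fuse a window of layers into the anchor by adding each member's
    parameter difference from the anchor (an assumed merge rule)."""
    merged = weights[anchor].copy()
    for k in members:
        if k != anchor:
            merged += weights[k] - weights[anchor]
    return merged

def cosine(a, b):
    """Cosine similarity of flattened activations; a stand-in for the
    paper's kernel-based similarity measure."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sliding_layer_merge(weights, outputs, threshold):
    """Greedily merge consecutive layers, sliding from the top layer down.

    weights:   per-layer parameter arrays (toy stand-ins for Transformer layers)
    outputs:   per-layer activations on calibration data
    threshold: minimum similarity for absorbing the next-lower layer
    Returns the merged layers ordered bottom to top.
    """
    merged_layers = []
    anchor = len(weights) - 1              # start at the top layer
    while anchor >= 0:
        low = anchor
        # Grow the window downward while the next layer resembles the anchor.
        while low > 0 and cosine(outputs[anchor], outputs[low - 1]) >= threshold:
            low -= 1
        merged_layers.append(merge_window(weights, anchor, range(low, anchor + 1)))
        anchor = low - 1                   # resume just below the merged window
    return merged_layers[::-1]

# Toy example: six layers whose outputs fall into three similar groups.
e = np.eye(3)
outputs = [e[0:1], e[1:2], e[1:2], e[2:3], e[2:3], e[2:3]]
weights = [np.full((2, 2), float(i)) for i in range(6)]
merged = sliding_layer_merge(weights, outputs, threshold=0.9)
print(len(merged))  # 3 layers remain
```

The threshold controls the trade-off: a higher value yields smaller windows and less compression, while a lower value fuses more layers per window.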

Compared to width-wise pruning, depth-wise pruning can significantly accelerate inference in resource-constrained scenarios. However, treating the entire Transformer layer as the minimum pruning unit may degrade model performance, because it indiscriminately discards all of the layer's information. This paper reveals a "Patch-like" feature relationship between layers in large language models by analyzing the correlation between the outputs of different layers in a reproducing kernel Hilbert space. Building on this observation, we propose a sliding layer merging method that dynamically selects and fuses consecutive layers from top to bottom according to a predefined similarity threshold, thereby simplifying the model structure while maintaining its performance. Extensive experiments on LLMs with various architectures and parameter scales show that our method outperforms existing pruning techniques in both zero-shot inference performance and retraining recovery quality after pruning. In particular, with 35% pruning on the Vicuna-7B model, our method achieved a 1.654% improvement in average zero-shot task performance over existing methods. Moreover, we reveal the potential of combining depth pruning with width pruning to further enhance the pruning effect. Our code is available at https://github.com/920927/SLM-a-sliding-layer-merging-method.
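The abstract's analysis correlates layer outputs in a reproducing kernel Hilbert space. One standard kernel-based measure of this kind is centered kernel alignment (CKA). The sketch below assumes linear CKA, which may not match the paper's exact kernel, and builds a layer-by-layer similarity matrix in which a "Patch-like" relationship would presumably appear as blocks of high similarity among consecutive layers. `linear_cka` and `layer_similarity_matrix` are hypothetical helper names, and the toy activations are synthetic.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between activations x, y of shape (n_samples, dim).

    This is HSIC with linear kernels, normalized so the score lies in [0, 1].
    """
    x = x - x.mean(axis=0, keepdims=True)   # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return float(cross / (norm_x * norm_y))

def layer_similarity_matrix(layer_outputs):
    """Symmetric matrix of pairwise linear CKA scores between layers."""
    n = len(layer_outputs)
    sim = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = linear_cka(layer_outputs[i], layer_outputs[j])
    return sim

# Toy outputs: layers 0-1 share one representation, layers 2-3 another.
rng = np.random.default_rng(0)
a = rng.normal(size=(200, 8))
b = rng.normal(size=(200, 8))
outputs = [a, 3 * a, b, b + 0.01 * a]
sim = layer_similarity_matrix(outputs)
print(np.round(sim, 2))
```

Linear CKA is invariant to isotropic scaling and orthogonal transformations of the features, so `a` and `3 * a` score as identical, while the unrelated groups score near zero.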

Code Implementations: 1 repo