LG · CL · Feb 5, 2025

Leveraging the true depth of LLMs

Ramón Calvo González, Daniele Paliotta, Matteo Pagliardini, Martin Jaggi, François Fleuret

arXiv:2502.02790v2 · 15.75 citations · h-index: 18 · Trans. Mach. Learn. Res.

Originality: Incremental advance

AI Analysis

This work addresses inference efficiency for large-scale LLM deployment, offering an incremental improvement over existing layer removal or reordering methods.

The paper tackles the high compute requirements of LLMs by proposing a method to group consecutive layers into pairs evaluated in parallel, achieving an inference throughput improvement of 1.05x-1.20x while retaining 95%-99% of original accuracy without retraining.

Large Language Models (LLMs) demonstrate remarkable capabilities at the cost of high compute requirements. Recent studies have demonstrated that intermediate layers in LLMs can be removed or reordered without substantial accuracy loss; however, this insight has not yet been exploited to improve inference efficiency. Leveraging observed layer independence, we propose a novel method that groups consecutive layers into pairs evaluated in parallel, effectively restructuring the computational graph to enhance parallelism. Without requiring retraining or fine-tuning, this approach achieves an inference throughput improvement of 1.05x-1.20x on standard benchmarks, retaining 95%-99% of the original model accuracy. Empirical results demonstrate the practicality of this method in significantly reducing inference cost for large-scale LLM deployment. Additionally, we demonstrate that modest performance degradation can be substantially mitigated through lightweight fine-tuning, further enhancing the method's applicability.
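The pairing idea in the abstract can be illustrated with a toy residual stack. The sketch below is not the authors' implementation. It assumes that a "pair evaluated in parallel" means both layers read the same input and their outputs are summed into the residual stream, and it uses small scalar tanh branches in place of real transformer blocks. All names (`make_layer`, `forward_sequential`, `forward_paired`, `start`, `end`) are hypothetical.

```python
import math

def make_layer(weight):
    """Toy residual branch f(x) = weight * tanh(x), applied elementwise.
    Stands in for a transformer block (an assumption for illustration)."""
    def branch(x):
        return [weight * math.tanh(v) for v in x]
    return branch

def forward_sequential(x, layers):
    """Standard residual stack: each layer reads the previous layer's output."""
    for f in layers:
        x = [xi + fi for xi, fi in zip(x, f(x))]
    return x

def forward_paired(x, layers, start, end):
    """Evaluate layers[start:end] in consecutive pairs that share one input.

    Both branches of a pair read the same x, so they have no data dependency
    on each other and could run concurrently. Their outputs are summed into
    the residual (assumed combination rule). Layers outside [start, end),
    and any unpaired leftover, run sequentially.
    """
    i = 0
    while i < len(layers):
        if start <= i and i + 1 < end:
            a = layers[i](x)
            b = layers[i + 1](x)
            x = [xi + ai + bi for xi, ai, bi in zip(x, a, b)]
            i += 2
        else:
            x = [xi + fi for xi, fi in zip(x, layers[i](x))]
            i += 1
    return x

def max_abs_diff(u, v):
    """Largest elementwise gap between two equal-length vectors."""
    return max(abs(a - b) for a, b in zip(u, v))

# Six toy layers with small residual contributions; pair up layers 1..4.
layers = [make_layer(w) for w in (0.05, -0.03, 0.04, 0.02, -0.05, 0.03)]
x0 = [0.5, -1.0, 2.0]
y_seq = forward_sequential(x0, layers)
y_par = forward_paired(x0, layers, start=1, end=5)
gap = max_abs_diff(y_seq, y_par)
print(f"max deviation from sequential output: {gap:.6f}")
```

In this toy setting the paired output differs from the sequential one only by second-order terms in the branch weights, which is one way to read the "layer independence" the abstract relies on. Whether real LLM layers behave this way, and by how much accuracy changes, is an empirical question the paper addresses; the sketch makes no claim about it.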
