Block removal for large language models through constrained binary optimization

arXiv:2602.00161
AI Analysis

The paper addresses the cost of deploying resource-intensive large language models, offering an efficient compression method that applies to a broad range of architectures. It is incremental in that it builds on existing block-removal techniques.

The paper compresses large language models by removing whole transformer blocks. It formulates the choice of blocks as a constrained binary optimization problem and maps it to an Ising model, whose energies are used to rank candidate removal configurations efficiently. The authors report that the method outperforms state-of-the-art block-removal methods across several benchmarks, with gains that persist after short retraining and reach up to 6 points on the MMLU benchmark.

Compressing resource-intensive large language models by removing whole transformer blocks is a seemingly simple idea, but identifying which blocks to remove constitutes an exponentially difficult combinatorial problem. In this paper, we formulate block removal as a constrained binary optimization problem that can be mapped to a physical system (Ising model), whose energies are a strong proxy for downstream model performance. This formulation enables an efficient ranking of a large number of candidate block-removal configurations and yields many high-quality, non-trivial solutions beyond consecutive regions. We demonstrate that our approach outperforms state-of-the-art block-removal methods across several benchmarks, with performance gains persisting after short retraining, and reaching improvements of up to 6 points on the MMLU benchmark. Our method requires only forward and backward passes for a few active parameters, together with an (at least approximate) Ising solver, and can be readily applied to any architecture. We illustrate this generality on the recent NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 model, which exhibits a highly inhomogeneous and challenging block structure.
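To make the idea of ranking block-removal configurations by an energy more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a generic quadratic binary energy E(x) = sum_i h_i x_i + sum_{i<j} J_ij x_i x_j, where x_i = 1 means block i is removed, together with a hard constraint that exactly k blocks are removed. The coefficients h and J are random placeholders here; in the paper they would be derived from forward and backward passes, which this sketch does not reproduce. Exhaustive enumeration stands in for an Ising solver and is only feasible for small block counts. The function names energy and rank_configurations are hypothetical.

```python
import itertools

import numpy as np


def energy(x, h, J):
    """Quadratic binary energy: sum_i h_i x_i + sum_{i<j} J_ij x_i x_j."""
    # Only the strict upper triangle of J is used, so each pair counts once.
    return float(h @ x + x @ np.triu(J, k=1) @ x)


def rank_configurations(h, J, k, top=5):
    """Score every way of removing exactly k blocks and return the lowest energies."""
    n = len(h)
    scored = []
    for removed in itertools.combinations(range(n), k):
        x = np.zeros(n)
        x[list(removed)] = 1.0  # 1 marks a removed block
        scored.append((energy(x, h, J), removed))
    scored.sort()  # ascending energy: best candidates first
    return scored[:top]


# Placeholder coefficients (assumption): random values, not measured from a model.
rng = np.random.default_rng(0)
n_blocks, n_remove = 12, 4
h = rng.normal(size=n_blocks)
J = rng.normal(scale=0.3, size=(n_blocks, n_blocks))
J = (J + J.T) / 2  # symmetric pairwise couplings

best = rank_configurations(h, J, n_remove)
for e, removed in best:
    print(f"{e:+.3f}  remove blocks {removed}")
```

The binary variables can be turned into Ising spins with the standard substitution x_i = (1 + s_i) / 2, s_i in {-1, +1}, which is what allows an Ising solver to replace the brute-force loop. Because the pairwise terms couple arbitrary pairs of blocks, low-energy configurations need not be consecutive regions.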
