CLApr 15, 2025

Minitron-SSM: Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning

arXiv:2504.11409v27 citationsh-index: 45
Originality Incremental advance
AI Analysis

This work addresses the challenge of efficiently compressing hybrid language models for deployment, representing an incremental improvement over existing compression techniques.

The authors tackled the problem of compressing hybrid LLM architectures that combine Attention and State Space Models (SSMs), achieving a 4B parameter model from an 8B base with 40x fewer training tokens that surpasses similarly-sized models in accuracy while achieving 2x faster inference.

Hybrid LLM architectures that combine Attention and State Space Models (SSMs) achieve state-of-the-art accuracy and runtime performance. Recent work has demonstrated that applying compression and distillation to Attention-only models yields smaller, more accurate models at a fraction of the training cost. In this work, we explore the effectiveness of compressing Hybrid architectures. We introduce a novel group-aware pruning strategy that preserves the structural integrity of SSM blocks and their sequence modeling capabilities. Furthermore, we demonstrate the necessity of such SSM pruning to achieve improved accuracy and inference speed compared to traditional approaches. Our compression recipe combines SSM, FFN, embedding dimension, and layer pruning, followed by knowledge distillation-based retraining, similar to the MINITRON technique. Using this approach, we compress the Nemotron-H 8B Hybrid model down to 4B parameters with up to 40x fewer training tokens. The resulting model surpasses the accuracy of similarly-sized models while achieving 2x faster inference, significantly advancing the Pareto frontier.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes