Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models
This work addresses the need for more efficient inference in large language models, a requirement for practical deployment, though the contribution is incremental in that it builds on existing Mamba and Transformer architectures.
The authors tackled the problem of high inference costs in large language models by introducing Nemotron-H, a family of hybrid Mamba-Transformer models that achieve up to 3x faster inference while maintaining similar or better accuracy compared to state-of-the-art open-sourced models like Qwen-2.5 and Llama-3.1.
As inference-time scaling becomes critical for enhanced reasoning capabilities, it is becoming increasingly important to build models that are efficient at inference time. We introduce Nemotron-H, a family of 8B and 56B/47B hybrid Mamba-Transformer models designed to reduce inference cost for a given accuracy level. To achieve this goal, we replace the majority of self-attention layers in the common Transformer model architecture with Mamba layers that perform constant computation and require constant memory per generated token. We show that Nemotron-H models offer either better or on-par accuracy compared to other similarly-sized state-of-the-art open-sourced Transformer models (e.g., Qwen-2.5-7B/72B and Llama-3.1-8B/70B), while being up to 3$\times$ faster at inference. To further increase inference speed and reduce the memory required at inference time, we created Nemotron-H-47B-Base from the 56B model using a new compression technique based on pruning and distillation, called MiniPuzzle. Nemotron-H-47B-Base achieves similar accuracy to the 56B model, but is 20% faster at inference. In addition, we introduce an FP8-based training recipe and show that it can achieve results on par with BF16-based training. This recipe is used to train the 56B model. We are releasing Nemotron-H base model checkpoints with support in Hugging Face and NeMo.
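The claim that Mamba layers need constant memory per generated token, while self-attention layers do not, can be illustrated with a rough count of cached elements. The following is a minimal sketch under stated assumptions: the function names, layer counts, head sizes, and state size are all hypothetical and are not the Nemotron-H configuration. It assumes each attention layer caches one key and one value vector per token, and each Mamba-style layer keeps a fixed-size recurrent state.

```python
def kv_cache_elems(num_tokens, attn_layers, num_kv_heads, head_dim):
    """Elements cached by self-attention layers.

    Assumes one key and one value vector per token, per KV head, per layer,
    so the total grows linearly with the number of tokens.
    """
    return 2 * num_tokens * attn_layers * num_kv_heads * head_dim


def ssm_state_elems(ssm_layers, state_elems_per_layer):
    """Elements held by Mamba-style layers.

    Assumes a fixed-size recurrent state per layer, independent of how many
    tokens have been generated.
    """
    return ssm_layers * state_elems_per_layer


def hybrid_cache_elems(num_tokens, attn_layers, ssm_layers,
                       num_kv_heads, head_dim, state_elems_per_layer):
    """Total cached elements for a hybrid stack of attention and SSM layers."""
    return (kv_cache_elems(num_tokens, attn_layers, num_kv_heads, head_dim)
            + ssm_state_elems(ssm_layers, state_elems_per_layer))


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    total_layers, kv_heads, head_dim, state = 32, 8, 128, 500_000
    for tokens in (1_000, 10_000, 100_000):
        pure = kv_cache_elems(tokens, total_layers, kv_heads, head_dim)
        # Hypothetical hybrid: 4 attention layers, 28 Mamba-style layers.
        hybrid = hybrid_cache_elems(tokens, 4, total_layers - 4,
                                    kv_heads, head_dim, state)
        print(f"tokens={tokens:>7}  pure-attention={pure:>13}  hybrid={hybrid:>13}")
```

Under these assumptions, the hybrid stack's cache still grows with sequence length, but only through its few remaining attention layers; the Mamba-style layers contribute a fixed term regardless of how many tokens are generated.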