LLM Pruning and Distillation in Practice: The Minitron Approach
This work addresses the computational cost and deployment barriers of large language models. Its contribution is incremental, since it builds on existing compression techniques.
The paper compresses Llama 3.1 8B to 4B parameters and Mistral NeMo 12B to 8B parameters using pruning and distillation, yielding a compelling 4B model and a state-of-the-art 8B model.
We present a comprehensive report on compressing the Llama 3.1 8B and Mistral NeMo 12B models to 4B and 8B parameters, respectively, using pruning and distillation. We explore two distinct pruning strategies: (1) depth pruning and (2) joint hidden/attention/MLP (width) pruning, and evaluate the results on common benchmarks from the LM Evaluation Harness. The models are then aligned with NeMo Aligner and tested in instruct-tuned versions. This approach produces a compelling 4B model from Llama 3.1 8B and a state-of-the-art Mistral-NeMo-Minitron-8B (MN-Minitron-8B for brevity) model from Mistral NeMo 12B. We found that with no access to the original data, it is beneficial to slightly fine-tune teacher models on the distillation dataset. We open-source our base model weights on Hugging Face with a permissive license.
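To make the three ingredients named in the abstract concrete, here is a minimal toy sketch of depth pruning, width pruning, and a distillation loss. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify how layers or channels are selected or which loss is used, so the contiguous-layer removal, the precomputed importance scores, and the forward KL divergence on logits are all assumptions, and the function names `depth_prune`, `width_prune`, and `kd_loss` are hypothetical.

```python
import math


def depth_prune(layers, start, count):
    """Depth pruning (toy): drop `count` contiguous layers starting at `start`.

    Assumption: which layers to drop is decided elsewhere.
    """
    return layers[:start] + layers[start + count:]


def width_prune(importance, keep):
    """Width pruning (toy): indices of the `keep` most important channels.

    Assumption: `importance` holds one precomputed score per channel
    (hidden, attention, or MLP); how scores are obtained is not shown.
    """
    ranked = sorted(range(len(importance)),
                    key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:keep])  # keep the original channel order


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - lse for x in scaled]


def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """Forward KL divergence KL(teacher || student) at one token position.

    Assumption: a logit-based distillation loss; the abstract does not
    state which loss is used.
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


if __name__ == "__main__":
    # Eight toy "layers"; remove three of them starting at index 2.
    print(depth_prune(list(range(8)), start=2, count=3))
    # Four toy channels; keep the two with the highest scores.
    print(width_prune([0.1, 0.9, 0.5, 0.7], keep=2))
    # Loss is zero when the student matches the teacher, positive otherwise.
    print(kd_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0]))
    print(kd_loss([2.0, 1.0, 0.0], [0.0, 1.0, 2.0]))
```

In this sketch the pruned student would be trained to minimize `kd_loss` averaged over token positions, with the unpruned model acting as the teacher.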