CL, AI, LG | Feb 6, 2024

DistiLLM: Towards Streamlined Distillation for Large Language Models

arXiv:2402.03898v2 | 105 citations | h-index: 10 | ICML
Originality: Incremental advance
AI Analysis

This paper addresses the computational cost and lack of standardization in knowledge distillation for auto-regressive language models, offering an incremental improvement for model compression.

The paper tackled the problem of inefficient knowledge distillation for large language models by introducing DistiLLM, which achieved up to 4.3x speedup compared to recent methods while building high-performing student models.

Knowledge distillation (KD) is widely used for compressing a teacher model to a smaller student model, reducing its inference cost and memory footprint while preserving model capabilities. However, current KD methods for auto-regressive sequence models (e.g., large language models) suffer from missing a standardized objective function. Moreover, the recent use of student-generated outputs to address training-inference mismatches has significantly escalated computational costs. To tackle these issues, we introduce DistiLLM, a more effective and efficient KD framework for auto-regressive language models. DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss, where we unveil and leverage its theoretical properties, and (2) an adaptive off-policy approach designed to enhance the efficiency in utilizing student-generated outputs. Extensive experiments, including instruction-following tasks, demonstrate the effectiveness of DistiLLM in building high-performing student models while achieving up to 4.3x speedup compared to recent KD methods.
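
The abstract names a skew Kullback-Leibler divergence loss but does not spell out a formula. As a rough illustration only, the sketch below assumes the common skew-divergence form, KL(p || alpha * p + (1 - alpha) * q), where p is the teacher distribution and q the student distribution over next tokens. The function names, the toy distributions, and the value alpha = 0.1 are hypothetical choices made for this example, not taken from the paper's code.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # terms with p_i = 0 contribute nothing
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


def skew_kl(p, q, alpha=0.1):
    """Assumed skew KL: KL(p || alpha * p + (1 - alpha) * q).

    Mixing a little of p into the second argument keeps the ratio inside
    the logarithm bounded, so the value stays finite for alpha > 0.
    """
    mixture = [alpha * pi + (1.0 - alpha) * qi for pi, qi in zip(p, q)]
    return kl_divergence(p, mixture)


# Toy next-token distributions (hypothetical values).
teacher = [0.7, 0.2, 0.1]
student = [0.1, 0.3, 0.6]

plain = kl_divergence(teacher, student)
skewed = skew_kl(teacher, student, alpha=0.1)
print(f"KL = {plain:.4f}, skew KL = {skewed:.4f}")
```

With alpha = 0 this reduces to the ordinary KL divergence. With alpha > 0 the value is finite even when the student assigns zero probability to a token the teacher uses, which is one plausible reason such a loss would be better behaved during training.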

Code Implementations: 5 repos
