CLSep 19, 2024

Enhancing Knowledge Distillation of Large Language Models through Efficient Multi-Modal Distribution Alignment

arXiv:2409.12545v224 citationsh-index: 2Has Code
Originality Incremental advance
AI Analysis

This work addresses a specific bottleneck in model compression for NLP, offering an incremental improvement over existing distillation techniques.

The paper tackles the challenge of knowledge distillation for large language models where multi-modal probability distributions hinder student learning, proposing a ranking loss method that improves student performance on downstream tasks with significant gains.

Knowledge distillation (KD) is an effective model compression method that can transfer the internal capabilities of large language models (LLMs) to smaller ones. However, the multi-modal probability distribution predicted by teacher LLMs causes difficulties for student models to learn. In this paper, we first demonstrate the importance of multi-modal distribution alignment with experiments and then highlight the inefficiency of existing KD approaches in learning multi-modal distributions. To address this problem, we propose Ranking Loss based Knowledge Distillation (RLKD), which encourages the consistency of the ranking of peak predictions between the teacher and student models. By incorporating word-level ranking loss, we ensure excellent compatibility with existing distillation objectives while fully leveraging the fine-grained information between different categories in peaks of two predicted distribution. Experimental results demonstrate that our method enables the student model to better learn the multi-modal distributions of the teacher model, leading to a significant performance improvement in various downstream tasks.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes