LG | Feb 18, 2025

Scalable Model Merging with Progressive Layer-wise Distillation

arXiv:2502.12706v2 | 9 citations | h-index: 4 | ICML
Originality: Highly original
AI Analysis

This paper addresses the challenge of efficiently integrating multiple fine-tuned models, a practical concern for machine learning practitioners, and offers incremental improvements over existing few-shot merging methods.

The paper tackles performance degradation in model merging when little or no data is available. It introduces ProDistill, a few-shot merging algorithm that achieves state-of-the-art performance, with improvements of up to 6.14% on vision tasks and 6.61% on NLU tasks.

Model merging offers an effective way to integrate the capabilities of multiple fine-tuned models. However, the performance degradation of the merged model remains a challenge, particularly when little or no data is available. This paper first highlights the necessity of domain-specific data for model merging by proving that data-agnostic algorithms can have arbitrarily bad worst-case performance. Building on this theoretical insight, we explore the relationship between model merging and distillation, introducing a novel few-shot merging algorithm, ProDistill (Progressive Layer-wise Distillation). Contrary to the common belief that layer-wise training hurts performance, we show that layer-wise teacher-student distillation not only enhances scalability but also improves model merging performance. We conduct extensive experiments showing that, compared to existing few-shot merging methods, ProDistill achieves state-of-the-art performance, with improvements of up to 6.14% and 6.61% on vision and NLU tasks, respectively. Furthermore, we extend the experiments to models with over 10B parameters, showcasing the exceptional scalability of ProDistill.
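The abstract does not spell out the algorithm, so the following is only a toy sketch of what progressive layer-wise teacher-student distillation for merging could look like, not the authors' implementation. It assumes things the abstract does not state: one scalar merging coefficient per layer and per task, applied to task vectors (fine-tuned weight minus base weight); tiny linear layers followed by `tanh`; and hand-rolled gradient descent with step halving. All function names (`progressive_layerwise_merge`, `fit_layer`, and so on) are hypothetical.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sub(A, B):
    """Elementwise A - B; used to form a task vector."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def merged_weight(W0, taus, lam):
    """Base weight plus coefficient-weighted task vectors."""
    return [[W0[r][c] + sum(l * t[r][c] for l, t in zip(lam, taus))
             for c in range(len(W0[0]))] for r in range(len(W0))]

def layer_loss(W0, taus, lam, pairs):
    """Mean squared distance between merged-layer and teacher-layer outputs."""
    W = merged_weight(W0, taus, lam)
    return sum(sum((o - t) ** 2 for o, t in zip(matvec(W, x), y))
               for x, y in pairs) / len(pairs)

def layer_grad(W0, taus, lam, pairs):
    """Gradient of layer_loss with respect to each coefficient in lam."""
    W = merged_weight(W0, taus, lam)
    grad = [0.0] * len(lam)
    for x, y in pairs:
        resid = [o - t for o, t in zip(matvec(W, x), y)]
        for j, tau in enumerate(taus):
            grad[j] += 2 * sum(r * d for r, d in zip(resid, matvec(tau, x)))
    return [g / len(pairs) for g in grad]

def fit_layer(W0, taus, pairs, steps=200, lr=0.5):
    """Fit one layer's coefficients, starting from simple averaging."""
    lam = [1.0 / len(taus)] * len(taus)
    loss = layer_loss(W0, taus, lam, pairs)
    for _ in range(steps):
        grad = layer_grad(W0, taus, lam, pairs)
        cand = [l - lr * g for l, g in zip(lam, grad)]
        cand_loss = layer_loss(W0, taus, cand, pairs)
        if cand_loss < loss:   # accept only steps that reduce the loss
            lam, loss = cand, cand_loss
        else:                  # otherwise shrink the step size
            lr *= 0.5
    return lam, loss

def progressive_layerwise_merge(base, teachers, data):
    """Distill layer by layer, pushing activations forward after each layer."""
    k = len(teachers)
    student_acts = [list(xs) for xs in data]   # inputs seen by the merged model
    teacher_acts = [list(xs) for xs in data]   # inputs seen by each teacher
    coeffs, report = [], []
    for l, W0 in enumerate(base):
        taus = [sub(t[l], W0) for t in teachers]
        teacher_out = [[matvec(teachers[i][l], x) for x in teacher_acts[i]]
                       for i in range(k)]
        pairs = [(s, y) for i in range(k)
                 for s, y in zip(student_acts[i], teacher_out[i])]
        before = layer_loss(W0, taus, [1.0 / k] * k, pairs)
        lam, after = fit_layer(W0, taus, pairs)
        W = merged_weight(W0, taus, lam)
        student_acts = [[[math.tanh(v) for v in matvec(W, s)] for s in acts]
                        for acts in student_acts]
        teacher_acts = [[[math.tanh(v) for v in y] for y in outs]
                        for outs in teacher_out]
        coeffs.append(lam)
        report.append((before, after))
    return coeffs, report

# Toy setup: random base model, two "fine-tuned" teachers, a few inputs per task.
random.seed(0)
DIM, LAYERS, TASKS, SHOTS = 3, 2, 2, 8
base = [[[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]
        for _ in range(LAYERS)]
teachers = [[[[w + random.uniform(-0.3, 0.3) for w in row] for row in W0]
             for W0 in base] for _ in range(TASKS)]
data = [[[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(SHOTS)]
        for _ in range(TASKS)]
coeffs, report = progressive_layerwise_merge(base, teachers, data)
for l, (before, after) in enumerate(report):
    print(f"layer {l}: loss {before:.4f} -> {after:.4f}")
```

Because each layer is fitted on its own and only activations are carried forward, the sketch never needs gradients through more than one layer at a time, which is one plausible reading of the scalability claim. Each reported pair compares the layer loss under simple averaging with the loss after fitting.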

Code Implementations: 1 repo
