LG AI Jun 17, 2025

Less is More: Undertraining Experts Improves Model Upcycling

Stefan Horoi, Guy Wolf, Eugene Belilovsky, Gintare Karolina Dziugaite

arXiv:2506.14126v113.03 citations

Originality: Incremental advance

AI Analysis

This work addresses a bottleneck in efficiently reusing fine-tuned models for multi-task systems, which helps machine learning practitioners get more out of experts that have already been trained. The advance is incremental because it builds on existing upcycling methods.

The paper shows that long fine-tuning of expert models, which optimizes each expert's individual task performance, degrades their merging performance in model upcycling and leads to worse downstream results. It then demonstrates that an aggressive early stopping strategy can significantly improve upcycling performance, with reported gains of up to 15% in merging accuracy.

Modern deep learning is increasingly characterized by the use of open-weight foundation models that can be fine-tuned on specialized datasets. This has led to a proliferation of expert models and adapters, often shared via platforms like HuggingFace and AdapterHub. To leverage these resources, numerous model upcycling methods have emerged, enabling the reuse of fine-tuned models in multi-task systems. A natural pipeline has thus formed to harness the benefits of transfer learning and amortize sunk training costs: models are pre-trained on general data, fine-tuned on specific tasks, and then upcycled into more general-purpose systems. A prevailing assumption is that improvements at one stage of this pipeline propagate downstream, leading to gains at subsequent steps. In this work, we challenge that assumption by examining how expert fine-tuning affects model upcycling. We show that long fine-tuning of experts that optimizes for their individual performance leads to degraded merging performance, both for fully fine-tuned and LoRA-adapted models, and to worse downstream results when LoRA adapters are upcycled into MoE layers. We trace this degradation to the memorization of a small set of difficult examples that dominate late fine-tuning steps and are subsequently forgotten during merging. Finally, we demonstrate that a task-dependent aggressive early stopping strategy can significantly improve upcycling performance.
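
To make the pipeline in the abstract concrete, here is a minimal sketch of merging early-stopped experts. The abstract does not name a specific merging method, so the sketch assumes a simple task-arithmetic style merge (adding the averaged expert deltas to the pre-trained weights). The names `merge_experts` and `pick_checkpoint`, the `scale` parameter, and the toy weights are all hypothetical illustrations, not the authors' code.

```python
from typing import Dict, List

# Toy stand-in for a model: parameter name -> flat list of floats (assumption).
Weights = Dict[str, List[float]]


def pick_checkpoint(checkpoints: Dict[int, Weights], max_steps: int) -> Weights:
    """Early stopping: return the latest saved checkpoint at or before max_steps."""
    eligible = [step for step in checkpoints if step <= max_steps]
    if not eligible:
        raise ValueError("no checkpoint saved at or before max_steps")
    return checkpoints[max(eligible)]


def merge_experts(pretrained: Weights, experts: List[Weights], scale: float = 1.0) -> Weights:
    """Add the scaled mean of the experts' deltas to the pre-trained weights."""
    if not experts:
        raise ValueError("need at least one expert to merge")
    merged: Weights = {}
    for name, base in pretrained.items():
        # Sum each expert's difference from the pre-trained weights.
        delta_sum = [0.0] * len(base)
        for expert in experts:
            for i, (e, p) in enumerate(zip(expert[name], base)):
                delta_sum[i] += e - p
        merged[name] = [p + scale * d / len(experts) for p, d in zip(base, delta_sum)]
    return merged


# Toy usage: two experts, each stopped early at a hypothetical per-task step budget.
pretrained = {"w": [0.0, 0.0]}
expert_a_ckpts = {100: {"w": [2.0, 0.0]}, 1000: {"w": [6.0, 0.0]}}
expert_b_ckpts = {100: {"w": [0.0, 4.0]}, 1000: {"w": [0.0, 9.0]}}

early_a = pick_checkpoint(expert_a_ckpts, max_steps=200)
early_b = pick_checkpoint(expert_b_ckpts, max_steps=200)
merged = merge_experts(pretrained, [early_a, early_b])
```

In this sketch the step budget passed to `pick_checkpoint` plays the role of the task-dependent early stopping point; how that budget is chosen per task is not specified in the abstract and is left open here.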
