LG CV Mar 11, 2023

Knowledge Distillation for Efficient Sequences of Training Runs

arXiv:2303.06480v1, 4 citations, h-index: 23
Originality: Incremental advance
AI Analysis

The paper addresses a practical problem for machine learning practitioners, the high computational cost of repeated training runs, and reports a significant efficiency improvement.

The paper tackles the cost of sequential training runs, as in hyperparameter search, by distilling knowledge from previous runs into future ones. This reduces training time even after accounting for the overhead of knowledge distillation (KD), and two further strategies cut that overhead by 80-90% with minimal effect on accuracy.

In many practical scenarios -- like hyperparameter search or continual retraining with new data -- related training runs are performed many times in sequence. Current practice is to train each of these models independently from scratch. We study the problem of exploiting the computation invested in previous runs to reduce the cost of future runs using knowledge distillation (KD). We find that augmenting future runs with KD from previous runs dramatically reduces the time necessary to train these models, even taking into account the overhead of KD. We improve on these results with two strategies that reduce the overhead of KD by 80-90% with minimal effect on accuracy and vast Pareto improvements in overall cost. We conclude that KD is a promising avenue for reducing the cost of the expensive preparatory work that precedes training final models in practice.
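To make "augmenting future runs with KD from previous runs" concrete, here is a minimal sketch of a distillation objective. It uses the common formulation that blends hard-label cross-entropy with a temperature-scaled KL term against a teacher's outputs. The abstract does not give the paper's exact loss, so the function names, `alpha`, and `temperature` are illustrative assumptions, with the teacher standing in for a model from an earlier run.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    """Blend hard-label cross-entropy with a soft-target KL term.

    alpha and temperature are illustrative defaults (assumptions),
    not values taken from the paper.
    """
    # Hard-label term: cross-entropy of the student against the true class.
    p_student = softmax(student_logits)
    ce = -math.log(p_student[label])

    # Soft-target term: KL(teacher || student) at the given temperature.
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_t, p_s) if t > 0)

    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return (1 - alpha) * ce + alpha * temperature ** 2 * kl

# Teacher logits would come from a model trained in a previous run.
loss = kd_loss([0.2, 1.1, 2.5], [0.1, 0.9, 3.0], label=2)
```

Each teacher forward pass adds cost to the new run, which is presumably the overhead the abstract refers to. When the student matches the teacher exactly, the KL term vanishes and only the weighted cross-entropy remains.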

