LG Feb 7, 2024

Majority Kernels: An Approach to Leverage Big Model Dynamics for Efficient Small Model Training

Hanna Mazzawi, Pranjal Awasthi, Xavi Gonzalvo, Srikumar Ramalingam

arXiv:2402.05033v22.6 h-index: 32

Originality: Highly original

AI Analysis

This work addresses the need for efficient small-model deployment in constrained environments, offering a novel alternative to traditional two-phase methods such as distillation.

The paper tackles the problem of training a large and a small model simultaneously by introducing Majority Kernels, an architectural change compatible with standard models such as MLPs, ResNets, and Transformers. The authors report performance gains across tasks with minimal training overhead, outperforming baselines such as distilled ensembles.

Recent breakthroughs in, and successful deployments of, large language and vision models in constrained environments predominantly follow a two-phase approach. First, large models are trained to achieve peak performance; then a model-shrinking method is applied to meet hardware constraints. Methods such as distillation, compression, or quantization leverage the highly performant large models to induce smaller performant ones. Formally, this can be seen as the problem of identifying an optimal model of size $n$ from a larger model of size $k \cdot n$, where $k > 1$ is the overparameterization factor. This paper explores the hypothesis that a single training run can simultaneously train a larger model for performance and derive a smaller model for deployment. Our contribution is an effective architectural change, namely Majority Kernels, that is compatible with the main standard architectures such as multi-layer perceptrons (MLPs), residual networks (ResNets), and Transformers. We demonstrate that applying our technique can modify the training dynamics, resulting in performance gains across architectures and tasks while keeping inference performance consistent. Furthermore, our approach adds minimal overhead to the training cost (wall-clock time). The proposed approach shows strong performance on a wide variety of datasets and models, even outperforming strong baselines such as distilled ensembles as well as combinatorial optimization methods based on submodular optimization.
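The abstract frames the task as obtaining a model of size $n$ from one of size $k \cdot n$ but does not spell out the mechanism. The following is a minimal, hypothetical sketch of that framing only: a layer holds $k$ same-shaped kernels during training and collapses them to a single kernel for deployment. The random convex combination during training, the plain average at deployment, and all function names (`make_kernels`, `training_kernel`, `deployed_kernel`) are assumptions made for illustration and are not taken from the paper.

```python
import random


def make_kernels(k, rows, cols, seed=0):
    """Create k independently initialised kernels, each of shape rows x cols."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(cols)]
             for _ in range(rows)]
            for _ in range(k)]


def combine(kernels, weights):
    """Weighted elementwise sum of same-shaped kernels."""
    rows, cols = len(kernels[0]), len(kernels[0][0])
    return [[sum(w * kern[i][j] for w, kern in zip(weights, kernels))
             for j in range(cols)]
            for i in range(rows)]


def training_kernel(kernels, rng):
    """ASSUMPTION: during training, use a random convex combination."""
    raw = [rng.random() + 1e-9 for _ in kernels]  # keep the total nonzero
    total = sum(raw)
    return combine(kernels, [r / total for r in raw])


def deployed_kernel(kernels):
    """ASSUMPTION: for deployment, collapse to the plain average."""
    k = len(kernels)
    return combine(kernels, [1.0 / k] * k)


def count_params(kernels):
    """Total number of scalar entries across a list of kernels."""
    return sum(len(row) for kern in kernels for row in kern)


k, rows, cols = 4, 3, 2                 # overparameterization factor k = 4
kernels = make_kernels(k, rows, cols)   # training-time parameters: k * n
small = deployed_kernel(kernels)        # deployed parameters: n
print(count_params(kernels), count_params([small]))
```

In this sketch the deployed layer has the same shape and inference cost as a standard layer with one kernel, which is one way to read the abstract's claim that inference performance stays consistent while training uses the larger parameter set.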
