Merino: Entropy-driven Design for Generative Language Models on IoT Devices
This work addresses the challenge of deploying generative language models on resource-constrained IoT devices and represents an incremental improvement in model efficiency for mobile settings.
The paper tackles the problem of scaling down generative large language models (LLMs) for IoT devices by proposing an entropy-driven design framework. The resulting models match or exceed the performance of the 350M-parameter OPT model while running 4.9x faster on an NVIDIA Jetson Nano and being 5.5x smaller.
Generative Large Language Models (LLMs) stand as a revolutionary advancement in the modern era of artificial intelligence (AI). However, scaling down LLMs for resource-constrained hardware, such as Internet-of-Things (IoT) devices, requires non-trivial effort and domain knowledge. In this paper, we propose a novel information-entropy framework for designing mobile-friendly generative language models. The whole design procedure involves solving a mathematical programming (MP) problem, which can be done on a CPU within minutes, making it nearly zero-cost. We evaluate our designed models, termed MeRino, across fourteen NLP downstream tasks, showing their competitive performance against state-of-the-art autoregressive transformer models under the mobile setting. Notably, MeRino achieves similar or better performance on both language modeling and zero-shot learning tasks compared to the 350M-parameter OPT, while being 4.9x faster on an NVIDIA Jetson Nano with a 5.5x reduction in model size.
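
To make the idea of choosing an architecture by solving a small constrained optimization problem on a CPU more concrete, here is a toy sketch. It is not the paper's formulation: the objective `proxy_score` is a hypothetical stand-in for an entropy-style score, the parameter estimate `block_params` is the common rough rule of about 12·d² weights per transformer block (embeddings ignored), and exhaustive grid search replaces a real MP solver. The budget and search grid are also assumed values chosen only for illustration.

```python
import math


def block_params(depth, width):
    # Rough, commonly used estimate for a transformer stack:
    # ~4*d^2 (attention) + ~8*d^2 (feed-forward with 4x expansion) per block.
    # Embedding parameters are ignored in this sketch.
    return 12 * depth * width * width


def proxy_score(depth, width):
    # Hypothetical entropy-style objective used only for illustration;
    # this is NOT the objective defined in the paper.
    return depth * math.log(width)


def search(budget, depths, widths):
    # Exhaustive search over a small grid: keep the feasible candidate
    # (parameter count within budget) with the highest proxy score.
    best = None
    for depth in depths:
        for width in widths:
            if block_params(depth, width) > budget:
                continue
            candidate = (proxy_score(depth, width), depth, width)
            if best is None or candidate > best:
                best = candidate
    return best


# Assumed budget of 60M block parameters and an assumed search grid.
result = search(60_000_000, range(2, 25), range(256, 1025, 64))
score, depth, width = result
print(f"depth={depth} width={width} params={block_params(depth, width):,}")
```

Even this brute-force version finishes almost instantly, because scoring a candidate needs only a closed-form expression and no model training. That is the general reason a design procedure of this kind can be cheap; the actual objective and constraints are those defined in the paper.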