CL, Mar 16, 2023

Neural Architecture Search for Effective Teacher-Student Knowledge Transfer in Language Models

Aashka Trivedi, Takuma Udagawa, Michele Merler, Rameswar Panda, Yousef El-Kurdi, Bishwaranjan Bhattacharjee

arXiv:2303.09639v22.510 citations, h-index: 41

Originality: Incremental advance

AI Analysis

This work addresses the deployment of language models in resource-constrained environments by improving KD efficiency, though it is incremental as it builds on existing NAS and KD methods.

The paper addresses knowledge distillation (KD) that is ineffective because the student is hand-picked from existing models rather than chosen from the full space of possible architectures. It uses neural architecture search (NAS) guided by KD to find a better student. The resulting student achieves a 7x speedup on CPU inference while maintaining 90% of the teacher model's performance.

Large pretrained language models have achieved state-of-the-art results on a variety of downstream tasks. Knowledge Distillation (KD) into a smaller student model addresses their inefficiency, allowing for deployment in resource-constrained environments. However, KD can be ineffective when the student is manually selected from a set of existing options, since that choice can be sub-optimal within the space of all possible student architectures. We develop multilingual KD-NAS, the use of Neural Architecture Search (NAS) guided by KD to find the optimal student architecture for task-agnostic distillation from a multilingual teacher. In each episode of the search process, a NAS controller predicts a reward based on the distillation loss and the inference latency. The top candidate architectures are then distilled from the teacher on a small proxy set. Finally, the architecture(s) with the highest reward are selected and distilled on the full training corpus. KD-NAS can automatically trade off efficiency and effectiveness, and recommends architectures suitable for various latency budgets. Using our multi-layer hidden state distillation process, our KD-NAS student model achieves a 7x speedup on CPU inference (2x on GPU) compared to an XLM-RoBERTa Base teacher, while maintaining 90% performance, and has been deployed in 3 software offerings requiring large throughput, low latency, and deployment on CPU.
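
The search loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. The search space, the latency proxy, the simulated distillation loss, the reward formula, and the nearest-neighbour "controller" are all hypothetical stand-ins chosen only to show the control flow: predict rewards, distill the top candidates on a proxy set, then keep the architecture with the highest measured reward.

```python
import itertools
import random

# Hypothetical search space of (num_layers, hidden_size); not the paper's space.
SEARCH_SPACE = list(itertools.product([3, 4, 6], [256, 384, 512]))

# Toy latency proxy for a 12-layer, 768-wide teacher (assumption, not a measurement).
TEACHER_LATENCY = 12 * 768


def toy_latency(arch):
    """Toy latency proxy: number of layers times hidden size."""
    layers, hidden = arch
    return layers * hidden


def toy_distill_loss(arch, rng):
    """Stand-in for distilling on a small proxy set: bigger students lose less."""
    capacity = toy_latency(arch) / TEACHER_LATENCY
    return 0.9 * (1.0 - capacity) + rng.uniform(0.0, 0.05)


def reward(arch, loss, latency_weight=0.5):
    """Assumed reward: weighted mix of low distillation loss and low latency."""
    speed = 1.0 - toy_latency(arch) / TEACHER_LATENCY
    return (1.0 - latency_weight) * (1.0 - loss) + latency_weight * speed


def kd_nas_search(episodes=5, top_k=3, seed=0):
    rng = random.Random(seed)
    observed = {}  # arch -> reward measured after proxy distillation

    def predict(arch):
        # Toy controller: copy the reward of the nearest already-measured
        # architecture (by latency); be optimistic when nothing is measured yet.
        if not observed:
            return 1.0
        nearest = min(observed, key=lambda a: abs(toy_latency(a) - toy_latency(arch)))
        return observed[nearest]

    for _ in range(episodes):
        unexplored = [a for a in SEARCH_SPACE if a not in observed]
        if not unexplored:
            break
        rng.shuffle(unexplored)  # random tie-breaking among equal predictions
        # The controller ranks candidates; only the top ones are distilled.
        candidates = sorted(unexplored, key=predict, reverse=True)[:top_k]
        for arch in candidates:
            loss = toy_distill_loss(arch, rng)
            observed[arch] = reward(arch, loss)

    # The best architecture would then be distilled on the full corpus.
    best = max(observed, key=observed.get)
    return best, observed


if __name__ == "__main__":
    best_arch, rewards = kd_nas_search()
    print("selected (layers, hidden):", best_arch)
```

In a real setting, `toy_distill_loss` would be replaced by an actual proxy-set distillation run and `toy_latency` by measured inference latency on the target hardware. Changing the hypothetical `latency_weight` shifts the selection toward faster or more accurate students, which mirrors the latency-budget trade-off the abstract mentions.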
