CLJan 29, 2022

AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models

arXiv:2201.12507v25 citations
AI Analysis

This addresses the challenge of efficiently compressing large language models for deployment, though it is incremental as it builds on existing NAS and distillation methods.

The paper tackles the problem of manually designing student architectures for knowledge distillation by using Neural Architecture Search to automatically generate compressed models with variable computational costs, achieving up to 2.7x reduction in cost with negligible performance loss on the GLUE benchmark.

Knowledge distillation (KD) methods compress large models into smaller students with manually-designed student architectures given pre-specified computational cost. This requires several trials to find a viable student, and further repeating the process for each student or computational budget change. We use Neural Architecture Search (NAS) to automatically distill several compressed students with variable cost from a large model. Current works train a single SuperLM consisting of millions of subnetworks with weight-sharing, resulting in interference between subnetworks of different sizes. Our framework AutoDistil addresses above challenges with the following steps: (a) Incorporates inductive bias and heuristics to partition Transformer search space into K compact sub-spaces (K=3 for typical student sizes of base, small and tiny); (b) Trains one SuperLM for each sub-space using task-agnostic objective (e.g., self-attention distillation) with weight-sharing of students; (c) Lightweight search for the optimal student without re-training. Fully task-agnostic training and search allow students to be reused for fine-tuning on any downstream task. Experiments on GLUE benchmark against state-of-the-art KD and NAS methods demonstrate AutoDistil to outperform leading compression techniques with upto 2.7x reduction in computational cost and negligible loss in task performance.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes