LG, Oct 19, 2021

When in Doubt, Summon the Titans: Efficient Inference with Large Models

arXiv:2110.10305v1, 9 citations
Originality: Incremental advance
AI Analysis

This paper addresses the problem of deploying large models in real-world settings where computational efficiency is critical, offering an incremental improvement over standard distillation methods.

The paper tackles the high inference cost of large neural networks by proposing a two-stage distillation framework that uses large teacher models for hard examples and lightweight student models for easy ones, achieving better accuracy and reduced amortized inference cost on image classification and NLP benchmarks.

Scaling neural networks to "large" sizes, with billions of parameters, has been shown to yield impressive results on many challenging problems. However, the inference cost incurred by such large models often prevents their application in most real-world settings. In this paper, we propose a two-stage framework based on distillation that realizes the modelling benefits of the large models, while largely preserving the computational benefits of inference with more lightweight models. In a nutshell, we use the large teacher models to guide the lightweight student models to only make correct predictions on a subset of "easy" examples; for the "hard" examples, we fall back to the teacher. Such an approach allows us to efficiently employ large models in practical scenarios where easy examples are much more frequent than rare hard examples. Our proposed use of distillation to only handle easy instances allows for a more aggressive trade-off in the student size, thereby reducing the amortized cost of inference and achieving better accuracy than standard distillation. Empirically, we demonstrate the benefits of our approach on both image classification and natural language processing benchmarks.
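The two-stage inference described in the abstract can be illustrated with a minimal sketch. The abstract does not say how an example is judged "easy" or "hard", so the confidence threshold used here is an assumption, not the paper's mechanism; the names `cascade_predict`, `amortized_cost`, `toy_student`, and `toy_teacher` are hypothetical and exist only for illustration.

```python
from typing import Callable, List, Tuple

Probs = List[float]
Model = Callable[[str], Probs]


def cascade_predict(x: str, student: Model, teacher: Model,
                    threshold: float = 0.9) -> Tuple[int, str]:
    """Return (predicted class, name of the model that answered).

    ASSUMPTION: an example counts as "easy" when the student's top
    probability reaches `threshold`. The paper may route differently.
    """
    probs = student(x)
    if max(probs) >= threshold:
        # Easy example: the lightweight student answers on its own.
        return probs.index(max(probs)), "student"
    # Hard example: fall back to the large teacher.
    probs = teacher(x)
    return probs.index(max(probs)), "teacher"


def amortized_cost(student_cost: float, teacher_cost: float,
                   fallback_rate: float) -> float:
    """Expected per-example cost: the student always runs, and the
    teacher runs only on the fraction of examples routed to it."""
    return student_cost + fallback_rate * teacher_cost


# Toy stand-ins for real models (hypothetical, two-class outputs).
def toy_student(x: str) -> Probs:
    return [0.95, 0.05] if x == "easy" else [0.55, 0.45]


def toy_teacher(x: str) -> Probs:
    return [0.10, 0.90]


easy_result = cascade_predict("easy", toy_student, toy_teacher)
hard_result = cascade_predict("hard", toy_student, toy_teacher)
```

Under these assumptions, the cost benefit depends on how rarely the teacher is called: with a student that costs 1 unit, a teacher that costs 20 units, and 10% of examples routed to the teacher, `amortized_cost(1.0, 20.0, 0.1)` comes to about 3 units per example, compared with 20 for always running the teacher.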

