CL · Apr 26, 2021

Easy and Efficient Transformer : Scalable Inference Solution For large NLP model

arXiv:2104.12470v5 · 627 citations
Originality: Incremental advance
AI Analysis

This paper addresses the problem of deploying large NLP models efficiently in industrial applications, though the contribution appears incremental because it builds on existing optimization libraries.

The authors tackle the high inference cost of large transformer models in production by introducing Easy and Efficient Transformer (EET), which achieves an average 1.40-4.20x speedup over Faster Transformer v4.0 on the transformer decoder layer with an A100 GPU.

Recently, large-scale transformer-based models have proven effective on various tasks across many domains. Nevertheless, applying them in industrial production requires tedious and heavy work to reduce inference costs. To fill this gap, we introduce a scalable inference solution: Easy and Efficient Transformer (EET), which includes a series of transformer inference optimizations at the algorithm and implementation levels. First, we design highly optimized kernels for long inputs and large hidden sizes. Second, we propose a flexible CUDA memory manager to reduce the memory footprint when deploying a large model. Compared with the state-of-the-art transformer inference library (Faster Transformer v4.0), EET can achieve an average of 1.40-4.20x speedup on the transformer decoder layer with an A100 GPU.
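The abstract does not say how the CUDA memory manager works. As a purely illustrative sketch, and not EET's implementation, one common way to cut memory footprint is to reuse scratch buffers across layers that run one after another. The names `BufferPool` and `run_layers` are hypothetical, and host-side `bytearray` objects stand in for device memory so the example runs without a GPU.

```python
from collections import defaultdict


class BufferPool:
    """Toy pool that hands back previously released buffers of the same size."""

    def __init__(self):
        self._free = defaultdict(list)   # size in bytes -> list of idle buffers
        self.allocated_bytes = 0         # total bytes freshly allocated so far

    def acquire(self, nbytes):
        # Reuse an idle buffer of the exact size if one exists.
        if self._free[nbytes]:
            return self._free[nbytes].pop()
        self.allocated_bytes += nbytes
        return bytearray(nbytes)         # stand-in for a device allocation

    def release(self, buf):
        # Return the buffer to the pool instead of freeing it.
        self._free[len(buf)].append(buf)


def run_layers(pool, num_layers, activation_bytes):
    """Simulate sequential layers that each need one scratch buffer."""
    for _ in range(num_layers):
        scratch = pool.acquire(activation_bytes)
        # ... a real layer would write intermediate results here ...
        pool.release(scratch)
    return pool.allocated_bytes


pool = BufferPool()
total = run_layers(pool, num_layers=24, activation_bytes=1024)
print(total)  # 1024: one buffer serves all 24 layers
```

Without the pool, the same loop would allocate 24 separate buffers. Under this assumption, the saving grows with the number of layers, which is why buffer reuse matters most for large models.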

Code Implementations: 1 repo